Improve Dockerfile to include phantomjs and make the image more lightweight
Allow the snapshot API to load from_remote (fetch the latest snapshot available on GitHub), from_path (local directory), from_zip (local zip file) and from_tar (local tar.bz2 archive)
Allow fetching several documents for a given domain
Identified and improved performance bottlenecks in fetcher (language detection now uses a faster library, the async capabilities of aiohttp are better leveraged, and text extraction is faster)
Improved find_policies (it now uses a pool of process workers to parallelize the heavy lifting); the remaining bottleneck is finding URLs with Beautiful Soup.
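
The worker-pool approach mentioned for find_policies can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: extract_links and extract_links_parallel are hypothetical names, and the standard-library HTMLParser stands in for Beautiful Soup so that the sketch has no third-party dependencies.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for a Beautiful Soup query)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """CPU-bound step: parse one page and return its links in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    return collector.links


def extract_links_parallel(
    pages: Dict[str, str], executor: Optional[Executor] = None
) -> Dict[str, List[str]]:
    """Hypothetical helper: parse every page using a pool of workers.

    `pages` maps a URL to its HTML. A ProcessPoolExecutor is created when no
    executor is supplied, so parsing runs in separate processes.
    """
    own_pool = executor is None
    pool = executor if executor is not None else ProcessPoolExecutor()
    try:
        urls = list(pages)
        # map() preserves input order, so results line up with `urls`.
        results = pool.map(extract_links, [pages[u] for u in urls])
        return dict(zip(urls, results))
    finally:
        if own_pool:
            pool.shutdown()
```

When called without an executor from a script, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform. Process workers help here because HTML parsing is CPU-bound and would otherwise block the event loop used for fetching.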