ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

grab-site 2.x upgrade guide #135

Closed by ivan 5 years ago

ivan commented 5 years ago

grab-site 2.x uses ludios/wpull for much faster HTML parsing with html5-parser, and also implements faster ignore matching with the re2 module. HTML parsing and ignore matching were the two major bottlenecks identified by pyflame and flamegraph.

grab-site processes should now reconnect reliably to gs-server after gs-server goes down and reappears. Please let me know if this is not the case.

Upgrade guide

Follow the new install instructions in the README, which now require installing libxml2/libxslt/re2/pkg-config and either building Python 3.7.x with pyenv or getting it from brew.

If you have custom ignore patterns, replace {primary_netloc} with {any_start_netloc}.

Support for {primary_url} in ignore patterns was removed because it was not used anywhere, but I can add it back. Please let me know if you were using it.
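If you keep your custom ignore patterns in a text file, the two placeholder changes above can be checked mechanically. The following is a minimal sketch, not part of grab-site: the function name `migrate_patterns` is hypothetical, and it assumes one pattern per line with the placeholders spelled exactly as in this guide. It rewrites {primary_netloc} to {any_start_netloc} and reports any line that still uses the removed {primary_url}, since those need manual attention.

```python
# Hypothetical helper, not part of grab-site. Assumes one ignore pattern per
# line and placeholders spelled exactly as in the upgrade guide.
OLD_NETLOC = "{primary_netloc}"
NEW_NETLOC = "{any_start_netloc}"
REMOVED_PLACEHOLDER = "{primary_url}"


def migrate_patterns(lines):
    """Return (migrated, flagged).

    migrated: the input patterns with {primary_netloc} replaced.
    flagged:  (line_number, pattern) pairs that still use {primary_url},
              which grab-site 2.x no longer supports.
    """
    migrated = []
    flagged = []
    for number, line in enumerate(lines, start=1):
        if REMOVED_PLACEHOLDER in line:
            flagged.append((number, line))
        migrated.append(line.replace(OLD_NETLOC, NEW_NETLOC))
    return migrated, flagged


if __name__ == "__main__":
    # Made-up example patterns, for illustration only.
    sample = [
        r"^https?://{primary_netloc}/forum/",
        r"^{primary_url}\?sort=",
    ]
    new_lines, needs_review = migrate_patterns(sample)
    for pattern in new_lines:
        print(pattern)
    for number, pattern in needs_review:
        print(f"line {number} uses {REMOVED_PLACEHOLDER}: {pattern}")
```

A plain string replacement is used instead of a regular expression because the placeholders are literal text inside the patterns, so no escaping is needed.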

phantomjs support was removed in ludios/wpull; for browser-based crawls, use something else like crocoite, brozzler, or webrecorder.io.

grab-site --custom-hooks=... was removed due to major changes in the wpull 2.0 plugin interface; if you were using this, you can edit libgrabsite/wpull_hooks.py in your installation. Please let me know if you were using --custom-hooks.

Because grab-site is now faster, you may need to add a delay between requests (the --delay option) for ban-happy sites.

Please report any bugs; I probably missed something.