grab-site 2.x uses ludios/wpull for much faster HTML parsing using html5-parser, and also implements faster ignore-matching using the re2 module. These were the two major bottlenecks identified by pyflame and flamegraph.
grab-site processes should now reconnect reliably to gs-server after gs-server goes down and reappears. Please let me know if this is not the case.
Upgrade guide
Follow the new install instructions in the README, which now require installing libxml2/libxslt/re2/pkg-config and either building Python 3.7.x with pyenv or getting it from brew.
If you have custom ignore patterns, replace {primary_netloc} with {any_start_netloc}.
Support for {primary_url} in ignore patterns was removed because it was not used anywhere, but I can add it back. Please let me know if you were using it.
phantomjs support was removed in ludios/wpull; for browser-based crawls, use something else like crocoite, brozzler, or webrecorder.io.
grab-site --custom-hooks=... was removed due to major changes in the wpull 2.0 plugin interface; if you were using this, you can edit libgrabsite/wpull_hooks.py in your installation. Please let me know if you were using --custom-hooks.
Because grab-site is now faster, you may need to add a delay for ban-happy sites.
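The placeholder rename in the upgrade guide can be scripted if you keep custom ignore patterns in a text file, one pattern per line. The sketch below is only an illustration under that assumption. The function names and the file layout are hypothetical and not part of grab-site. It replaces `{primary_netloc}` with `{any_start_netloc}` and reports lines that still use the removed `{primary_url}` so you can review them by hand.

```python
from pathlib import Path

# Placeholder names taken from the upgrade notes.
OLD_PLACEHOLDER = "{primary_netloc}"
NEW_PLACEHOLDER = "{any_start_netloc}"
REMOVED_PLACEHOLDER = "{primary_url}"


def migrate_patterns(lines):
    """Return (migrated_lines, flagged_line_numbers).

    migrated_lines has OLD_PLACEHOLDER replaced by NEW_PLACEHOLDER.
    flagged_line_numbers lists 1-based lines that still contain
    REMOVED_PLACEHOLDER, which has no automatic replacement.
    """
    migrated = []
    flagged = []
    for number, line in enumerate(lines, start=1):
        if REMOVED_PLACEHOLDER in line:
            flagged.append(number)
        migrated.append(line.replace(OLD_PLACEHOLDER, NEW_PLACEHOLDER))
    return migrated, flagged


def migrate_file(path):
    """Rewrite a pattern file in place (hypothetical helper).

    Assumes a UTF-8 text file with one ignore pattern per line.
    Returns the line numbers that need manual review.
    """
    path = Path(path)
    original = path.read_text(encoding="utf-8").splitlines()
    migrated, flagged = migrate_patterns(original)
    path.write_text("\n".join(migrated) + "\n", encoding="utf-8")
    return flagged
```

Back up the pattern file before running `migrate_file` on it, since the rewrite happens in place.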
Please report any bugs; I probably missed something.