Skallwar / suckit

Suck the InTernet
Apache License 2.0

Filter links before downloading / adding to the queue #175

Closed · raphCode closed this 2 years ago

raphCode commented 2 years ago

This commit speeds up scraping in scenarios where pages have a high branching factor, i.e. contain many links, and the majority of those links are excluded by the --exclude / --include rules. It also improves memory usage in these scenarios, since the excluded links are never stored, and reduces network traffic, since they are never downloaded in the first place.
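
As a rough illustration of the idea (not the actual scraper code; the function and predicate names here are hypothetical), the filter rules are applied while a page's links are collected, so rejected URLs never reach the crawl queue:

```rust
use std::collections::VecDeque;

/// Sketch only: push a page's links into the crawl queue, dropping anything
/// rejected by the --exclude / --include rules up front.
fn enqueue_links<'a>(
    links: impl Iterator<Item = &'a str>,
    passes_filters: impl Fn(&str) -> bool, // assumed filter predicate
    queue: &mut VecDeque<String>,
) {
    for link in links {
        if passes_filters(link) {
            queue.push_back(link.to_owned());
        }
        // Rejected links are simply discarded here, saving both the memory
        // needed to hold them and the network traffic to fetch them later.
    }
}
```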

codecov[bot] commented 2 years ago

Codecov Report

Merging #175 (a14abdc) into master (84276b9) will decrease coverage by 0.27%. The diff coverage is 45.45%.

:exclamation: Current head a14abdc differs from pull request most recent head 0c0bc83. Consider uploading reports for the commit 0c0bc83 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   64.73%   64.45%   -0.28%     
==========================================
  Files          17       17              
  Lines         621      633      +12     
==========================================
+ Hits          402      408       +6     
- Misses        219      225       +6     
Impacted Files    Coverage Δ
src/args.rs       0.00% <ø> (ø)
src/scraper.rs    24.74% <45.45%> (+1.66%) :arrow_up:

raphCode commented 2 years ago

I thought a bit more, and what we want is separate regex filters for downloading and for visiting (see the sketch below).
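
A minimal sketch of how separate visit/download filters could look; the struct, field names, and exclude-wins precedence are my assumptions for illustration, not necessarily what this PR implements:

```rust
use regex::Regex;

/// Hypothetical container for the two filter pairs.
struct Filters {
    include_visit: Option<Regex>,
    exclude_visit: Option<Regex>,
    include_download: Option<Regex>,
    exclude_download: Option<Regex>,
}

impl Filters {
    /// An exclude match always wins; otherwise an include regex (if given) must match.
    fn allowed(url: &str, include: &Option<Regex>, exclude: &Option<Regex>) -> bool {
        if exclude.as_ref().map_or(false, |re| re.is_match(url)) {
            return false;
        }
        include.as_ref().map_or(true, |re| re.is_match(url))
    }

    /// Should this page be fetched and scanned for further links?
    fn should_visit(&self, url: &str) -> bool {
        Self::allowed(url, &self.include_visit, &self.exclude_visit)
    }

    /// Should this URL be saved to disk?
    fn should_download(&self, url: &str) -> bool {
        Self::allowed(url, &self.include_download, &self.exclude_download)
    }
}
```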

If this is merged, I advise using a squash commit to hide the intermediate development.

raphCode commented 2 years ago

I always fail to think of useful tests. How would you check that some links are not visited?

I could just duplicate the existing tests in filters.rs and make them use the visit filters instead of the download filters, relying on the side effect that unvisited links are not downloaded as well. But I think this does not capture the distinction that is intended with those options.

Skallwar commented 2 years ago

How would you check that some links are not visited?

You could put a valid link (one that should be downloaded) inside a page that should not be visited, and then check whether that link was downloaded or not.
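
A hedged sketch of such a test; the --exclude-visit flag name, the fixture layout, the local server address, and the walkdir/tempfile dev-dependencies are assumptions for illustration, not suckit's actual test harness:

```rust
// Assumed fixture: /excluded/index.html links to /excluded/asset.txt,
// served by a local test server on port 8000.
#[test]
fn link_in_unvisited_page_is_not_downloaded() {
    let output = tempfile::tempdir().unwrap();
    let status = std::process::Command::new(env!("CARGO_BIN_EXE_suckit"))
        .args([
            "http://127.0.0.1:8000",
            "-o",
            output.path().to_str().unwrap(),
            "--exclude-visit", // assumed flag introduced by this PR
            "excluded",
        ])
        .status()
        .expect("failed to run suckit");
    assert!(status.success());

    // The asset is only reachable through the excluded page, so it must not
    // have been saved anywhere in the output directory.
    let downloaded = walkdir::WalkDir::new(output.path())
        .into_iter()
        .filter_map(Result::ok)
        .any(|entry| entry.file_name().to_str() == Some("asset.txt"));
    assert!(!downloaded);
}
```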

raphCode commented 2 years ago

I basically just duplicated the existing tests and adjusted them to use the visit filter regexes.

telugu-boy commented 2 years ago

Any update on this? It's almost a bug without this feature at this point; --exclude and --include are not doing what they are supposed to 😢

Skallwar commented 2 years ago

Sorry for the delay, I will review this tomorrow

Skallwar commented 2 years ago

Hmm, the CI is not running

Skallwar commented 2 years ago

We are in this case: https://github.community/t/missing-approve-and-run-button/200572. I will merge this in the meantime; hopefully GitHub will fix this.

Skallwar commented 2 years ago

The CI works perfectly on master, all good :rocket: :fire: