raphCode closed this pull request 2 years ago
Merging #175 (a14abdc) into master (84276b9) will decrease coverage by 0.27%. The diff coverage is 45.45%.

:exclamation: Current head a14abdc differs from pull request most recent head 0c0bc83. Consider uploading reports for the commit 0c0bc83 to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   64.73%   64.45%   -0.28%
==========================================
  Files          17       17
  Lines         621      633      +12
==========================================
+ Hits          402      408       +6
- Misses        219      225       +6
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/args.rs | 0.00% <ø> (ø) | |
| src/scraper.rs | 24.74% <45.45%> (+1.66%) | :arrow_up: |
I thought a bit more, and what we want is separate regex filters for downloading and visiting. To put it in other words:

- `--in/exclude-download` alone represents the traditional behavior from before this PR: visit all pages, download what matches.
- `--in/exclude-download` with `--visit-filter-is-download-filter` behaves like the initial commit in this PR: pages that won't be downloaded are not visited.
- `--in/exclude-download` with `--in/exclude-visit`: separate filters for visiting and downloading. Of course, only visited pages can be downloaded.

If this is merged, I advise using a squash commit to hide the intermediate development.
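To make the three modes above concrete, here is a minimal sketch of the proposed semantics. All names (`Filters`, `should_visit`, `should_download`, the fields) are illustrative, not the actual suckit API, and plain substring matching stands in for the real regex filters:

```rust
/// Illustrative sketch of the filter semantics discussed above.
/// The real tool compiles regexes; `contains` stands in for a regex match.
struct Filters {
    exclude_download: Option<String>,
    exclude_visit: Option<String>,
    visit_filter_is_download_filter: bool,
}

impl Filters {
    fn should_download(&self, url: &str) -> bool {
        match &self.exclude_download {
            Some(pat) => !url.contains(pat.as_str()),
            None => true,
        }
    }

    fn should_visit(&self, url: &str) -> bool {
        if self.visit_filter_is_download_filter {
            // Mode 2: skip pages that won't be downloaded anyway.
            return self.should_download(url);
        }
        match &self.exclude_visit {
            // Mode 3: an independent visit filter.
            Some(pat) => !url.contains(pat.as_str()),
            // Mode 1 (traditional): visit everything, filter only downloads.
            None => true,
        }
    }
}
```

With `visit_filter_is_download_filter` off and no visit filter set, an excluded URL is still visited but never downloaded, which matches the pre-PR behavior.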
I always fail to think of useful tests. How would you check that some links are not visited?
I could just duplicate the existing tests in `filters.rs` and make them use the visit filters instead of the download filters, relying on the side effect that unvisited links are not downloaded either.
But I think this does not capture the distinction that is intended with those options.
> How would you check that some links are not visited?
You could put a valid link (one that should be downloaded) inside a page that should not be visited, and then check whether the valid link was downloaded.
Basically I just duplicated the existing tests and adjusted them to use the visit filter regex.
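The test idea above can be sketched with a tiny in-memory crawl, without any real HTTP server. Everything here is hypothetical (the `crawl` function and the site map are not the actual test harness), and substring matching stands in for the exclusion regexes:

```rust
use std::collections::{HashMap, HashSet};

// Simulated crawl over an in-memory "site": page URL -> links it contains.
// A downloadable link placed inside a page that the visit filter excludes
// must never be downloaded, because it is never even discovered.
fn crawl(
    site: &HashMap<&str, Vec<&str>>,
    start: &str,
    visit_excluded: &str,    // stand-in for the --exclude-visit regex
    download_excluded: &str, // stand-in for the --exclude-download regex
) -> HashSet<String> {
    let mut downloaded = HashSet::new();
    let mut seen = HashSet::new();
    let mut queue = vec![start.to_string()];
    while let Some(url) = queue.pop() {
        if !seen.insert(url.clone()) || url.contains(visit_excluded) {
            continue; // excluded pages are never visited
        }
        if !url.contains(download_excluded) {
            downloaded.insert(url.clone());
        }
        for link in site.get(url.as_str()).into_iter().flatten() {
            queue.push((*link).to_string());
        }
    }
    downloaded
}
```

Asserting that the asset behind the excluded page is absent from `downloaded` captures exactly the distinction between the visit and download filters.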
Any update on this? It's almost a bug without the feature at this point: `--exclude` and `--include` are not doing what they are supposed to 😢
Sorry for the delay, I will review this tomorrow
Hmm, the CI is not running.
We are in this case... https://github.community/t/missing-approve-and-run-button/200572 I will merge this in the meantime; hopefully GitHub will fix this.
The CI works perfectly on master, all good :rocket: :fire:
This commit speeds up scraping in scenarios where pages have a high branching factor, i.e. many links, most of which are excluded by the --exclude / --include rules. It also improves memory usage in these scenarios, since excluded links are not stored, and reduces network traffic, since they are never downloaded in the first place.
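The core of the optimization can be sketched in a few lines: apply the exclusion filter at link-discovery time, so excluded URLs are never stored in the queue or fetched at all. The function name is illustrative, not the actual scraper code, and substring matching stands in for the compiled regex:

```rust
// Illustrative only: filter at discovery time instead of download time,
// so excluded links are never queued, stored, or fetched.
fn collect_links<'a>(found: &[&'a str], exclude: &str) -> Vec<&'a str> {
    found
        .iter()
        .copied()
        .filter(|url| !url.contains(exclude)) // real tool matches a Regex
        .collect()
}
```

For a page with thousands of links where most are excluded, this keeps both the work queue and the memory footprint proportional to the links that actually match.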