Skallwar / suckit

Suck the InTernet
Apache License 2.0
735 stars, 38 forks

Introduces external depth (#74) & a few fixes (incl. #69) #146

Closed marchellodev closed 2 years ago

marchellodev commented 3 years ago

Motivation

A lot of modern websites rely on external domains (usually CDN domains) for their CSS, JS, images, and other resources. SuckIT does not yet support downloading data from external domains, so it is impossible to properly download big and complex websites (#74). The one exception was a bug where //en.wikipedia.org was treated as a relative path, which I fixed.
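As an illustration only (this is not the code from this PR), the distinction between relative, same-host, and external links could be sketched with the standard library alone. The names `LinkKind`, `host_of`, and `classify` are hypothetical, and the host extraction is deliberately simplified: it does not handle ports, user info, or schemes other than http and https.

```rust
/// How a link found in a page relates to the site being downloaded.
#[derive(Debug, PartialEq)]
enum LinkKind {
    /// Path resolved against the current page, e.g. `/static/app.css`.
    Relative,
    /// Absolute or scheme-relative link that points at the origin host.
    SameHost,
    /// Absolute or scheme-relative link that points at another host.
    External,
}

/// Extracts the host part of `http://`, `https://`, or `//` links.
/// Returns `None` for anything else, which the caller treats as relative.
fn host_of(link: &str) -> Option<&str> {
    let rest = link
        .strip_prefix("https://")
        .or_else(|| link.strip_prefix("http://"))
        .or_else(|| link.strip_prefix("//"))?;
    // The host ends at the first path, query, or fragment delimiter.
    let end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let host = &rest[..end];
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Classifies `link` relative to the host the download started from.
fn classify(link: &str, origin_host: &str) -> LinkKind {
    match host_of(link) {
        None => LinkKind::Relative,
        Some(host) if host.eq_ignore_ascii_case(origin_host) => LinkKind::SameHost,
        Some(_) => LinkKind::External,
    }
}
```

With this sketch, `classify("//en.wikipedia.org/wiki/Rust", "example.com")` yields `LinkKind::External` rather than being mistaken for a relative path, while `classify("/static/app.css", "example.com")` stays `LinkKind::Relative`.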

Also, this patch fixes a panic when trying to parse URLs like ///tools.wmflabs.org/, which return an Empty host error. I encountered this while trying to download Wikipedia, so I think this PR should also close #69.
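One way to avoid an empty host for such links is to collapse the run of leading slashes before building an absolute URL. The following is a minimal sketch under that assumption, not the actual change in this PR; `absolutize_slashes` and its `page_scheme` parameter are hypothetical names, and treating `///host/` the same as `//host/` is a design choice of the sketch.

```rust
/// Rewrites a link that starts with two or more slashes into an absolute
/// URL string, reusing the scheme of the page the link was found on.
/// Returns `None` when the link is not of that form or no host remains.
fn absolutize_slashes(link: &str, page_scheme: &str) -> Option<String> {
    if !link.starts_with("//") {
        // Ordinary relative or fully absolute link: leave it to the caller.
        return None;
    }
    // Collapse the run of leading slashes: "///host/path" -> "host/path".
    let rest = link.trim_start_matches('/');
    if rest.is_empty() {
        // The link consisted only of slashes, so there is no host to use.
        return None;
    }
    Some(format!("{}://{}", page_scheme, rest))
}
```

Returning `None` instead of unwrapping a parse result lets the caller skip a malformed link rather than panic.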

Notes

I have almost no experience with Rust, and I haven't yet implemented tests for the changes (I'm not really sure what the best way to do this is). So, please look at the code with extra scrutiny :). However, I have tested it on a few websites, and everything seems to work properly.

Also, --edepth (external depth) does not have a short option, since -e is already used for the exclude pattern. I'm not sure how this parameter should be renamed so that a short option can exist.
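To make the idea of a separate external depth concrete, here is a rough sketch of how two independent limits might be checked while crawling. The exact semantics of --edepth are not spelled out in this thread, so everything below is an assumption: `Limits`, `Position`, and `step` are hypothetical names, `None` is taken to mean "unlimited", and the external counter simply increases by one for every external link followed.

```rust
/// Where a page sits in the crawl (hypothetical bookkeeping).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    /// Number of links followed since the start page.
    depth: u32,
    /// Number of those links that pointed at an external domain.
    external_depth: u32,
}

/// Assumed crawl limits; `None` means unlimited.
struct Limits {
    max_depth: Option<u32>,
    max_external_depth: Option<u32>,
}

impl Limits {
    /// Returns the position of the linked page, or `None` if following the
    /// link would exceed either limit.
    fn step(&self, from: Position, link_is_external: bool) -> Option<Position> {
        let next = Position {
            depth: from.depth + 1,
            external_depth: from.external_depth + u32::from(link_is_external),
        };
        // A value is acceptable when there is no limit or it does not exceed it.
        let within = |value: u32, limit: Option<u32>| limit.map_or(true, |max| value <= max);
        if within(next.depth, self.max_depth)
            && within(next.external_depth, self.max_external_depth)
        {
            Some(next)
        } else {
            None
        }
    }
}
```

Under these assumptions, an external limit of 0 means no external link is ever followed, while a limit of 1 allows fetching resources from a CDN without crawling further into other external domains.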

Skallwar commented 3 years ago

You need to fix the coding style (use rustfmt)

Skallwar commented 3 years ago

Please rebase on top of master and squash into one or two commits

Skallwar commented 2 years ago

@marchellodev Can you still work on this? If not, I can rebase and add tests for you

marchellodev commented 2 years ago

@Skallwar I would really appreciate it! :) I tried to do that a few days ago, but I always stumbled upon some errors. I'm kinda new to git, especially to those sophisticated operations

Thanks again!

codecov[bot] commented 2 years ago

Codecov Report

Merging #146 (64af90d) into master (1be1f85) will increase coverage by 0.07%. The diff coverage is 52.17%.


@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
+ Coverage   62.54%   62.62%   +0.07%     
==========================================
  Files          16       17       +1     
  Lines         558      610      +52     
==========================================
+ Hits          349      382      +33     
- Misses        209      228      +19     
Impacted Files       Coverage Δ
src/args.rs          0.00% <ø> (ø)
src/disk.rs          0.00% <ø> (ø)
src/scraper.rs       12.28% <0.00%> (-1.54%)
src/downloader.rs    72.89% <100.00%> (ø)
tests/external.rs    100.00% <100.00%> (ø)

Skallwar commented 2 years ago

@CohenArthur are you ok with this?

Skallwar commented 2 years ago

@marchellodev Thanks again, excellent work