Skallwar / suckit

Suck the InTernet
Apache License 2.0
728 stars 40 forks source link

Support to pull URLs from CSS #213

Closed dsgallups closed 1 year ago

dsgallups commented 1 year ago

This PR fixes #68 , #212 , #142

I did not add a test for this new code. I tested this by pulling scraping resources from https://www.purdue.edu/itap/iamo/p/login.css with a depth of 1. This can be tried by executing:

cargo run https://www.purdue.edu/itap/iamo/p/login.css --depth 1

I have added a new struct named FileType which handles html, css, or other. This is a drop in replacement for the previously used is_html function in parser.rs.

Edit: There is a slight issue, where if the request for the CSS file is 302, and that CSS file has a relative URL path, then the parser will not update this accordingly. Looking to fix.

Edit 2: Auditing this bug, it appears that other files will suffer from not being to download relative dependencies on a 302 redirect. Perhaps this should go into another PR? (e.g. request /dir1/dir2/hello.html 302 -> /dir1/hello.html, hello.html contains dependency mystuff.css, where its real path is at /dir1/mystuff.css)

Skallwar commented 1 year ago

Thanks you very much! I won't have time to review this in the next couple of days, but be sure that I'm really excited about this and I will give it a spin and a review as soon as I can!

codecov[bot] commented 1 year ago

Codecov Report

Merging #213 (a965cfa) into master (bcfbacf) will decrease coverage by 1.73%. The diff coverage is 10.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #213      +/-   ##
==========================================
- Coverage   30.86%   29.13%   -1.73%     
==========================================
  Files          18       18              
  Lines        3279     3501     +222     
==========================================
+ Hits         1012     1020       +8     
- Misses       2267     2481     +214     
Impacted Files Coverage Δ
src/lib.rs 19.41% <ø> (-1.08%) :arrow_down:
src/response.rs 27.31% <ø> (ø)
src/scraper.rs 19.43% <0.00%> (-5.06%) :arrow_down:
src/downloader.rs 69.36% <53.84%> (-2.33%) :arrow_down:
dsgallups commented 1 year ago

Another item is the use of __querystring__ in line 13 of url_helper.rs. This may panic on an instance of a request where the parameter utilizes special characters like = or ;...maybe I should adjust for url encoding? e.g.: thread '<unnamed>' panicked at 'Couldn't create folder \...some directory\fonts.googleapis.com\css__querystring__family=Archivo+Narrow:400,700: The directory name is invalid. (os error 267)', logger.rs:42:9

Skallwar commented 1 year ago

maybe I should adjust for url encoding?

That's a good solution

dsgallups commented 1 year ago

Unfortunately I've become way too busy to get back to this. I hope to revisit this in the future! Such a useful tool...if I get a chance, I'll be sure to open another pr.

Skallwar commented 1 year ago

No problem, I totally get it. You went way further than I never did with this. Thanks you very much :)

Skallwar commented 6 months ago

I gave this a spin again. I managed to fix an old issue where books.toscrape.com did not have the star ratings because we where missing a font so I'm quite happy with that.

I'm trying to fix this:

Another item is the use of querystring in line 13 of url_helper.rs. This may panic on an instance of a request where the parameter utilizes special characters like = or ;...maybe I should adjust for url encoding? e.g.: thread '' panicked at 'Couldn't create folder ...some directory\fonts.googleapis.com\cssquerystringfamily=Archivo+Narrow:400,700: The directory name is invalid. (os error 267)', logger.rs:42:9

but cargo run https://www.purdue.edu/itap/iamo/p/login.css --depth 1 does not seems to provide this panick anymore