Closed dsgallups closed 1 year ago
Thanks you very much! I won't have time to review this in the next couple of days, but be sure that I'm really excited about this and I will give it a spin and a review as soon as I can!
Merging #213 (a965cfa) into master (bcfbacf) will decrease coverage by
1.73%
. The diff coverage is10.93%
.
@@ Coverage Diff @@
## master #213 +/- ##
==========================================
- Coverage 30.86% 29.13% -1.73%
==========================================
Files 18 18
Lines 3279 3501 +222
==========================================
+ Hits 1012 1020 +8
- Misses 2267 2481 +214
Impacted Files | Coverage Δ | |
---|---|---|
src/lib.rs | 19.41% <ø> (-1.08%) |
:arrow_down: |
src/response.rs | 27.31% <ø> (ø) |
|
src/scraper.rs | 19.43% <0.00%> (-5.06%) |
:arrow_down: |
src/downloader.rs | 69.36% <53.84%> (-2.33%) |
:arrow_down: |
Another item is the use of __querystring__
in line 13
of url_helper.rs
. This may panic on an instance of a request where the parameter utilizes special characters like =
or ;
...maybe I should adjust for url encoding?
e.g.: thread '<unnamed>' panicked at 'Couldn't create folder \...some directory\fonts.googleapis.com\css__querystring__family=Archivo+Narrow:400,700: The directory name is invalid. (os error 267)', logger.rs:42:9
maybe I should adjust for url encoding?
That's a good solution
Unfortunately I've become way too busy to get back to this. I hope to revisit this in the future! Such a useful tool...if I get a chance, I'll be sure to open another pr.
No problem, I totally get it. You went way further than I never did with this. Thanks you very much :)
I gave this a spin again. I managed to fix an old issue where books.toscrape.com did not have the star ratings because we where missing a font so I'm quite happy with that.
I'm trying to fix this:
Another item is the use of querystring in line 13 of url_helper.rs. This may panic on an instance of a request where the parameter utilizes special characters like = or ;...maybe I should adjust for url encoding? e.g.: thread '
' panicked at 'Couldn't create folder ...some directory\fonts.googleapis.com\cssquerystringfamily=Archivo+Narrow:400,700: The directory name is invalid. (os error 267)', logger.rs:42:9
but cargo run https://www.purdue.edu/itap/iamo/p/login.css --depth 1
does not seems to provide this panick anymore
This PR fixes #68 , #212 , #142
I did not add a test for this new code. I tested this by pulling scraping resources from https://www.purdue.edu/itap/iamo/p/login.css with a depth of 1. This can be tried by executing:
cargo run https://www.purdue.edu/itap/iamo/p/login.css --depth 1
I have added a new struct named
FileType
which handles html, css, or other. This is a drop in replacement for the previously usedis_html
function inparser.rs
.Edit: There is a slight issue, where if the request for the CSS file is 302, and that CSS file has a relative URL path, then the parser will not update this accordingly. Looking to fix.
Edit 2: Auditing this bug, it appears that other files will suffer from not being to download relative dependencies on a 302 redirect. Perhaps this should go into another PR? (e.g. request /dir1/dir2/hello.html 302 -> /dir1/hello.html, hello.html contains dependency mystuff.css, where its real path is at /dir1/mystuff.css)