Skallwar / suckit

Suck the InTernet
Apache License 2.0

Remove fragment hash from scraped URLs #99

Closed: pjsier closed this issue 3 years ago

pjsier commented 3 years ago

Scraping a web page with multiple links to specific sections of another page (e.g. "/#section") results in duplicate downloads of the same page, because the URLs differ only in their fragment. According to the docs for the fragment method, this portion of the URL isn't typically sent to the server, and as I understand it, the fragment would only matter for client-side updates that wouldn't be tracked here anyway.

Should the fragment be removed from scraped URLs to avoid these duplicates? If so, I can put in a PR for that.
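
To illustrate the idea, here is a minimal sketch using only the standard library. It is not suckit's actual code: `strip_fragment` and `dedup_without_fragments` are hypothetical helpers, and the URLs are treated as plain strings rather than parsed values. With the `url` crate, calling `set_fragment(None)` on a `Url` should have the same effect on a parsed URL.

```rust
use std::collections::HashSet;

/// Hypothetical helper: return `url` without its fragment,
/// i.e. without the first '#' and everything after it.
fn strip_fragment(url: &str) -> &str {
    match url.find('#') {
        Some(idx) => &url[..idx],
        None => url,
    }
}

/// Hypothetical helper: keep the first occurrence of each URL
/// once fragments are ignored, preserving the original order.
fn dedup_without_fragments(urls: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for &url in urls {
        let stripped = strip_fragment(url);
        // `insert` returns false when the stripped URL was already seen.
        if seen.insert(stripped) {
            unique.push(stripped.to_string());
        }
    }
    unique
}
```

With this, links such as `https://example.com/#a` and `https://example.com/#b` collapse to a single entry, so the page would only be queued for download once.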

CohenArthur commented 3 years ago

Sure, that's a good catch. Feel free to open a PR :+1: