Skallwar / suckit

Suck the InTernet
Apache License 2.0

Scraping URLs from JavaScript and CSS #142

Open marchellodev opened 3 years ago

marchellodev commented 3 years ago

Currently, the main issue that prevents suckit from downloading large websites with a lot of JavaScript is its inability to scrape links from JS and CSS as well as from HTML.

I think JS (as well as CSS) is too complicated to parse, so we could just use regular expressions to find URLs, add them to the queue, and then replace them with the local format.
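Something rough like this is what I have in mind, using the `regex` crate. The pattern below is a deliberately crude placeholder and doesn't handle relative URLs, CSS `url(...)` forms or escaped quotes:

```rust
use regex::Regex;

// Rough sketch: pull absolute URL candidates out of a JS or CSS blob.
// The pattern is intentionally crude; relative URLs, CSS url(...) forms and
// escaped quotes would all need extra handling.
fn find_url_candidates(source: &str) -> Vec<&str> {
    let url_re = Regex::new(r#"https?://[^\s"')]+"#).unwrap();
    url_re.find_iter(source).map(|m| m.as_str()).collect()
}
```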

Thanks for such an amazing piece of software!

Related to #68 #70

Skallwar commented 3 years ago

Thanks for reporting this. Regex might indeed be the way to go. If you are interested in doing it, feel free to assign yourself and open a PR :smiley:

> Thanks for such an amazing piece of software!

Thanks a lot, this means a lot to @CohenArthur and me.

marchellodev commented 3 years ago

@Skallwar I'd love to!

It seems like `find_urls_as_strings()` returns a mutable list of all the URLs; you can change them, and the change is instantly reflected in the DOM via the kuchiki library. I tried to select every `<script>` element and then find the URLs inside via regex, but I am not sure how I can do the same thing here, i.e. return a mutable string that will stay attached to the DOM.
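Here is roughly what I tried (same crude URL pattern as above). As far as I can tell, the script body lives in child text nodes, which kuchiki exposes as `RefCell<String>` via `as_text()`, so it can be edited in place, but that is a different shape from the mutable attribute references that `find_urls_as_strings()` hands back:

```rust
use kuchiki::traits::TendrilSink;
use regex::Regex;

// Rough sketch: rewrite URLs found inside <script> text nodes in place.
fn rewrite_script_urls(html: &str) -> kuchiki::NodeRef {
    let document = kuchiki::parse_html().one(html);
    let url_re = Regex::new(r#"https?://[^\s"')]+"#).unwrap();

    if let Ok(scripts) = document.select("script") {
        for script in scripts {
            for child in script.as_node().children() {
                // Script bodies are text nodes; kuchiki stores their contents
                // in a RefCell<String>, so they can be mutated directly.
                if let Some(text) = child.as_text() {
                    let mut code = text.borrow_mut();
                    // Placeholder replacement: the real code would map each
                    // URL to the local path of the downloaded file.
                    let rewritten = url_re
                        .replace_all(code.as_str(), "LOCAL_PATH")
                        .into_owned();
                    *code = rewritten;
                }
            }
        }
    }
    document
}
```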

I think we should refactor the code a bit. Instead of returning all the strings, then filtering them out, and then changing them to the paths of the downloaded files, we should first find a single URL, check whether it belongs to the domain, and then change it right away, all in the same method, even before adding the URL to the queue. This would offer much more flexibility for getting URLs not only from HTML, but also from CSS, JS, and other types of files. What do you think?
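Roughly, the shape I am imagining (reusing the crude URL pattern from above) would let the same single-pass rewrite work on HTML, CSS or JS text alike:

```rust
use regex::Regex;
use url::Url;

// Rough sketch of a single-pass rewrite: each URL is resolved, filtered and
// replaced with its local path in one step, instead of being collected first
// and patched later. `to_local` returns Some(local_path) for URLs we want to
// download and None for everything else.
fn rewrite_urls<F>(source: &str, base: &Url, mut to_local: F) -> String
where
    F: FnMut(&Url) -> Option<String>,
{
    let url_re = Regex::new(r#"https?://[^\s"')]+"#).unwrap();
    url_re
        .replace_all(source, |caps: &regex::Captures| {
            let raw = &caps[0];
            // Resolve against the page URL, then let the caller decide
            // whether this URL belongs to the crawl and where it lands.
            match base.join(raw).ok().and_then(|abs| to_local(&abs)) {
                Some(local) => local,
                None => raw.to_string(),
            }
        })
        .into_owned()
}
```

The caller would enqueue the URL for download inside `to_local`, so discovery, filtering and rewriting all happen in one pass over the file.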

Skallwar commented 3 years ago

I'm not sure regexes are the way to go. They're slow, and matching both absolute and relative URLs would be quite hard.

I think there are some good CSS parsers out there based on Servo, like this one or this one.

JS might be trickier.