Open marchellodev opened 3 years ago
Thanks for reporting this. Regex might indeed be the way to go. If you are interested in doing it, feel free to assign yourself and open a PR :smiley:
> Thanks for such an amazing piece of software!
Thanks a lot, this means a lot to @CohenArthur and me.
@Skallwar I'd love to!
It seems like `find_urls_as_strings()` returns a mutable list of all the URLs; any change you make to them is instantly reflected in the DOM via the kuchiki library. I am trying to select every `<script>` element and then find the URLs inside it via regex, but I am not sure how to do the same here, i.e. return a mutable string that stays attached to the DOM.
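To make the "find URLs inside a `<script>` body" step concrete, here is a minimal sketch. A real patch would probably use the `regex` crate; this stdlib-only version, and the `find_url_spans` name, are just illustrative assumptions, not suckit's actual API:

```rust
// Hypothetical helper: scan a <script> body for http(s) URLs and
// return their byte spans. A real implementation would likely use
// the `regex` crate; this stdlib-only sketch just shows the idea.
fn find_url_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    for (start, _) in text.match_indices("http") {
        let rest = &text[start..];
        if !(rest.starts_with("http://") || rest.starts_with("https://")) {
            continue; // "http" appeared inside some other token
        }
        // Treat the URL as ending at the first quote, whitespace,
        // bracket, or parenthesis.
        let len = rest
            .find(|c: char| c.is_whitespace() || "\"'()<>".contains(c))
            .unwrap_or(rest.len());
        spans.push((start, start + len));
    }
    spans
}

fn main() {
    let js = r#"fetch("https://example.com/api"); var s = 'http://example.com/x.js';"#;
    for (a, b) in find_url_spans(js) {
        println!("{}", &js[a..b]);
    }
}
```

Returning spans rather than owned strings is one way to keep the caller able to mutate the original text in place, which is the part that is awkward with the current string-returning API.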
I think we should refactor the code a bit. Instead of returning all the strings, filtering them, and then changing them to the paths of the downloaded files, we should probably find a single URL, check whether it belongs to the domain, and change it right away, all in the same method, even before adding the URL to the queue. That would offer much more flexibility for extracting URLs not only from HTML but also from CSS, JS, and other file types. What do you think?
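The single-pass flow proposed here (check the domain, rewrite, and enqueue in one place) might look roughly like this. `process_url`, `to_local_path`, and the `Vec`-based queue are all hypothetical names, just to make the control flow concrete:

```rust
// Hypothetical sketch of the proposed refactor: for each URL found,
// decide in one place whether it is in-domain, rewrite it to a local
// path, and enqueue it for download, before moving to the next URL.
fn to_local_path(url: &str) -> String {
    // Illustrative mapping only; suckit's real on-disk scheme may differ.
    url.replace("://", "/").replace('/', "_")
}

fn process_url(url: &str, domain: &str, queue: &mut Vec<String>) -> Option<String> {
    if !url.contains(domain) {
        return None; // External URL: leave it untouched.
    }
    queue.push(url.to_string()); // Schedule the download.
    Some(to_local_path(url)) // Rewritten form for the caller to splice in.
}

fn main() {
    let mut queue = Vec::new();
    let rewritten = process_url("https://example.com/app.js", "example.com", &mut queue);
    println!("{:?} {:?}", rewritten, queue);
}
```

Because the decision and the rewrite happen in the same method, the same entry point can serve an HTML walker, a CSS scanner, or a JS scanner without each of them reimplementing the filtering.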
Currently, the main issue that prevents suckit from downloading large websites with a lot of JavaScript is its inability to scrape links from JS and CSS as well as from HTML.
I think JS (as well as CSS) is too complicated to parse properly, so we could just use regular expressions to find URLs, add them to the queue, and then replace them with the local format.
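The "find URLs, queue them, replace with the local format" pass over raw JS or CSS text could be sketched as below. Again, a real patch would probably reach for the `regex` crate; this stdlib-only scan, the `rewrite_urls` name, and the flattened local-path format are assumptions for illustration:

```rust
// Hypothetical end-to-end pass over JS/CSS text: find each http(s)
// URL, enqueue the in-domain ones, and splice a local path into the
// output text in their place.
fn rewrite_urls(text: &str, domain: &str, queue: &mut Vec<String>) -> String {
    let mut out = String::new();
    let mut pos = 0;
    while let Some(rel) = text[pos..].find("http") {
        let start = pos + rel;
        let rest = &text[start..];
        if !(rest.starts_with("http://") || rest.starts_with("https://")) {
            // "http" inside some other token: copy it and keep scanning.
            out.push_str(&text[pos..start + 4]);
            pos = start + 4;
            continue;
        }
        // The URL ends at the first quote, whitespace, or bracket.
        let len = rest
            .find(|c: char| c.is_whitespace() || "\"'()<>".contains(c))
            .unwrap_or(rest.len());
        let url = &rest[..len];
        out.push_str(&text[pos..start]);
        if url.contains(domain) {
            queue.push(url.to_string());
            // Illustrative "local format": scheme and slashes flattened.
            out.push_str(&url.replace("://", "/").replace('/', "_"));
        } else {
            out.push_str(url); // External URL: keep as-is.
        }
        pos = start + len;
    }
    out.push_str(&text[pos..]);
    out
}

fn main() {
    let mut q = Vec::new();
    let css = "body { background: url('https://example.com/bg.png'); }";
    println!("{}", rewrite_urls(css, "example.com", &mut q));
}
```

The same function would work unchanged on a `<script>` body, a `.js` file, or a stylesheet, since it treats the input as plain text rather than parsing it.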
Thanks for such an amazing piece of software!
Related to #68 #70