ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Unexpected recursion due to parsing HTML on an expected script #447

Open JustAnotherArchivist opened 4 years ago

AB job 2wnp9udy4kgtw7uoaa3wybav3 just ran into an interesting recursion issue. The URL list contains https://bit.ly/1aD99Xo, which resolves to https://www.bookpeople.com/event/kenny-rogers-meet-greet-what-are-chances. From there, wpull went to https://www.bookpeople.com/event/misc/jquery.js, a URL extracted from Drupal's JS block listing scripts. For this URL, wpull expected a script (link_type = "javascript"), but the response is a 404 HTML page, because the actual script lies at /misc/jquery.js instead. This is where things blew up: despite expecting a script, wpull parsed the response as HTML and treated all links on it as inline resources, recursing further.

At first glance, it seems much saner to invoke only the JS parser (if applicable) on expected scripts. However, this gets complicated if the same URL appears twice during a crawl, once in a script context and once as a link. Further, "javascript" in this context includes anything URL-like discovered inside a script block, and on script-heavy websites such blocks can of course contain URLs of HTML pages. So I'm not sure whether this is worth fixing, or whether a fix is even possible.
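To make the trade-off concrete, here is a minimal sketch of one possible heuristic. It is not wpull's actual code or API: the function name `choose_parsers`, its parameters, and the parser labels are all hypothetical. The assumption is that for a URL expected to be a script, an HTML error response is not scraped at all, while a successful HTML response still is (since URLs found in script blocks can legitimately point at pages). It does not address the case where the same URL is seen in both a script and a link context.

```python
from typing import List, Optional

# Hypothetical media type sets; not taken from wpull.
HTML_TYPES = {"text/html", "application/xhtml+xml"}
JS_TYPES = {"application/javascript", "text/javascript", "application/x-javascript"}


def choose_parsers(expected_link_type: Optional[str],
                   content_type: str,
                   status_code: int) -> List[str]:
    """Return the (hypothetical) parser labels to run on a response.

    expected_link_type mirrors the link_type value mentioned above,
    e.g. "javascript", or None when nothing specific is expected.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    is_html = media_type in HTML_TYPES
    is_js = media_type in JS_TYPES
    is_error = status_code >= 400

    if expected_link_type == "javascript":
        if is_js:
            return ["javascript"]
        # A successful HTML response may be a real page whose URL was
        # found inside a script block, so it is still scraped.
        if is_html and not is_error:
            return ["html"]
        # HTML error page (or anything else) where a script was
        # expected: do not scrape, so the crawl does not recurse from it.
        return []

    # No script expectation: fall back to the actual content type.
    if is_html:
        return ["html"]
    if is_js:
        return ["javascript"]
    return []
```

Under this sketch, the 404 HTML response for /event/misc/jquery.js would yield no parsers, while the same 404 page reached through an ordinary link would still be scraped as HTML.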