ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

requesting non-existent url #462

Closed Billybangleballs closed 4 years ago

Billybangleballs commented 4 years ago

Direct from my site log: everything goes swimmingly, and then the bot requests /xyz.js, which doesn't exist, and gets a 404. There is a file of that name, but not in the root, and the HTML of / is correct and doesn't refer to xyz.js as being under /. There is obviously a parsing error happening somewhere in the bot's code. I won't quote the HTML in this report, as it is publicly available to anyone who wants to look at it. It has to be the bot; otherwise the log would be full of 404 lines generated by other users, and it isn't.

JustAnotherArchivist commented 4 years ago

That's a known problem and unfortunately impossible to solve without heavy machinery.

What happens here is that wpull tries to extract URLs from script blocks. Anything that looks like it could be a relative URL will be extracted. In this case, that includes the string xyz.js from this snippet:

    g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'xyz.js'; s.parentNode.insertBefore(g,s);

Getting the actual URL, //www.myintarweb.co.uk/analyticks/xyz.js, is virtually impossible without running an actual JavaScript engine, which effectively means running the archival with a browser (as you'll also need the DOM, cookies, and various other things). That's possible (brozzler, crocoite, and others), but it's very resource-intensive and doesn't scale well.
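To make the failure mode concrete, here is a minimal sketch of this kind of heuristic. It is not wpull's actual code: the regex, the `extract_candidates` helper, and the page URL (including its `https` scheme) are assumptions for illustration only. It treats any quoted string ending in a common file extension as a relative URL and resolves it against the page URL, which is roughly how a bare `xyz.js` ends up requested from the root.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern (NOT wpull's real one): a quoted string with no
# whitespace that ends in a common static-file extension.
CANDIDATE = re.compile(r"""['"]([^'"\s]+\.(?:js|css|html|png|jpg|gif))['"]""")


def extract_candidates(script_text, page_url):
    """Return absolute URLs for every string literal that looks like a link.

    The script is never executed, so the value of a variable such as `u`
    is unknown and the literal is resolved against the page URL instead.
    """
    return [urljoin(page_url, match) for match in CANDIDATE.findall(script_text)]


# The script fragment quoted in this thread.
script = (
    "g.type='text/javascript'; g.async=true; g.defer=true; "
    "g.src=u+'xyz.js'; s.parentNode.insertBefore(g,s);"
)

# Assumed page URL; the scheme is a guess.
page_url = "https://www.myintarweb.co.uk/"

print(extract_candidates(script, page_url))
```

With these assumptions the only candidate is `xyz.js`, resolved to the site root rather than to the /analyticks/ directory that the script would build at runtime from `u`.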

Sorry for the noise, and thanks for the issue. I don't think this has been documented anywhere previously, although it comes up all the time. I guess most website operators don't check their 404 logs.

Billybangleballs commented 4 years ago

Thanks for the explanation. I can work with that: a dummy /xyz.js will cure the 404 issue at my end without really breaking anything.