requesting non-existent url

Billybangleballs commented 4 years ago

Direct from my site log: everything goes swimmingly, and then the bot requests /xyz.js, which doesn't exist at that path, and gets a 404. There is a file of that name, but not in the root, and the HTML served at / is correct and doesn't refer to xyz.js as being under /. There is obviously a parsing error happening somewhere in the bot's code. I won't quote the HTML in this report, as it is publicly available to anyone who wants to look at it. It has to be the bot, or the log would be full of 404 lines generated by other users, and it isn't.

192.99.9.75 - - [08/Sep/2020:16:48:05 +0100] "GET / HTTP/1.1" 200 22693 "https://www.raspberrypi.org/blog/new-product-raspberry-pi-3-model-a/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:15 +0100] "GET /images/1610.jpg HTTP/1.1" 200 77507 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:16 +0100] "GET /favicon-16x16.png HTTP/1.1" 200 2373 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:17 +0100] "GET /author/steve/mm.jpg HTTP/1.1" 200 19656 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:18 +0100] "GET /favicon-32x32.png HTTP/1.1" 200 7171 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:18 +0100] "GET /android-chrome-512x512.png HTTP/1.1" 200 122008 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:19 +0100] "GET /banner.png HTTP/1.1" 200 200272 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:20 +0100] "GET /safari-pinned-tab.svg HTTP/1.1" 200 22540 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:21 +0100] "GET /apple-touch-icon.png HTTP/1.1" 200 20423 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:22 +0100] "GET /images/misc.jpg HTTP/1.1" 200 57159 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:23 +0100] "GET /images/tech.jpg HTTP/1.1" 200 65706 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.99.9.75 - - [08/Sep/2020:16:48:23 +0100] "GET /xyz.js HTTP/1.1" 404 21333 "https://www.myintarweb.co.uk/" "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
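For anyone wanting to run the same check on their own server, here is a rough sketch of how 404 responses could be tallied per user agent and path. It assumes the log uses the combined log format seen in the excerpt above. The names `LINE` and `not_found_by_agent` are made up for this illustration, and the user agent in the sample is shortened for readability.

```python
import re
from collections import Counter

# Assumed layout: combined log format, as in the excerpt above.
# Groups: 1 client IP, 2 timestamp, 3 method, 4 path, 5 status,
#         6 response size, 7 referer, 8 user agent.
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" '
    r'(\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)

def not_found_by_agent(lines):
    """Count 404 responses, keyed by (user agent, requested path)."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line.strip())
        if m and m.group(5) == "404":
            counts[(m.group(8), m.group(4))] += 1
    return counts

# Two sample lines modelled on the log above (user agent shortened).
UA = "ArchiveTeam ArchiveBot/20200413.2e71c9a (wpull 2.0.3)"
sample = [
    '192.99.9.75 - - [08/Sep/2020:16:48:23 +0100] "GET /images/tech.jpg HTTP/1.1" '
    '200 65706 "https://www.myintarweb.co.uk/" "' + UA + '"',
    '192.99.9.75 - - [08/Sep/2020:16:48:23 +0100] "GET /xyz.js HTTP/1.1" '
    '404 21333 "https://www.myintarweb.co.uk/" "' + UA + '"',
]

for (agent, path), n in not_found_by_agent(sample).items():
    print(n, path, agent)
```

If only one user agent shows up against /xyz.js, that supports the conclusion that the request comes from the bot rather than from the page itself.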

JustAnotherArchivist commented 4 years ago

That's a known problem and unfortunately impossible to solve without heavy machinery.

What happens here is that wpull tries to extract URLs from script blocks. Anything that looks like it could be a relative URL will be extracted. In this case, that includes the string xyz.js from g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'xyz.js'; s.parentNode.insertBefore(g,s);. Getting the actual URL, //www.myintarweb.co.uk/analyticks/xyz.js, is virtually impossible without running an actual JavaScript engine, which effectively means running the archival with a browser (as you'll also need the DOM, cookies, and various other things). That's possible (brozzler, crocoite, and others), but it's very resource-intensive and doesn't scale well.
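To make the failure mode concrete, here is a minimal sketch of this kind of heuristic. It is not wpull's actual extraction code: the pattern `CANDIDATE` and the function `guess_urls` are hypothetical and only illustrate how a quoted string that looks like a filename ends up resolved against the page URL.

```python
import re
from urllib.parse import urljoin

# Hypothetical heuristic (NOT wpull's real pattern): treat any quoted
# string ending in a common file extension as a candidate relative URL.
CANDIDATE = re.compile(r"""['"]([^'"\s]+\.(?:js|css|png|jpg|gif|html))['"]""")

def guess_urls(script_text, page_url):
    """Resolve every filename-looking string literal against the page URL."""
    return [urljoin(page_url, hit) for hit in CANDIDATE.findall(script_text)]

# The script fragment quoted above, and the page it was found on.
script = ("g.type='text/javascript'; g.async=true; g.defer=true; "
          "g.src=u+'xyz.js'; s.parentNode.insertBefore(g,s);")
page = "https://www.myintarweb.co.uk/"

print(guess_urls(script, page))
# The crawler never evaluates `u`, so the prefix is lost. The URL the
# browser would actually load resolves differently:
print(urljoin(page, "//www.myintarweb.co.uk/analyticks/xyz.js"))
```

The heuristic sees only the literal 'xyz.js' and has no way of knowing that the variable u holds a path prefix, so the guessed URL lands in the site root.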

Sorry for the noise, and thanks for the issue. I don't think this has been documented anywhere previously, although it comes up all the time. I guess most website operators don't check their 404 logs.

Billybangleballs commented 4 years ago

Thanks for the explanation. I can work with that: a dummy /xyz.js will cure the 404 issue at my end without really breaking anything.

ArchiveTeam / ArchiveBot

requesting non-existent url #462