ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Pollution in archives from web server running on localhost #391

Open JustAnotherArchivist opened 5 years ago

JustAnotherArchivist commented 5 years ago

Several pipelines also run a web server on the same machine without blocking the ArchiveBot wpull processes from reaching it via localhost or 127.0.0.0/8. Since ignores apply only to initial request URLs, not to redirect targets, pages from those web servers have repeatedly been retrieved under http://localhost/ and other URLs. The pipeline should attempt to retrieve http://localhost/ in the pre-flight check and bark if it doesn't get a connection refusal or similar.
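For illustration, a rough sketch of what such a pre-flight check could look like. This is not the actual ArchiveBot pipeline code: the function names, the port list, and the error message are all hypothetical, and it only tests whether a TCP connection succeeds rather than issuing a real HTTP request.

```python
import socket

# Hypothetical list of ports to probe; not taken from the pipeline code.
PORTS_TO_CHECK = (80, 443, 8080)


def local_port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or unreachable: nothing to worry about.
        return False


def find_local_web_servers(hosts=("127.0.0.1",), ports=PORTS_TO_CHECK):
    """Return every (host, port) pair on which something accepts connections."""
    return [(h, p) for h in hosts for p in ports if local_port_is_open(h, p)]


def preflight_check():
    """Bark (raise) if anything on the loopback interface answers."""
    offenders = find_local_web_servers()
    if offenders:
        raise RuntimeError(
            "Local service reachable from wpull, refusing to start: %r" % (offenders,)
        )
```

Calling `preflight_check()` once at pipeline startup would be enough for this sketch. A fuller check would probably also cover `localhost` by name (which may resolve to `::1`) and other addresses in 127.0.0.0/8.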

JustAnotherArchivist commented 4 years ago

#402 partially solves this by checking a few common ports. I'll leave this issue open until all existing pipelines are either verified to be webserver-free (or to have access blocked with iptables or similar) or upgraded.