ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Detect when nodes are banned from certain sites #173

Open · hannahwhy opened this issue 9 years ago

hannahwhy commented 9 years ago

People have shoved enough stuff through ArchiveBot that we're now running into a problem where certain high-usage fetch nodes are being banned from large website providers. One of them appears to be banned from all Squarespace sites.

A similar problem exists for nodes started on heavily filtered networks. The CheckIP task gives us some protection against this, but it can't cover all the bases. For example, we had a node in Singapore that would have been unable to grab anything that fell under the censorship list of the Singapore Media Development Authority.
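
For illustration, a CheckIP-style canary fetch could work along these lines. This is a minimal sketch, not the real task: the canary URL, the expected snippet, and the function name are all assumptions for the example. The idea is that a filtering network tends to block, reset, or rewrite a page whose content we already know.

```python
# Hypothetical canary fetch; CANARY_URL and EXPECTED_SNIPPET are
# illustrative stand-ins, not ArchiveBot's actual CheckIP configuration.
import urllib.request

CANARY_URL = "http://example.com/"    # a page whose content we know
EXPECTED_SNIPPET = b"Example Domain"  # what an unfiltered fetch returns

def network_looks_filtered(timeout=10):
    """Return True if the canary fetch fails or comes back rewritten,
    suggesting the node sits behind a filtering network."""
    try:
        with urllib.request.urlopen(CANARY_URL, timeout=timeout) as resp:
            body = resp.read()
    except OSError:
        # DNS failure, connection reset, a block page served as an HTTP
        # error, or a timeout: treat all of these as "filtered".
        return True
    # Transparent proxies and block pages usually rewrite the body.
    return EXPECTED_SNIPPET not in body
```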

It would be nice to identify when it looks like a node cannot complete a job due to these conditions and send out an alert. With suspend/resume it would then be possible to move that job to a different node.
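
As a sketch of what that detection could look like (none of this is existing ArchiveBot code; the class name, window size, and threshold are invented for the example), a pipeline that can observe each response might keep a sliding window of outcomes and raise an alert once nearly everything in it is an error:

```python
# Sketch of per-job ban detection; all names and thresholds here are
# illustrative assumptions, not part of the ArchiveBot codebase.
from collections import deque

class BanDetector:
    """Flag a job whose recent responses are overwhelmingly 4xx/timeouts."""

    def __init__(self, window=50, threshold=0.9, alert=print):
        self.recent = deque(maxlen=window)  # 1 = error, 0 = success
        self.threshold = threshold
        self.alert = alert  # e.g. a hook that messages the IRC channel

    def record(self, status_code, timed_out=False):
        is_error = timed_out or (status_code is not None
                                 and 400 <= status_code < 500)
        self.recent.append(1 if is_error else 0)
        # Only judge once the window is full, to avoid alerting on the
        # first unlucky response or two.
        if (len(self.recent) == self.recent.maxlen
                and sum(self.recent) / len(self.recent) >= self.threshold):
            self.alert("node appears banned or filtered for this job")
```

An alert like that would give an operator the signal to suspend the job and resume it on a different node.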

hannahwhy commented 9 years ago

One way to do this might be to start jobs on multiple pipelines hosted on (e.g.) different ASes, but the current job dequeuing setup doesn't allow that.

Perhaps some sort of prefetch check task might work: if the first few responses are 4xx or network timeouts, the job is failed and put back into the queue.
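
A rough sketch of that prefetch check follows. The fetch and queue interfaces are hypothetical placeholders for whatever the pipeline would actually expose; the logic is just "probe a few URLs, and if every probe is a 4xx or a timeout, fail the job back onto the queue".

```python
# Hypothetical prefetch check; fetch() and queue.requeue() stand in for
# whatever interfaces the pipeline would actually provide.
def prefetch_check(urls, fetch, queue, job, probes=5):
    """Probe the first few URLs. If every probe is a 4xx or a timeout,
    fail the job and put it back on the queue for another node.

    fetch(url) is assumed to return (status_code_or_None, timed_out).
    """
    results = [fetch(url) for url in list(urls)[:probes]]
    all_bad = bool(results) and all(
        timed_out or (code is not None and 400 <= code < 500)
        for code, timed_out in results)
    if all_bad:
        queue.requeue(job)  # let a different pipeline pick it up
        return False        # caller aborts the crawl on this node
    return True             # looks healthy; proceed with the real crawl
```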