ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

analyzer, log-firehose, or trimmer crash may result in a functional deadlock (aka. postmortem on the 2021-08-05 incident) #519

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago

Last night, various things on the control node crashed. I haven't been able to establish the exact chain of events, but I believe that first either Redis or the trimmer had some sort of minor issue, which caused the trimmer to die with an error. This meant that the job logs accumulated, and eventually Redis was slaughtered by the OOM killer. Unfortunately, the RDB file already had these accumulated logs, which meant that restarting Redis would simply lead to another OOM kill immediately.

As I understand it, this could also happen if the trimmer was still working fine but the analyzer or the log-firehose crashed, because that would also break trimming (via no longer updating last_analyzed_log_entry and last_broadcasted_log_entry, respectively).
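
For a given job, the lag should be visible by comparing those two counters. Assuming they are stored as fields on the per-job hash (an assumption based on the field names above; <job_id> is a placeholder for a job ident), something like this would show them:

  # Both fields should keep advancing while the analyzer and the
  # firehose are alive; if one stops moving, trimming stalls with it.
  redis-cli hmget <job_id> last_analyzed_log_entry last_broadcasted_log_entry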

I'm not sure what the solution for this is – other than redesigning the entire log system so that messages don't go into Redis in the first place (which is planned).


I'll also use this to document the fix:

  1. Stop all (or most) pipelines' SSH connections to prevent them from immediately spamming Redis with further log lines.
  2. Stop anything else memory-intensive that's still running and not really needed (dashboard, websocket, cogs).
  3. Restart Redis and hope that it doesn't OOM. (If it does, free more RAM or temporarily increase swap; see the example commands after this list.)
  4. Run the analyzer and the firehose manually for all jobs.
    • This is to update the two fields mentioned above so that the trimmer can do its job. The analyzer will do its usual thing, and the firehose will send the log messages into the void (since the dashboard WebSocket server isn't running), but that's fine.
    • The normal way to run these is with updates-listener, but that wouldn't work because the pipelines are disconnected, so no job IDs are being pushed to the updates channel.
    • Grepping for job IDs like this is obviously not perfect, but since most job IDs are 24 or 25 characters long, it should narrow the key list down enough to get out of the OOM zone.
    • redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/analyze-logs
    • redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 FIREHOSE_SOCKET_URL=tcp://127.0.0.1:12345 plumbing/log-firehose
    • Naturally, it should be possible to just update the broadcasted key directly, but I didn't look into how to do that. (A guess at what that might look like follows after this list.)
  5. Run the trimmer manually: redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/trim-logs >/dev/null
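
For step 3, a temporary swap file on Linux can be created along these lines (the size and path are arbitrary examples):

  # allocate an 8 GiB swap file, activate it, and remove it again
  # once Redis has finished loading the RDB
  fallocate -l 8G /swapfile
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  # later, once memory pressure is gone:
  swapoff /swapfile && rm /swapfile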
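
As for the direct update mentioned in step 4: if last_broadcasted_log_entry is just a field on the per-job hash, it might be as simple as the following, with <job_id> and <entry_id> as placeholders for a job ident and the newest log entry ID. I haven't tested this.

  # untested guess: mark all log entries up to <entry_id> as broadcast
  redis-cli hset <job_id> last_broadcasted_log_entry <entry_id>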