gchq / sleeper

A cloud-native, serverless, scalable, cheap key-value store
Apache License 2.0

Gracefully stop compaction ECS tasks when upgrading Sleeper instance #640

Open patchwork01 opened 1 year ago

patchwork01 commented 1 year ago

This is split from https://github.com/gchq/sleeper/issues/578.

An instance of Sleeper may have compaction or splitting compaction tasks that run constantly with a never-ending stream of jobs. When we upgrade the instance to a new version, we'd like to terminate and re-launch those running tasks so that they move to the new version.

This should be done with minimal impact on the system. Tasks should show as having finished in the system reports (which they currently won't if they're terminated halfway). Any updates or commits in SQS, DynamoDB or S3 should be fully applied or fully reverted. We should make sure the system gets to a consistent state within 30 seconds of ECS sending a task the terminate signal. The tasks should terminate promptly, without the need for ECS to send a kill signal.

We should also invoke lambdas to create new compaction and splitting compaction tasks. Any jobs that were being processed by running tasks should have been released back to the SQS queue, so that the lambda function can start tasks appropriately.
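
As a rough illustration of the behaviour described above, here is a minimal sketch of a task loop that stops cooperatively. It is not taken from the Sleeper codebase: `GracefulCompactionTask`, `Job`, `requestStop` and the event strings are all hypothetical names, an in-memory `Deque` stands in for the SQS queue, and a list of strings stands in for the status store behind the reports. It assumes ECS delivers SIGTERM to the container and follows up with SIGKILL once the stop timeout (30 seconds by default) expires, so the shutdown hook waits for less than that.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: none of these names come from the Sleeper codebase.
class GracefulCompactionTask {

    /** A job modelled as a series of small work steps. */
    interface Job {
        String id();

        /** Does one unit of work and returns true once the job is complete. */
        boolean runNextStep();
    }

    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final Deque<Job> queue;                         // stands in for the SQS job queue
    private final List<String> events = new ArrayList<>();  // stands in for the status store

    GracefulCompactionTask(Deque<Job> queue) {
        this.queue = queue;
    }

    /** Asks the loop to stop at the next safe point. */
    void requestStop() {
        stopRequested.set(true);
    }

    List<String> events() {
        return events;
    }

    /** Runs jobs until the queue is empty or a stop has been requested. */
    void run() {
        events.add("task started");
        while (!stopRequested.get()) {
            Job job = queue.pollFirst();
            if (job == null) {
                break;
            }
            events.add("job started: " + job.id());
            boolean done = false;
            while (!done && !stopRequested.get()) {
                done = job.runNextStep();
            }
            if (done) {
                events.add("job finished: " + job.id());
            } else {
                // Stopped mid-job: hand the job back so another task can pick it up.
                // A real task would also have to discard any partial output here.
                queue.addFirst(job);
                events.add("job released: " + job.id());
            }
        }
        // Always recorded, so the task shows as finished in the reports.
        events.add("task finished");
    }

    /** Registers a JVM shutdown hook, which runs when the process receives SIGTERM. */
    void installShutdownHook(Thread worker) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            requestStop();
            try {
                // Assumed budget of 25 seconds, below the default ECS stop timeout.
                worker.join(25_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```

The worker thread would call `run()`, and the main thread would call `installShutdownHook(worker)` before starting it. With a real SQS queue, releasing a job would more likely mean making the in-flight message visible again (for example by setting its visibility timeout to zero) than re-adding it to a collection; that detail is an assumption and is left out of the sketch.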

patchwork01 commented 1 year ago

Another option would be to pause the Lambda that creates the compaction jobs, wait for the tasks to finish the current jobs, then unpause the job creation. We could use a rule that would run in the background to check regularly whether the tasks have finished and turn the job creation back on.

That wouldn't work for ingest, because ingest jobs are generated outside the system.
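
The pause-and-drain option could be sketched roughly as follows. Everything here is hypothetical: `PausedUpgrade`, `JobCreationSwitch` and `upgradeWhenDrained` are illustrative names, the running-task count is supplied by the caller rather than read from ECS, and a simple polling loop stands in for the background rule mentioned above. The one design point it tries to show is that job creation is always switched back on, even if the tasks never drain.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: none of these names come from the Sleeper codebase.
class PausedUpgrade {

    /** Stands in for whatever enables or disables the job creation Lambda's trigger. */
    interface JobCreationSwitch {
        void pause();

        void resume();
    }

    /**
     * Pauses job creation, waits for running tasks to drain, then runs the upgrade.
     * Returns true if the upgrade ran, false if the tasks did not drain in time
     * or the wait was interrupted. Job creation is resumed in every case.
     */
    static boolean upgradeWhenDrained(JobCreationSwitch jobCreation,
                                      IntSupplier runningTasks,
                                      Runnable upgrade,
                                      int maxChecks,
                                      long checkIntervalMillis) {
        jobCreation.pause();
        try {
            for (int check = 0; check < maxChecks; check++) {
                if (runningTasks.getAsInt() == 0) {
                    upgrade.run();
                    return true;
                }
                Thread.sleep(checkIntervalMillis);
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            jobCreation.resume();
        }
    }
}
```

In a deployed system the check would more plausibly run as a scheduled rule than as a blocking loop, but the ordering of pause, check, upgrade and resume would be the same.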