gchq / sleeper

A cloud-native, serverless, scalable, cheap key-value store
Apache License 2.0

Make Fargate and EC2 compaction not mutually exclusive at deploy time #2936

Open patchwork01 opened 1 month ago

patchwork01 commented 1 month ago

Background

The system test CompactionOnEC2IT takes a very long time to run because it needs a separate instance of Sleeper deployed just for this test.

This is because the Fargate and EC2 compaction launch types have separate ECS task definitions, and only one is ever deployed at a time in a Sleeper instance by CompactionStack in the CDK.

Description

We'd like to avoid the need to deploy multiple instances of Sleeper to handle the different compaction launch types.

Analysis

We can just deploy both ECS task definitions in CompactionStack.

The EC2 task definition comes with an autoscaling capacity provider, but it looks like this scales to zero by default. We should confirm whether always including it in the CompactionStack deployment adds any cost.

System test

If we deploy both compaction launch types at once, we could consider whether we can convert CompactionOnEC2IT to use the main system test instance. If we can, we could move it to the quick system test suite.

If we want to be able to run compaction tasks on either Fargate or EC2, in system tests running against the same instance, with the same instance properties, we would need a way to override the launch type instance property. We would also need to somehow compensate for the fact that there's just one compaction job queue, and any compaction tasks can take jobs from the queue, whether they're on Fargate or EC2.
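As a rough illustration of the override idea, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the class name, the property key `example.compaction.launch.type`, and the FARGATE default are hypothetical and not taken from Sleeper's code. It only shows a per-test override taking precedence over the instance property.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: pick a launch type, letting a per-test override win over the instance property. */
class LaunchTypeResolver {
    enum LaunchType { FARGATE, EC2 }

    // Hypothetical property key, for illustration only; not Sleeper's real property name.
    static final String LAUNCH_TYPE_KEY = "example.compaction.launch.type";

    static LaunchType resolve(Map<String, String> instanceProperties, Optional<LaunchType> override) {
        if (override.isPresent()) {
            return override.get();
        }
        // Assumed default of FARGATE when the property is unset.
        String configured = instanceProperties.getOrDefault(LAUNCH_TYPE_KEY, "FARGATE");
        return LaunchType.valueOf(configured.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        properties.put(LAUNCH_TYPE_KEY, "EC2");
        System.out.println(resolve(properties, Optional.empty()));
        System.out.println(resolve(properties, Optional.of(LaunchType.FARGATE)));
    }
}
```

This does nothing about the shared compaction job queue, so an override on its own would not stop tasks of the other launch type from taking the test's jobs.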

We want to run system tests in parallel at some point. To support system tests running on Fargate and EC2 at the same time in the same instance, it seems we would need to change how tasks are run substantially from how they run now. We could keep the test running on a separate instance, or rethink how we support both launch types at once.

Supporting both launch types at once

If we wanted to support both launch types at the same time, we may need a setup more similar to ingest, which has multiple ingest job queues, one for each type of ingest or bulk import. Otherwise, we can't control which launch type will be used for which job.
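To make the queue-per-launch-type idea concrete, here is a minimal sketch in plain Java. It uses in-memory queues as a stand-in for separate job queues, and all class and method names are hypothetical rather than Sleeper's actual code. The point is only that a task polling for one launch type can never receive a job sent to the other.

```java
import java.util.ArrayDeque;
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;

/** Hypothetical sketch: one in-memory queue per launch type, standing in for separate job queues. */
class LaunchTypeQueues {
    enum LaunchType { FARGATE, EC2 }

    private final Map<LaunchType, Queue<String>> queues = new EnumMap<>(LaunchType.class);

    LaunchTypeQueues() {
        for (LaunchType type : LaunchType.values()) {
            queues.put(type, new ArrayDeque<>());
        }
    }

    /** Sends a job to the queue for the launch type that should run it. */
    void send(LaunchType type, String jobId) {
        queues.get(type).add(jobId);
    }

    /** A task only polls the queue for its own launch type. */
    Optional<String> poll(LaunchType type) {
        return Optional.ofNullable(queues.get(type).poll());
    }

    public static void main(String[] args) {
        LaunchTypeQueues queues = new LaunchTypeQueues();
        queues.send(LaunchType.EC2, "job-1");
        System.out.println(queues.poll(LaunchType.FARGATE).isPresent());
        System.out.println(queues.poll(LaunchType.EC2).orElse("none"));
    }
}
```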

If we deploy both launch types in the same instance at once, and try to switch between them, it's likely there would be overlap. If we don't explicitly shut down all tasks of the old launch type, they'll keep running and we'll have a mix of both. If they share the same compaction job queue, as they do now, the old tasks would keep picking up fresh jobs from the queue. We can consider how we should manage this.
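One possible way to handle the overlap is to work out which running tasks are on a launch type that is no longer wanted, then stop those. The sketch below shows only the selection step in plain Java. `RunningTask` and `taskIdsToStop` are hypothetical names, and how the tasks are listed and stopped in ECS is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: find running tasks whose launch type is no longer the desired one. */
class StaleTaskFinder {
    enum LaunchType { FARGATE, EC2 }

    /** Minimal stand-in for a running compaction task. */
    static final class RunningTask {
        final String taskId;
        final LaunchType launchType;

        RunningTask(String taskId, LaunchType launchType) {
            this.taskId = taskId;
            this.launchType = launchType;
        }
    }

    /** Returns the IDs of tasks that are not running on the desired launch type. */
    static List<String> taskIdsToStop(List<RunningTask> running, LaunchType desired) {
        List<String> ids = new ArrayList<>();
        for (RunningTask task : running) {
            if (task.launchType != desired) {
                ids.add(task.taskId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<RunningTask> running = List.of(
                new RunningTask("task-a", LaunchType.FARGATE),
                new RunningTask("task-b", LaunchType.EC2));
        System.out.println(taskIdsToStop(running, LaunchType.EC2));
    }
}
```

Stopping a task partway through a compaction job raises its own questions, such as whether the job returns to the queue, which this sketch does not address.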

As things stand, we have to redeploy to change launch type, which may be highly disruptive to a running instance. We should check what happens when you redeploy while compaction tasks are running. It may force all the tasks of the old launch type to stop, or it may fail if it can't stop them. There may also be a period during the deployment where we can't run tasks of either type. It seems likely to be valuable to improve on this.

We can split out a separate issue for terminating tasks that are running on a launch type that is no longer desired.

m09526 commented 1 week ago

"The EC2 task definition comes with an autoscaling capacity provider, but it looks like this scales to zero by default."

If it doesn't scale to zero, then there is definitely a bug. It was designed to terminate all EC2 instances that aren't doing anything, so that it truly scales to zero.