aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS]: ECS Stateful Services #127

Open · Akramio opened 5 years ago

Akramio commented 5 years ago

This issue, related to stateful services, complements issue #64 regarding ECS/EBS native integration. The proposal is to introduce a new type of ECS Services in which each task is allocated a unique identifier that persists even if the task dies and is replaced.

Background

ECS users have expressed the need to deploy stateful workloads on ECS. For example, in #64, customers would like native integration between ECS and EBS so that ECS Tasks are automatically and dynamically attached to an EBS volume.

Stateless containerized applications have traditionally been deployed as services with 'fungible' tasks, meaning that the different instantiations of a single application are interchangeable. To deploy these workloads, ECS Services allow customers to run and maintain a specified number of instantiations of the same Task simultaneously.

Certain workloads however require each Task within a service to play a specific role. This is particularly true for some stateful workloads, in which specific Tasks play a special role such as ‘primary’ or ‘leader’.

Potential Feature Proposal

I am opening this issue to gather use-cases and +1s on a potential feature that would introduce ECS Services for stateful workloads, in which each Task is assigned an identifier. In this potential scenario, if a Task dies and a replacement Task is started by ECS, the same identifier, volume, and Service Discovery / Cloud Map name will be allocated to the new task.

As we research this potential feature, please +1 if it would be helpful to you, and share details of any use-cases you may have.
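
To make the proposal concrete, here is a purely hypothetical sketch of what such a service definition might look like, written as a Python dict; the `statefulConfiguration` block and everything inside it are invented for illustration and do not exist in the ECS API:

```python
# Hypothetical shape of a "stateful" ECS service definition. None of the
# fields under statefulConfiguration exist today; they only restate the
# proposal: a stable per-task identity that survives task replacement,
# with the volume and Cloud Map name bound to that identity.
stateful_service = {
    "serviceName": "zookeeper",
    "taskDefinition": "zookeeper:1",
    "desiredCount": 3,
    "statefulConfiguration": {
        "identifierPrefix": "zookeeper",  # tasks become zookeeper-0..2
        "persistentVolumes": [{"name": "data", "sizeInGiB": 50}],
        "serviceDiscoveryNamespace": "internal.example",
    },
}
```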

chrismccracken commented 5 years ago

This would pin tasks to a single AZ due to the EBS volume's residency, no? Multi-AZ failover would have to be architected independently of this at task provisioning time, even with host instances spread across multiple AZs?

FernandoMiguel commented 5 years ago

@christopherhein some people seem to want pets

jdub commented 5 years ago

I have a use case for Task-associated EBS volumes on ECS, via AWS Batch. I hope it's relevant to your question.

I have Batch jobs that download some data (between 50GB and 1TB), process it, and upload the results. Each job requires sufficient storage space to download the data, and in some cases, the same amount again during processing.

This means I have to:

If instead I could specify that a task requires X GBs of attached EBS storage, and that was exposed via Batch's job definitions and overrides – exactly the same as CPU and RAM today – then I'd have an EBS volume of precisely the right size per job. Perfect!
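
For illustration only, the requested job definition might look like the sketch below; `EBS_GB` is an invented resource type (the real `resourceRequirements` types are `VCPU`, `MEMORY`, and `GPU`):

```python
# Hypothetical Batch job definition: the "EBS_GB" resource type does NOT
# exist; it illustrates exposing attached EBS storage via the job
# definition exactly the same way as CPU and memory today.
job_definition = {
    "jobDefinitionName": "process-dataset",
    "type": "container",
    "containerProperties": {
        "image": "example/processor:latest",  # placeholder image
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "16384"},
            {"type": "EBS_GB", "value": "2000"},  # hypothetical: 2 TB scratch
        ],
    },
}
```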

(Thanks again for trying this approach to open feedback!)

francispereira commented 5 years ago

+1

We run a custom database on ECS. Each customer is allocated a dedicated database. Depending on the customer's data size, the memory allocated to the task differs. One task + service represents a customer's database. For now, we pin tasks to the container instance where the customer's data is stored. The ability to associate an EBS volume with a task would help us bin-pack container instances and provide the agility we need to move from one instance to another.

JacobASeverson commented 5 years ago

+1

I was considering opening a related issue for assigning an ordinal (or identifier including an ordinal) for each task in a service. A use case is for clients within a task application to be able to identify themselves across restarts. Other platforms such as Cloud Foundry and Kubernetes expose this through environment variables and hostnames, respectively.
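
For comparison, Kubernetes StatefulSets expose exactly this: pods are named `<statefulset>-<ordinal>` and the hostname matches the pod name, so a process can recover its identity across restarts. A minimal Python example:

```python
import socket

# In a Kubernetes StatefulSet, the pod hostname is "<statefulset>-<ordinal>"
# (e.g. "consumer-2") and is stable across restarts of that replica.
hostname = socket.gethostname()
ordinal = int(hostname.rsplit("-", 1)[1])
print(f"replica identity: {hostname} (ordinal {ordinal})")
```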

simplesteph commented 5 years ago

+1 this would allow running Kafka on ECS very easily. Even better, please combine it with Fargate so you get serverless, ops-less stateful containers.

luyun-aa commented 5 years ago

+1 We have an ECS cluster for microservices. Right now we are looking for an AWS-native solution to containerize our offline Spark SQL jobs. While EKS looks good for this, it would be nicer to use one technology to handle them all, so that we could focus development resources on one solution.

kevinkreiser commented 5 years ago

:+1:

We've got a few scenarios where our services depend on large-ish (100s-1000s of GB) read-only datasets. These can be pulled at task startup from S3, but this essentially nukes our ability to auto-scale (because we have to wait for the data to land before servicing requests).

A feature where we could spin up a task to prime the pump (i.e. preload an EBS volume) so that future tasks could start and simply mount that same volume read-only would allow us to properly auto-scale. Failing that, a mechanism to pre-bake a set number of EBS volumes so that we have capacity for an auto-scale event would be less slick but just as effective, if it's easier to implement.
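
Until something native exists, the pre-baking has to be scripted by hand. A minimal boto3 sketch of restoring a dataset volume from a snapshot and attaching it to a container instance (all IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Create a volume from the primed snapshot in the AZ where the task will run.
vol = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",  # placeholder: primed dataset snapshot
    AvailabilityZone="us-east-1a",
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# Attach to the container instance; the task then bind-mounts it read-only.
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder: container instance
    Device="/dev/sdf",
)
```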

In any case, thanks for thinking about this!

Akramio commented 5 years ago

@kevinkreiser do you mind my asking why the EBS volume needs to be pre-provisioned before you launch your Task? Would it work if you could provide a snapshot-id, and ECS would create an EBS volume for that Task based on the snapshot you provided?

Akramio commented 5 years ago

@simplesteph interesting that you mention Kafka in the context of Fargate. Do you see any challenges or concerns on Fargate with stateful apps like Kafka given that you don't have access to an actual virtual machine? For example, you currently can't run privileged containers on Fargate.

kevinkreiser commented 5 years ago

@Akramio yes, providing a snapshot id would be perfectly fine so long as that mechanism is relatively quick, even in the presence of snapshots that are several hundred gigabytes in size. Are you saying that this is already a possibility, or is this the most likely path for implementing the feature?

Akramio commented 5 years ago

@kevinkreiser we're still exploring the best way to implement this, but it seems like not having to pre-create an EBS volume is easier (assuming EBS volumes can be created on the fly from a snapshot quickly enough).

pbecotte commented 5 years ago

Was just wondering if anyone had any ideas for a workaround for this? I want to deploy three tasks in a service (zookeeper-1, zookeeper-2, and zookeeper-3), attached to vol1, vol2, and vol3 respectively. Then, on a deploy, I would like zookeeper-1 to go down and the new instance of zookeeper-1 to attach to the same vol1.

I did imagine running three services as a workaround, but couldn't find any way to do a rolling deploy of a group of services using CloudFormation or Terraform.
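
One partial workaround on the EC2 launch type is the `rexray/ebs` Docker volume driver, which ECS task definitions can reference through `dockerVolumeConfiguration`. A sketch of the three-services approach (the driver must be installed on the container instances, and the rolling deploy across the group still has to be orchestrated externally):

```python
import boto3

ecs = boto3.client("ecs")

# One single-task service per ZooKeeper node, each with its own named EBS
# volume provisioned by the rexray/ebs Docker volume driver.
for i in (1, 2, 3):
    ecs.register_task_definition(
        family=f"zookeeper-{i}",
        containerDefinitions=[{
            "name": "zookeeper",
            "image": "zookeeper:3.8",
            "memory": 1024,
            "mountPoints": [
                {"sourceVolume": f"zk-data-{i}", "containerPath": "/data"}
            ],
        }],
        volumes=[{
            "name": f"zk-data-{i}",
            "dockerVolumeConfiguration": {
                "scope": "shared",       # volume outlives the task
                "autoprovision": True,   # create on first use
                "driver": "rexray/ebs",
                "driverOpts": {"volumetype": "gp2", "size": "20"},
            },
        }],
    )
```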

markusschaber commented 5 years ago

We currently have two use-cases:

  1. Running RabbitMQ (both single-instance and multi-instance multi-AZ HA clusters). Each instance needs its own permanent storage, which must survive container recreation. An EBS volume bound to the service instance would be best for this.
  2. Shared data backend. Some services (whose instance count scales with load) need to write and read data which the user uploads. We're currently using EFS for this. (As we need durability, FSx does not seem like a good fit.) We're also thinking about writing S3 backends for this use case; however, in some cases we use 3rd-party libraries which assume file-system interfaces.
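
For the shared-data backend, ECS does have native EFS support in the task definition. A minimal sketch (the file system ID is a placeholder):

```python
import boto3

ecs = boto3.client("ecs")

# EFS-backed shared volume, mountable by any number of tasks across AZs.
ecs.register_task_definition(
    family="shared-data-service",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[{
        "name": "app",
        "image": "example/app:latest",  # placeholder image
        "mountPoints": [
            {"sourceVolume": "uploads", "containerPath": "/var/uploads"}
        ],
    }],
    volumes=[{
        "name": "uploads",
        "efsVolumeConfiguration": {
            "fileSystemId": "fs-0123456789abcdef0",  # placeholder
            "transitEncryption": "ENABLED",
        },
    }],
)
```
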
vaibhavzo commented 5 years ago

This would be really nice to have. We have also encountered a use case in setting up stateful Prometheus instances. EBS volumes mapped to the service, and reattached to the new EC2 instance upon recreation, would be ideal.

youwalther65 commented 5 years ago

+1 Our customer wants a 5x10 solution for its custom-made shopping cart application, with data on a SQL database and local volumes. ECS Fargate tasks scheduled for 5x10 with persistent, re-attachable EBS volumes would fit great and would even give the customer a cost reduction.

larrywax commented 4 years ago

+1 we would like to containerize our Elasticsearch cluster and stop messing around with Ansible playbooks. We could achieve this with EKS, but it would be amazing to only have to maintain our ECS infrastructure.

bhavintr commented 4 years ago

We are planning a stateful application that hosts an LDAP service and stores the LDAP directory locally on each task. (We are planning to use EBS, as the LDAP directory requires block-level storage, and to do block-level replication whenever changes are detected in the directory data on one of the running tasks.) To fulfill this requirement, we want to make sure that if a task is terminated, the existing EBS volume hosting the data can reattach itself automatically and dynamically to the new task. How can we accomplish this?

markmsmith commented 4 years ago

Our first use case for this is running a pair of Prometheus monitoring hosts, each with a Thanos sidecar. The monitor host retains a sliding window of the last 2 hours of data, before it's compacted and shipped off to S3 by Thanos. In order to have 2 instances for HA, and not lose the last 2 hours of data every time we deploy, we currently have to run these as EC2 instances and have scripts to ensure host1 gets EBS volume1 reattached and host2 gets volume2 reattached (very similar to the Zookeeper case above).

We would love to run these as ECS Fargate tasks and just have it ensure that a) there's only ever 1 instance using a given volume, b) each instance sticks with the same one and gets its own volume, and c) ideally we could do rolling deploys for zero downtime.

Our 2nd use case is running a singleton Thanos Compactor service, which requires a fairly large (100 GB) volume; this currently forces us to go with EC2 to get an EBS volume that's large enough. If we could instead mount a large enough ephemeral volume, we could run this in Fargate as well.
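
For the compactor case specifically, Fargate has since added configurable ephemeral storage (platform version 1.4.0+, up to 200 GiB), which would cover the 100 GB requirement. A sketch (image tag and sizes are examples):

```python
import boto3

ecs = boto3.client("ecs")

# Fargate allows sizing task ephemeral storage beyond the 20 GiB default,
# which covers the compactor's scratch-space requirement.
ecs.register_task_definition(
    family="thanos-compactor",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="4096",
    ephemeralStorage={"sizeInGiB": 150},
    containerDefinitions=[{
        "name": "compactor",
        "image": "quay.io/thanos/thanos:v0.32.0",  # example tag
        "command": ["compact", "--data-dir=/scratch"],
    }],
)
```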

deuscapturus commented 4 years ago

We would need this to run a Neo4j cluster on ECS. "Core" cluster members elect a leader.

If ECS delivers this, it should solve a problem Kubernetes suffers from with StatefulSets, which require a defined headless service for each expected pod in the StatefulSet. As such, Kubernetes StatefulSets are unable to freely autoscale.

tstibbs commented 4 years ago

I'm less clear about the need for properly stateful workloads, but having EBS volumes that can persist between runs of a task (either by literally keeping the volume around or by snapshotting and restoring) would enable Fargate to be used in a number of cases where it can't currently. Most other problems can be worked around (e.g. by third-party service discovery mechanisms).

Use-case 1: We want to run developer environments in Fargate, using something like Theia or cdr-server (both basically provide something similar to Visual Studio Code Spaces). When the developer has finished working, you'd obviously want to shut down the container (for cost reasons), but you clearly need to keep their files around. The files don't necessarily need to stay on the volume, they could be moved off to S3 or EFS, but this needs to be handled by Fargate (not the container image) to ensure they get persisted correctly in the event that the container dies prematurely (e.g. due to the out-of-memory killer).

Use-case 2: I currently have an EC2 instance running a Nexus instance in a Docker container, storing about 3 TB of artifacts (on a bind-mounted EBS volume). If the container dies on the EC2 instance for some reason, Docker will start it again, all the files will still be on disk, and most likely everything will just work. Even if the EC2 instance disappears for some reason, I can just fire up another instance, attach the data volume, pull/run my container, and everything is good again. However, in Fargate all the data would be lost (even if you could go above the 20 GB storage limit), so you'd want to ensure it's stored on an external medium, so it was available when a replacement task was fired up. Snapshotting or transferring to EFS is probably not desirable given the size of the volume, so in this case you'd probably want the EBS volume to just sit there waiting to be attached to the next task.

nickpoorman commented 3 years ago

Any service that has "high-availability" via replication (such as Raft) and needs to persist to disk is going to require the ability for a volume to be "detached" when a container goes down and "reattached" when it comes back up. The recent EFS support could help with this. However, I still need a stable identifier for the container to implement a solution.

The ideal solution would allow me to automatically scale up ECS containers. They would get their own persistent volume (or identifier so I can use EFS) from a pool. If the pool is empty, one should be created. That way when I scale up, the container is attached to a volume from the pool. When they scale down or restart, the volume goes back into the pool.

cvejanovic commented 3 years ago

We want to run redis stream consumers as an ECS service (Fargate). We need each worker process (container) to have a unique but persistent identity (consumer name) in order to handle worker crashes/restarts gracefully. Kubernetes StatefulSets serve this purpose well but there appears to be no alternative or workaround in ECS and Fargate.
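
For context, Redis consumer groups key pending-message recovery to the consumer name, which is why a stable identity matters. With redis-py (the host, stream, and group names are examples):

```python
import socket

import redis

r = redis.Redis(host="redis.internal", port=6379)

# The consumer name must be stable across container restarts: a restarted
# worker reclaims its pending (delivered-but-unacked) messages only if it
# rejoins the group under the same name.
consumer = socket.gethostname()  # would need to persist across task replacement

# Read this consumer's pending entries first ("0"), then new ones (">").
pending = r.xreadgroup("workers", consumer, {"orders": "0"}, count=100)
new = r.xreadgroup("workers", consumer, {"orders": ">"}, count=100, block=5000)
```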

vibhav-ag commented 3 years ago

Long-delayed follow-up: we are looking more closely into this feature request. As we dig deeper, I had a follow-up question: what is the update & deployment behavior that you would like to see for these applications?

acdha commented 3 years ago

> Long-delayed follow-up: we are looking more closely into this feature request. As we dig deeper, I had a follow-up question: what is the update & deployment behavior that you would like to see for these applications?

Most of my needs have fallen into two broad categories:

  1. Completely ephemeral data: caches, processing space, replicated storage, etc. where it would be completely reasonable to say each container starts with a new freshly-formatted volume and any initialization will be handled by the application.
  2. Long-lived data which could be rebuilt, but where doing so would be relatively expensive and would delay the time it takes a task to come online. In general, this would all fit a broad pattern where I'd like a task which is replacing another task to inherit the partition, but I'm not sure trying to implement that would be worth breaking the conceptual simplicity of tasks coming up and health-checking before the older tasks are drained.

    The situations where this comes up for me have been where I have some kind of synchronization mechanism available so one possible option could be something allowing tasks to start with a volume from a snapshot, so startup delay could be limited to the time it takes to synchronize only the changes since the last snapshot. The main goal I'd have here would be something like minimizing the chance of every task in the service restarting at the same time under normal conditions so, for example, something like ZooKeeper wouldn't lose quorum due to a task definition update.

tstibbs commented 3 years ago

> Long-delayed follow-up: we are looking more closely into this feature request. As we dig deeper, I had a follow-up question: what is the update & deployment behavior that you would like to see for these applications?

@vibhav-ag although there are a few different use-cases for this feature, for me the update and deploy behaviour will probably always be the same, regardless of whether it's an update (i.e. new version of the container) or a replacement of an unhealthy container. I'd expect something like this (not sure if this is the level of detail you were after):

  1. Fargate forcibly kills any containers that are still running (e.g. containers which are running but have failing health checks).
  2. On-disk state is 'captured' (e.g. by detaching the EBS data volume from the container or by snapshotting its contents into EBS snapshots or S3).
  3. New container created.
  4. On-disk state is recreated and attached to new container (e.g. by attaching the EBS data volume, or creating a new one from the snapshot or S3 backup).
  5. New container runs, any persistent unique IDs are injected (though arguably these could be stored in the persistent storage).
  6. Service Discovery / CloudMap is updated to point to this container.
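
Steps 2 and 4 map onto existing EBS APIs. A rough boto3 sketch of the capture/recreate cycle (the volume ID and AZ are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Step 2: capture on-disk state by snapshotting the old task's data volume.
snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # placeholder: old task's volume
    Description="state capture before task replacement",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Step 4: recreate the state as a fresh volume for the replacement task.
new_vol = ec2.create_volume(
    SnapshotId=snap["SnapshotId"],
    AvailabilityZone="us-east-1a",  # placeholder: new task's AZ
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[new_vol["VolumeId"]])
```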

Note that I'm clearly saying that I expect all the old/unhealthy containers to die before the new ones start up. In principle, the possibility to health-check the new container before switching over would be useful in the case of updates, but in reality, if the new container hasn't been provided with all the persistent state from the running container, how useful will the health check really be? You could check it with some dummy data, or with a snapshot of the data from the running container, but then you probably did that in dev already. The unique id requirement suggests that the particular application isn't capable of having multiple containers running and serving user requests at the same time, so it really seems that there's limited use in trying to health-check the new container. In the case of an update, if the update doesn't behave as expected, you could roll back by deploying a new container based on an old version of the image fairly easily anyway.

Note also that from reading the comments on this ticket, it does look as if there might be a couple of competing use-cases that might need handling separately? On the one hand, some people don't need any persistent storage, they just need a way to have a unique, persistent id injected into the running containers. Others however actually need persistent storage (in which case there's no real need to have an external mechanism to inject persistent IDs because you could just store it on disk). However, if all you want is the persistent id, then having to allocate/manage an entire volume just to store an id would probably be annoying, so there might not be a single solution for all the use-cases in this thread.

vibhav-ag commented 3 years ago

@acdha @tstibbs Thank you both for the detailed replies on the update/deployment behavior; it is really helpful as we think through the problems here. A couple of follow-ups for you. @acdha: For the ZooKeeper case, you seem to be suggesting some way to customize the number of Tasks being updated at a time? @tstibbs: I had one slightly tangential question based on your response: how important is the unique task-identifier per se? Is it primarily for service discovery? Would it be necessary even if we could find a way to, say, preserve the Task ENI?

tstibbs commented 3 years ago

> @tstibbs: I had one slightly tangential question based on your response: how important is the unique task-identifier per se? Is it primarily for service discovery? Would it be necessary even if we could find a way to, say, preserve the Task ENI?

@vibhav-ag I don't personally have a use-case for persistent ids; if Cloud Map could be automatically updated to point to the new container then that would suffice, and it doesn't sound like I'd need a persistent IP or ENI in that case. Other people on this thread do seem to have a use-case for persistent ids (e.g. https://github.com/aws/containers-roadmap/issues/127#issuecomment-723889319) but it's not obvious to me that preserving the IP or ENI would help very much in those cases either.

vibhav-ag commented 3 years ago

> @tstibbs: I had one slightly tangential question based on your response: how important is the unique task-identifier per se? Is it primarily for service discovery? Would it be necessary even if we could find a way to, say, preserve the Task ENI?
>
> @vibhav-ag I don't personally have a use-case for persistent ids; if Cloud Map could be automatically updated to point to the new container then that would suffice, and it doesn't sound like I'd need a persistent IP or ENI in that case. Other people on this thread do seem to have a use-case for persistent ids (e.g. #127 (comment)) but it's not obvious to me that preserving the IP or ENI would help very much in those cases either.

Thanks @tstibbs, I may have misphrased it. I meant to say: would a persistent IP address for the Task be a reasonable alternative to Cloud Map pointing to the same Task for your use case? We are trying to evaluate this approach as an alternative to the service discovery/Cloud Map approach. If not, it would be great if you could clarify why.

mpoindexter commented 3 years ago

At my company we run several stateful services on ECS (including Prometheus and the ELK stack). Here's how we do it:

acdha commented 3 years ago

> @acdha: For the ZooKeeper case, you seem to be suggesting some way to customize the number of Tasks being updated at a time?

Here's a little more background: I was working on an Apache Solr deployment. That uses a modest amount of storage, but since it takes about an hour to rebuild the index, I don't want to have to do that every time someone pushes a new task definition. It'd be really nice if there was a way to have that scenario handled — i.e. preserving a 1:1 set of EBS volumes when the containers are updated — and if that existed, it would also be really handy (but not critical for this application) to have what you asked about, so I could have, say, 7 nodes in a cluster and allow 2 at a time to be upgraded so there's always a quorum running. And if something goes wrong with an EBS volume and/or auto-scaling happens to occur, a task starting with a clean volume can sync the index from its peers rather than rebuilding from the source.
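
The replace-n-at-a-time part is expressible today with the service deployment configuration (the volume preservation is not). For example, with 7 nodes and at most 2 down at once:

```python
import boto3

ecs = boto3.client("ecs")

# With desiredCount=7, a 71% minimum keeps at least ceil(7 * 0.71) = 5 tasks
# healthy, so at most 2 are replaced at a time and quorum (4 of 7) survives.
ecs.update_service(
    cluster="search",   # example cluster/service names
    service="solr",
    deploymentConfiguration={
        "maximumPercent": 100,  # don't start extras before stopping old tasks
        "minimumHealthyPercent": 71,
    },
)
```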

pbecotte commented 3 years ago

It's probably not super helpful, but if you replicated the way StatefulSets work in Kubernetes, you'd probably solve 99% of use cases. There may be value in iterating from scratch vs starting from an existing thing, but at a minimum consider the things that are built using StatefulSets and persistent EBS volumes.

As an aside, it's still amazing to me that Kubernetes has built-in EBS support, while Amazon's own container service does not.

tstibbs commented 3 years ago

> Thanks @tstibbs, I may have misphrased it. I meant to say: would a persistent IP address for the Task be a reasonable alternative to Cloud Map pointing to the same Task for your use case? We are trying to evaluate this approach as an alternative to the service discovery/Cloud Map approach. If not, it would be great if you could clarify why.

@vibhav-ag understood, and agreed, I'm struggling to think of a case in normal update/deploy scenarios when that wouldn't be ok. Certainly it feels like native cloud map integration would be more flexible (for example it would allow for migrating a task between subnets) but I agree it's probably not a hard requirement (for my use-cases at least).

acdha commented 5 months ago

@vibhav-ag does #64 shipping mean this is closer to happening? We still have various services (use-cases above) where persistent volumes which can outlive a container deployment would be very useful.

vibhav-ag commented 5 months ago

@acdha Yes, this is actively under consideration. Are Apache ZooKeeper and Apache Solr still your primary use cases? If there are others, it would be helpful if you could share those too.

acdha commented 5 months ago

> @acdha Yes, this is actively under consideration. Are Apache ZooKeeper and Apache Solr still your primary use cases? If there are others, it would be helpful if you could share those too.

Those are my primary needs, but I've asked some colleagues to chime in as well. In normal usage, having a fractional deployment where we could be confident that only n% of the running containers would be replaced at a time (gated by health checks, etc.) would probably address most of the need, but that could still get ugly in some kind of DR scenario if all of the volumes were lost within a short time period.

dleavitt commented 3 weeks ago

I've got a use-case that somewhat resembles @tstibbs's, but simpler. We've got a semi-ephemeral database: a database attached to a container that doesn't need to be especially durable or available, but does need to be able to restore its contents after a restart/update of its task.

It's part of a "review apps" feature for web application development. For each feature branch in our project, we spin up a self-contained application environment that includes both the application code and supporting infrastructure, including a database. Each environment is a single service with a single task instance, each of which has a few containers including one for the database. The database contents are stored on a mounted ebs volume.

Ideally when we update the task, we'd detach the volume and reattach it to the new task.

I think this could also be accomplished by creating a snapshot of the volume and restoring it onto the new task's volume, but I'm not sure there's a good way to do that either.