aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [Container image resolution]: Allow feature to be disabled (or make it opt-in) #2393

Open jakauppila opened 1 month ago

jakauppila commented 1 month ago

Tell us about your request

It was announced on 7/11/2024 that, for any Amazon ECS service created or updated after June 25, 2024, container image tags are resolved to the image digest, and that digest is used going forward to ensure software version consistency.

This change in behavior was not communicated, was not opt-in, and was not even gated behind a new Fargate platform version.

We relied on the previous behavior: application-defined task definitions pointed to centrally managed sidecar images with mutable tags, so that when a new version was pushed, every consuming task definition immediately started using it, without requiring a deployment by hundreds or thousands of applications.

Which service(s) is this request for?

ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We were leveraging the previous ability to point at mutable container image tags to roll out centrally managed sidecars without any action needed from our application developer customers.

Are you currently working around this issue?

To resolve the problem of failing applications, we had to restore to ECR the old container images whose SHA digests the tags had previously been resolved to; historically, we have purged the old image when we push the new one.

Additional context

What's New: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/
Blog Post: https://aws.amazon.com/blogs/containers/announcing-software-version-consistency-for-amazon-ecs-services/
Documentation: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html#deployment-container-image-stability

danielferraz-git commented 1 month ago

Hi,

I'd like to emphasize the importance of the requested feature to disable the new ECS image tag resolution behavior. This change has disrupted our deployment strategy, which relies on using the latest tag for blue-green and rolling updates.

The flexibility of using mutable tags allowed us to manage deployments without extra steps. This ECS change has increased our operational overhead, requiring additional deployment steps for every update.

I'd like to request an option to disable this new functionality at the service, cluster, or account level, allowing us to maintain our current deployment process.

danielferraz-git commented 1 month ago

For now, I believe the suggested workaround should be officially documented: #2402.

DevAssis commented 1 month ago

Great! It's good to know that.

pmcevoy commented 1 month ago

Got caught by this today: one of our tasks needed to restart due to memory overload and was eventually killed because the restart could not download a Datadog sidecar image that we reference in the TaskDefinition by floating tag. The tag had moved to a new version and the old image had been purged (we host copies of Datadog in our own ECR). I hate this new feature. I'm competent enough to use unique, build-server-assigned tags for containers that count, but when I decide to use a floating tag (e.g. based on SemVer), I understand that I may have small internal inconsistencies, and I accept them. At the very least, allow us to override this new default...

vibhav-ag commented 1 month ago

Cross-posting the message I posted on issue #2394.

Sorry for the late response on this thread. We're aware of the impact this change has had and apologize for the churn this rollout has created. We've been actively working through the set of issues that have been highlighted on this thread and have 2 updates to share:

1/ For customers who've been impacted by the lack of ability to see image tag information, we're working on a change that will bring back image tag information in the describe-tasks response, in the same format as was available prior to the release of version consistency (i.e., image:tag). An important thing to keep in mind here is that if you run docker ps on the host, you will see the image in the format image:tag, but docker inspect will return image:tag@digest.

2/ We're also working on adding a configuration in the container definition that will allow you to opt out of digest resolution for specific containers within your task. This should address both customers who want to completely opt out of digest resolution and customers who want to disable resolution for specific sidecar containers.

I'll be using this issue to share updates on the change to disable digest resolution for specific containers and issue #2394 for updates on the change to bring back image tag information. We're tracking both changes at high priority.

Once again, we regret the churn this change has caused you all. While we still believe version consistency is the right behavior for the vast majority of applications hosted on ECS, we fully acknowledge that we could have done a better job socializing these changes and addressing these issues before, rather than after, making the change.
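For anyone scripting against the two formats mentioned in this comment (image:tag from docker ps, image:tag@digest from docker inspect), here is a minimal Python sketch that splits such a reference into its parts. `split_image_ref` is a hypothetical helper written for illustration, not part of any AWS or Docker tooling, and it assumes the reference is well-formed.

```python
from typing import Optional, Tuple


def split_image_ref(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split 'name[:tag][@digest]' into (name, tag, digest).

    Hypothetical helper: assumes a well-formed reference and does no validation.
    """
    # Everything after the first '@' is the digest, e.g. 'sha256:...'.
    name_and_tag, _, digest = ref.partition("@")

    # A colon only marks a tag if it comes after the last '/';
    # an earlier colon belongs to a registry port such as 'localhost:5000'.
    last_slash = name_and_tag.rfind("/")
    colon = name_and_tag.rfind(":")
    if colon > last_slash:
        name, tag = name_and_tag[:colon], name_and_tag[colon + 1:]
    else:
        name, tag = name_and_tag, None

    return name, tag, digest or None
```

Comparing only the name and tag parts lets a script treat the docker ps and docker inspect forms of the same image as equal.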

matdelong commented 1 week ago

Could you please provide an estimate for when this work will be complete? I echo the feelings voiced in https://github.com/aws/containers-roadmap/issues/2394 that the "software version consistency" feature wasn't rolled out properly, and should be reverted until this new opt-in process is in place.

acdha commented 2 days ago

For anyone else who's been suffering downtime thanks to the ECS service regression described in this ticket & #2394, I tried to have support disable it for our accounts but found that did not work: SVC is still pushing services into SERVICE_TASK_START_IMPAIRED if they use things like Amazon X-Ray, CloudWatch, etc.

I ended up deploying a little bit of EventBridge + Lambda to avoid ECS-triggered downtime. An EventBridge rule triggers a Lambda on ECR push events for the repositories in question, and that Lambda calls ecs:UpdateService for each service using that container to force a new deployment, which resolves the tag to the current digest value. With the various work to manage IAM entities, least-privilege policies, etc., this seems like an unnecessary amount of work simply to get back to the level of reliability which ECS had from its launch until June.
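A rough sketch of what such a Lambda might look like, in Python with boto3. This is not the code from the comment above, only an illustration of the approach it describes. The `WATCH_MAP` environment variable, its JSON layout, and the helper names are all assumptions made for this example. The event fields (`action-type`, `result`, `repository-name`, `image-tag`) follow the "ECR Image Action" events that ECR sends to EventBridge.

```python
"""Hypothetical Lambda: force a new ECS deployment when a watched ECR tag is pushed."""
import json
import os


def services_for_push(event, watch_map):
    """Return the (cluster, service) pairs to redeploy for an ECR push event.

    watch_map is an assumed structure mapping "repository:tag" to a list of
    [cluster, service] pairs, e.g. {"datadog-agent:7": [["prod", "orders"]]}.
    """
    detail = event.get("detail", {})
    # Ignore anything that is not a successful image push.
    if detail.get("action-type") != "PUSH" or detail.get("result") != "SUCCESS":
        return []
    key = f'{detail.get("repository-name")}:{detail.get("image-tag")}'
    return [tuple(pair) for pair in watch_map.get(key, [])]


def redeploy(ecs_client, targets):
    """Force a new deployment of each service so ECS re-resolves the tag."""
    for cluster, service in targets:
        ecs_client.update_service(
            cluster=cluster, service=service, forceNewDeployment=True
        )
    return len(targets)


def lambda_handler(event, context):
    # boto3 is imported lazily so the helpers above can be tested without AWS.
    import boto3

    # WATCH_MAP is a hypothetical environment variable holding the JSON map.
    watch_map = json.loads(os.environ.get("WATCH_MAP", "{}"))
    count = redeploy(boto3.client("ecs"), services_for_push(event, watch_map))
    return {"redeployed": count}
```

The Lambda's role would need at least ecs:UpdateService on the listed services. The EventBridge rule would match source `aws.ecr` with detail-type `ECR Image Action` for the repositories in question.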