aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

Issues with "software version consistency" feature #2394

Open gilad-yadgar opened 2 months ago

gilad-yadgar commented 2 months ago

EDIT: this is related to the "software version consistency" feature launch, see What's New post: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/

Summary

Since our EC2 instances upgraded to ecs-agent v1.83.0, the images used for containers are reported with a SHA digest instead of the image tag.

Description

We started getting a different image value for the '{{.Config.Image}}' property when using docker inspect on our ECS EC2 instances: we now get the SHA digest as .Config.Image instead of the image tag. The task definition contains the correct image tag (not the digest).

We need the image tag, since we rely on that custom tag to understand what was deployed. What can be done?
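
To illustrate (the account, region, repo, tag, digest, and container ID below are placeholders, not our actual values):

$ docker inspect --format '{{.Config.Image}}' <container-id>
# before agent v1.83.0 we would see the tag reference, roughly:
<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
# since agent v1.83.0 we instead see a digest reference, roughly:
<account>.dkr.ecr.<region>.amazonaws.com/<repo>@sha256:<digest>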

Expected Behavior

We expect to see the image tag used for the container.

Observed Behavior

We get the image digest used for the container.

Environment Details

dg-nvm commented 2 months ago

same

scott-vh commented 2 months ago

FWIW, today I encountered a production incident (we updated to ecs-agent 1.83.0 roughly 2 weeks ago) where I saw a subset of our ECS tasks fail to start with:

CannotPullContainerError: failed to tag image '012345678910.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>@sha256:<digest>' 

This was a surprising error to see, given that the only change on our side that we can attribute it to is the agent version upgrade 🤷. It feels similar enough to be worth a mention, given the digest in the error message.

This seemed to be isolated to a small fraction of our cluster instances (all running 1.83.0), and tasks from the same task revisions that were yielding the error eventually phased in without intervention.


I've also noticed that https://github.com/aws/amazon-ecs-agent/pull/4181 intends to augment these kinds of errors with some more useful context and made it into agent release 1.84.0, so I'll report back if/when we upgrade and whether that yields anything of use 👍

EDIT: didn't touch the 1.84.0 upgrade after seeing this comment

tomdaly commented 2 months ago

This has also caused production issues for my org. We use the ImageName value available in the ECS container metadata file at runtime, as we tag our ECR images with the Git commit SHA. This is then used for a variety of things in different services, such as sourcing assets, tracking deploys, etc.

Since 1.83.0, ImageName is sometimes present as the SHA digest instead of the image name; we expected the digest to be within ImageID and not ImageName.
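
For anyone unfamiliar with the metadata file: it is written by the agent when ECS_ENABLE_CONTAINER_METADATA is set, and its path is exposed to the container via the ECS_CONTAINER_METADATA_FILE environment variable. A rough sketch of the lookup we rely on (repository and tag values are placeholders):

$ jq '{ImageName, ImageID}' "$ECS_CONTAINER_METADATA_FILE"
{
  "ImageName": "<account>.dkr.ecr.<region>.amazonaws.com/my-service:<git-commit-sha>",
  "ImageID": "sha256:<digest>"
}

Since 1.83.0, that ImageName value sometimes comes back as a digest reference rather than the tag, which is what breaks this kind of lookup.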

panny-P commented 2 months ago

I still see this error on ecs-agent 1.84.0.

mvanholsteijn commented 2 months ago

We have production issues with the change too, when the tag is re-used for a new image layer and the old image is deleted.

timdarbydotnet commented 2 months ago

I'm also seeing the issue where a newly pushed and tagged "latest" image is ignored and the agent will only use the older untagged instance. This needs to be fixed ASAP, or at least give us a workaround. I'm seeing this behavior on agent 1.83.0; it was not happening on 1.82.1.

turacma commented 2 months ago

We are also seeing this issue in our environment. It doesn't seem to happen with all images. FWIW, on the same container instance we can see some containers with tags and others without, and when a container does have a tag, it's the first launched container.

turacma commented 2 months ago

FWIW, this also impacts the ECS APIs, specifically describe-tasks

https://www.reddit.com/r/aws/comments/1dtgc4b/mismatching_image_uris_tag_vs_sha256_in_listtasks/

Unclear if the source of truth (and the root cause) is the agent or the APIs themselves, but just thought it's worth noting this.

joelcox22 commented 2 months ago

Found this issue after an internal investigation of an incident that seems likely related to this. If it helps anyone else, here's my analysis of how this impacted a service that was referencing an ECR image by a persistent image tag that we were regularly rebuilding and overwriting, with automation in place to delete the older untagged images:

I have an open support case with AWS to confirm this behaviour, and have included a link to this github issue.

sequenceDiagram
participant jenkins as Jenkins
participant cloudformation as Cloudformation
participant ecs-service as ECS Service
participant ec2-instances as EC2 Instances
participant ecr-registry as ECR Registry
participant docker-base-images as Docker Base Images<br />firelens sidecar image
participant ecr-lifecycle-policy as ECR Lifecycle Policy
jenkins ->> cloudformation: regular deployment
cloudformation ->> ecs-service: creates a new "deployment" for the service
activate ecs-service
note right of ecs-service: ECS resolves the image hash<br />at time of "deployment" creation
ecs-service ->> ec2-instances: starts tasks with resolved image hashes
ec2-instances ->> ecr-registry: pulls latest image from ECR
docker-base-images ->> ecr-registry: rebuild and push image regularly
ecr-lifecycle-policy ->> ecr-registry: deletes older images periodically
note right of ecs-service: periodically, new tasks need to start
ecs-service ->> ec2-instances: starts tasks with previously resolved image hashes
ec2-instances ->> ecr-registry: attempts to run the same image hash from earlier<br />if the image already exists on the instance, it's fine<br />otherwise, it needs to pull from ECR again and may fail
ec2-instances ->> ecs-service: tasks fail to launch due to missing image
note right of ecs-service: at this point, the service is unstable<br />might have existing running tasks<br /> but it can't launch new ones
create actor incident as Incident responders
ecs-service ->> incident: begin investigation
note left of incident: "didn't this happen the other day<br />for another service?" *checks slack*
note left of incident: Yeah, it did happen, and the outcome<br />was that we disabled the ECR lifecycle<br />policy, but services were left with<br />the potential to fail when tasks cycle
incident ->> jenkins: trigger replay of latest production deployment early and hope that fixes the issue
jenkins ->> cloudformation: deploy
cloudformation ->> incident: "there are no changes in the template"
incident ->> jenkins: disable the sidecar to get the service up and running again quickly and buy more time for investigation
jenkins ->> cloudformation: deploy with sidecar disabled
deactivate ecs-service
cloudformation ->> ecs-service: create new deployment without sidecar
activate ecs-service
note right of ecs-service: no longer cares about firelens sidecar image
ecs-service ->> ec2-instances: starts new tasks
ec2-instances ->> ecs-service: success
ecs-service ->> incident: service is up and running again, everyone is happy
note left of incident: "but we're not done yet"
incident ->> jenkins: re-enable the sidecar
jenkins ->> cloudformation: deploy with sidecar enabled
deactivate ecs-service
cloudformation ->> ecs-service: create new deployment with sidecar
activate ecs-service
note right of ecs-service: ECS resolves the image hash<br />at time of "deployment" creation
ecs-service ->> ec2-instances: start new tasks
ec2-instances ->> ecr-registry: pulls new images with updated hash
ec2-instances ->> ecs-service: success
ecs-service ->> incident: service is stable again
note left of incident: This service looks good again now<br />but other services might still have a problem
deactivate ecs-service
incident ->> ecs-service: work through "Force New Deployment" for all services in all ecs clusters & accounts
note left of incident: all services are now expected to be<br />stable, as everything should be<br />referencing the latest firelens image<br />hash, and the lifecycle policy<br />to delete older ones is disabled

L3n41c commented 2 months ago

This issue most probably comes from aws/amazon-ecs-agent#4177 merged in 1.83.0:

Expedited reporting of container image manifest digests to ECS backend. This change makes Agent resolve container image manifest digests for container images prior to image pulls by either calling image registries or inspecting local images depending on the host state and Agent configuration. Resolved digests will be reported to ECS backend using an additional SubmitTaskStateChange API call

sjmisterm commented 2 months ago

Downgrading to 1.82.4 in our case does not make the issue go away, indicating that, even if it was related to the agent, the digest information is now somehow cached by ECS. We are currently using a DAEMON ECS service.

According to a recent case opened with AWS support, "ECS now tracks the digest of each image for every service deployment of an ECS service revision. This allows ECS to ensure that for every task used in the service, either in the initial deployment, or later as part of a scale-up operation, the exact same set of container images are used." They added this is part of a rollout that started in the last few days of June and is supposed to complete by Monday.

Their suggested solution is to update the ECS service with "Force new deployment" to "invalidate" the cache. If you have AWS support, try to open a case including this information to see how they evaluate your issue.
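
For reference, the "Force new deployment" workaround they suggested maps to a single CLI call (cluster and service names are placeholders):

$ aws ecs update-service --cluster <cluster> --service <service> --force-new-deployment

This creates a new service deployment, which re-resolves image digests from the currently tagged images (per the support response above).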

joelcox22 commented 2 months ago

I got a similar response to @sjmisterm in my support case, confirming the new behaviour is expected, and stating that we should no longer delete the images from ECR until we're certain that the images are no longer in use by any deployment.

This change effectively means that ECR lifecycle policies which delete untagged images are now expected to cause outages, unless additional steps are taken immediately after every image deletion to ensure that every deployment referencing a mutable tag is redeployed. This is particularly problematic for my specific use-case, where we were referencing a mutable tag for a sidecar container that we include in many services.
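
For context, the kind of lifecycle policy in question is the common "expire untagged images" rule; a generic example is below (the retention window is arbitrary). With version consistency, such a rule can delete digests that live deployments still reference:

{
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire untagged images after 14 days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": 14
            },
            "action": {
                "type": "expire"
            }
        }
    ]
}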

I've asked if there is any future roadmap plans to make this use-case easier to manage, and requested for a comment from AWS on this github issue 😄

... https://xkcd.com/1172/

sjmisterm commented 2 months ago

AWS has confirmed this is definitely caused by them, and they think this is a good feature, as the links below (made available yesterday) show:

https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/
https://aws.amazon.com/blogs/containers/announcing-software-version-consistency-for-amazon-ecs-services/
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html#deployment-container-image-stability

There's no way to turn off this new behaviour, which completely breaks the easiest workflow for blue-green deployments - I'm sure tons of people have other cases that need or benefit from the old one.

I suggest that all who have AWS support file a case and request an API to turn this off by service / cluster / account.

amogh09 commented 2 months ago

Hello. I am from AWS ECS Agent team.

As shared by @sjmisterm above, the behavior change that customers are seeing is because of the recently released Software Version Consistency feature. The feature guarantees that the same images are used for a service deployment by recording the image manifest digests reported by the first launched task and then overriding tags with digests for all subsequent tasks of the service deployment.

Currently there is no way to turn off this feature. ECS Agent v1.83.0 included a change to expedite the reporting of image manifest digests but older Agent versions also report digests and ECS backend will override tags with digests in both cases. We are actively working on solutions to fix the regressions our customers are facing due to this feature.

amogh09 commented 2 months ago

One of the patches we are considering is - instead of overriding :tag with @sha256:digest, we would override it with :tag@sha256:digest so that the lost information is added back to the image references.

sjmisterm commented 2 months ago

@amogh09 , I can't see how this would address the blue-green scenario. Could you explain it, please?

amogh09 commented 2 months ago

There's no way to turn off this new behaviour, which completely breaks the easiest workflow for blue-green deployments

@sjmisterm Can you please share more details on how this change is breaking blue-green deployments for you?

sjmisterm commented 2 months ago

@amogh09 , sure.

Our blue-green deployments work by deploying a new image to the ECR repo tagged with latest and then launching a new EC2 instance (from the ECS-optimized image, properly configured for the cluster) while we make sure the new version works as expected in production. Then, we start to progressively drain the old tasks until only new tasks are available.

sjmisterm commented 2 months ago

@amogh09 in summary: the software version "inconsistency" is what makes blue green a breeze with ECS. Should we want consistency, we'd use a digest or a version tag.

amogh09 commented 2 months ago

@sjmisterm The deployment unit for an ECS service is a TaskSet. The Software Version Consistency feature guarantees image consistency at the TaskSet level. In your case, how do you get a new task to be placed on the new EC2 instance? The new task needs to be part of a new TaskSet to get the newer image version. If it belongs to the existing TaskSet, then it will use the same image version as its TaskSet.

ECS supports blue-green deployments natively at the service level if the service is behind an Application Load Balancer. You can also use the External deployment type for even greater control over the deployment process. The Software Version Consistency feature is compatible with both of these.

timdarbydotnet commented 2 months ago

@amogh09 I use a network load balancer and the LDAP container instances I'm running will not respond well to this new model. If I can't maintain the ability to pull the tagged latest image, I will have to stop using ECS and manage my own EC2s, which would be painful frankly.

Looking at the ECS API, what would happen if I called DeregisterTaskDefinition and then RegisterTaskDefinition? Would that have the effect of forcing ECS to resolve the digest from the new latest image without killing the running tasks?

sjmisterm commented 2 months ago

@amogh09 , I think we're talking about different things. Until the ECS change, launching a new ECS instance properly configured for an ECS daemon service whose taskdef is tagged with :latest would launch the new task with, well, the image tagged latest. Now it launches it using the digest resolved by the first task, unless you force a new deployment of your service.

Our deployment scripts pre-date CodeDeploy and the other features, so all your suggestions require rewriting deployment code because of a feature we can't simply opt out of.

amogh09 commented 2 months ago

I understand the frustrations you all are sharing regarding this change. I request you to contact AWS Support for your issues. Our support team will be able to assist you with workarounds relevant to your specific setups.

sjmisterm commented 2 months ago

@amogh09 , a simple API flag at the service / cluster / region / account level would solve the problem. That's what we're trying to get across, because it disturbs your customer base - not everyone pays for support, and the old behaviour, as you can see, is relied on by many of them.

mpoindexter commented 2 months ago

I'll chime in that we were negatively impacted by this change as well, and I don't think it helps anything for most scenarios.

Before, customers effectively had a choice: they could either enforce software version consistency by using immutable tags (https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html), or if they wanted to allow for a rolling release (most useful for daemon services as @sjmisterm alluded to) they could achieve that as well by using a mutable tag.

Now this option is gone with nothing to replace it, and with very poor notification that it was going to happen, to boot.
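
For reference, the pre-existing opt-in mechanism is ECR tag immutability, a per-repository setting (repository name is a placeholder):

$ aws ecr put-image-tag-mutability --repository-name <repo> --image-tag-mutability IMMUTABLE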

timdarbydotnet commented 2 months ago

I'm very disappointed with AWS on two counts:

scott-vh commented 1 month ago

I know that the circumstances around how we all got notified about this change aren't ideal, but is there anywhere we can be proactive and follow along for similar updates that may affect us in the future? Did folks get a mention from their AWS technical account managers or similar?

I lurk around the containers roadmap fairly often, but don't see an issue/mention there or in any other publicly-facing aws github project around this feature release.

dg-nvm commented 1 month ago

@scott-vh the problem is that this is an internal API change; the ECS backend behaves differently now. It has nothing to do with the ecs-agent itself: regardless of the version, you will get the same behaviour. No one could have seen it coming.

scott-vh commented 1 month ago

@dg-nvm Yep, I got that 👍 I was just curious if there was any breadcrumb anywhere else from which we could've seen this coming (sounds like no, but I wanted to see if anyone who interfaces with TAMs or ECS engineers through higher tiers of support got some notice).

dg-nvm commented 1 month ago

@scott-vh our TAM was informed about the problem, but idk if there was any proposal. Given that I see ideas for workarounds accumulating, I would say no :D Luckily our CD was not impacted by this. I can think of scenarios where daemon deployments are easier using mutable tags, especially since ECS does not play nicely when replacing daemons. Sometimes they get stuck because they were removed from the host and something else was put in their place in the meantime :)

turacma commented 1 month ago

@amogh09 one thing that I think is particularly jarring in my case is that the results at runtime are inconsistent. The first container that launches for a given deployment gets a value in the ImageName field of the ECS metadata file that reflects the tag associated with the image in ECR, but all subsequent launches see the @sha256 style format.

We've been using the value of ImageName as part of an identifier in our observability platform to differentiate deployments (we prefer this to task definition numbers as they may not match across deployment environments), so now we end up seeing 2 different values in our metrics and traces because there isn't a consistent value.

We can obviously work around this for our use case, but what exactly is the reasoning for this inconsistency?

amogh09 commented 1 month ago

@turacma The inconsistency in ImageName between the first container and subsequent containers is because, for the first container, the image name provided to the Agent is what's in the task definition. It's the image manifest digest reported by the Agent for this first container that is used to override the tag with the digest for subsequent container launches.

izeau commented 1 month ago

Is there any way to get the image from the container definition (previously visible in ImageName) anywhere in the DescribeTasks results? In a new attribute, maybe? Seems like it would help a lot.

mvanholsteijn commented 1 month ago

@izeau I cannot answer your question, but you can resolve this by adding the digest in your task definition yourself. In that case:

{
    "taskDefinitionArn": "arn:aws:ecs:eu-central-1:123456789012:task-definition/paas-monitor:2",
    "containerDefinitions": [
        {
            "name": "paas-monitor",
            "image": "mvanholsteijn/paas-monitor:3.1.0"

becomes:

{
    "taskDefinitionArn": "arn:aws:ecs:eu-central-1:123456789012:task-definition/paas-monitor:2",
    "containerDefinitions": [
        {
            "name": "paas-monitor",
            "image": "mvanholsteijn/paas-monitor:3.1.0@sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3"

This puts you back in control and you can see which version of the image is deployed.

The utility cru can resolve the digest and update your task definition, while preserving the tag as a human readable name:

$ cru update --verbose  --resolve-digest --image-reference mvanholsteijn/paas-monitor:3.1.0  paas-monitor.json
2024/07/19 10:55:05 resolving repository mvanholsteijn/paas-monitor Tag 3.1.0 to Digest sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3
2024/07/19 10:55:05 INFO: updating reference mvanholsteijn/paas-monitor:3.1.0 to mvanholsteijn/paas-monitor:3.1.0@sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3 in paas-monitor.json
2024/07/19 10:55:05 INFO: updated a total of 1 files
2024/07/19 10:55:05 INFO: no commit message, skipping commit and push

If you want to add a digest to all the image references in your task definitions, type:

$ cru update --all --matching-tag --resolve-digest .
2024/07/19 11:03:05 INFO: collecting all container references
2024/07/19 11:03:05 INFO: 1 image references found
2024/07/19 11:03:05 resolving repository mvanholsteijn/paas-monitor Tag 3.1.0 to Digest sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3
2024/07/19 11:03:05 INFO: mvanholsteijn/paas-monitor:3.1.0@sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3 already up-to-date in paas-monitor.json
2024/07/19 11:03:05 INFO: no files were updated by cru

Cru will also update the image reference when the tag has been replaced:


% docker tag mvanholsteijn/paas-monitor:3.1.0 mvanholsteijn/paas-monitor:latest
% docker push mvanholsteijn/paas-monitor:latest 

 % cru update --verbose --all --matching-tag --resolve-digest .         

2024/07/19 12:02:51 INFO: collecting all container references
2024/07/19 12:02:51 INFO: 1 image references found
2024/07/19 12:02:51 resolving repository mvanholsteijn/paas-monitor Tag latest to Digest sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3
2024/07/19 12:02:51 INFO: updating reference mvanholsteijn/paas-monitor:latest to mvanholsteijn/paas-monitor:latest@sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3 in paas-monitor.json
2024/07/19 12:02:51 INFO: updated a total of 1 files
2024/07/19 12:02:51 INFO: no commit message, skipping commit and push

% docker tag mvanholsteijn/paas-monitor:3.4.0 mvanholsteijn/paas-monitor:latest
% docker push mvanholsteijn/paas-monitor:latest 

% cru update --verbose --all --matching-tag --resolve-digest .                 
2024/07/19 12:03:14 INFO: collecting all container references
2024/07/19 12:03:14 INFO: 1 image references found
2024/07/19 12:03:15 resolving repository mvanholsteijn/paas-monitor Tag latest to Digest sha256:fdcfbed7e0a7beb8738e00fe8961c8e33e17bdeee94eab52cb8b85de1d04d024
2024/07/19 12:03:15 INFO: updating reference mvanholsteijn/paas-monitor:latest@sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3 to mvanholsteijn/paas-monitor:latest@sha256:fdcfbed7e0a7beb8738e00fe8961c8e33e17bdeee94eab52cb8b85de1d04d024 in paas-monitor.json
2024/07/19 12:03:15 INFO: updated a total of 1 files

mvanholsteijn commented 1 month ago

@izeau it looks like the original image reference is included in the task:

% aws ecs describe-tasks --tasks $(aws ecs list-tasks --cluster default --query 'join(`\n`, taskArns[*])' --output text )
{
    "tasks": [
        {
           ...
            "containers": [
                {
                    "containerArn": "arn:aws:ecs:eu-central-1:123456789012:container/default/4fbe5116d73044b88756214721d0b981/085e40c4-b4df-490f-b1b4-9cce38c32cd7",
                    "taskArn": "arn:aws:ecs:eu-central-1:123456789012:task/default/4fbe5116d73044b88756214721d0b981",
                    "name": "paas-monitor",
                    "image": "mvanholsteijn/paas-monitor:3.1.0@sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3",
                    "imageDigest": "sha256:c0717cab955aff0a3d2f6bb975808ba9708d8385bcf01a18e23ff436f07c1fb3",
                    "runtimeId": "4fbe5116d73044b88756214721d0b981-1836980517",
                    "lastStatus": "RUNNING",

izeau commented 1 month ago

@mvanholsteijn I'm not sure I understand; my use case is to find out whether all tasks in a set of tasks are running, e.g., v1.2.3. What I previously did was set this as the image tag in the container definitions, then use the DescribeTasks operation with the list of task identifiers and look at the image property to ensure it referenced the version I wanted.

Now I need to first fetch the SHA256 for the tag before checking tasks, and check the imageDigest property instead since the image one is not always the hash.
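
Concretely, the extra step now looks roughly like this (repository, cluster, task ARNs, and version tag are placeholders):

$ DIGEST=$(aws ecr describe-images --repository-name <repo> \
    --image-ids imageTag=v1.2.3 \
    --query 'imageDetails[0].imageDigest' --output text)
$ aws ecs describe-tasks --cluster <cluster> --tasks <task-arns> \
    --query 'tasks[].containers[].imageDigest' --output text
# then compare each returned digest against $DIGEST instead of matching on the tag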

This is an extra step I would like to avoid, especially since I don't mutate tags.

It is also a step backwards for our monitoring processes since we also need to work in the other direction to find out what version is behind which hash. I will not set the version in the image itself because we build images first, then promote them with a semver tag when releasing, without building them again.

I'm left with environment variables or docker labels, both of which are not only leaky abstractions but also not returned in the DescribeTasks response.


This change was unannounced, broke workflows, and could have been avoided by using the existing imageDigest field instead of updating the image one -- or using a new, dedicated immutableImage one maybe? Either way, now that this is done, an additional originalImage or something would be great.

mvanholsteijn commented 1 month ago

@izeau it is indeed sad that this change broke the interface and your monitoring processes.

sparrc commented 1 month ago

Hi all, I have transferred this issue into the containers-roadmap repo. As far as I understand it, people are experiencing issues with this feature as a whole, rather than an issue with the ECS agent behavior specifically. For reference, see the What's New post: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/

Please feel free to continue adding your +1 and providing feedback :)

simonulbrich commented 1 month ago

This issue is affecting us as well. We utilize an initialization container that runs before the app container. This init container sets up monitoring integrations and settings that are not critical to the app itself, but with a limited team we rely on the mutable tags to handle the "rolling" update as tasks are restarted. To force an application deployment for every application that my team manages, just for these small config updates, would be an impossible task. Is there any way at all to prevent this "consistency" feature for a single container, or to disable it entirely at the task level?

It seems like this problem was already solved with tag immutability, which gave us the option to use mutable tags if we actually needed that behavior.

acdha commented 1 month ago

This regression caused a minor production outage for us because AWS' monitoring tools like X-Ray recommend using mutable tags, which means that if any of those images has a release outside of your deployment cycle, you are now set up to have all future tasks fail to deploy because you followed the AWS documentation:

I think this feature was a mistake and should be reverted. There are better ways to accomplish that goal which do not break previously stable systems, and immutable tags are not suitable for every situation, as evidenced by the way the AWS teams above are using them. But if the goal is to get rid of mutable tags, it should follow a responsible deprecation cycle: customer notification, warnings when creating new task definitions, some period where new clusters cannot be created with the option to use mutable tags in tasks, etc. This is a disruptive change which breaks systems that have been stable for years, and there isn't a justification for breaking compatibility so rapidly.

vat-gatepost-BARQUE commented 1 month ago

We are also having an issue with this. Our development environment is set up to have all the services on a certain tag, which keeps us from having to redeploy: they can simply stop the service and it comes back up with the most current image with that tag. Now they have to update the service, which is more steps than needed. This also seems to be a problem with our Lambdas that spin up Fargate tasks; those tasks are no longer the most current version of the tag. Updating the service is not an option for these, so we are still trying to work that out.

mvanholsteijn commented 1 month ago

The strangest thing is that the feature was already available for those who wanted it: you can specify the container image with a digest, which pins the image explicitly. No code changes to ECS were required.

floating (potentially inconsistent) -> my-cool-image:latest
fixed -> my-cool-image:latest@sha256:fdcfbed7e0a7beb8738e00fe8961c8e33e17bdeee94eab52cb8b85de1d04d024

tomkins commented 1 month ago

Also had an issue with one of our sites which I believe is related to this: a container pulling from an ECR repository with a lifecycle policy, an EC2 instance restarts, and ECS wants to pull the non-existent old image because there hasn't been a fresh deploy of the container for weeks.

Version consistency is a fantastic feature, but there are situations where I want the tag to be used rather than the image digest resolved at the last deploy.

vibhav-ag commented 1 month ago

Sorry for the late response on this thread. We're aware of the impact this change has had and apologize for the churn this rollout has created. We've been actively working through the set of issues that have been highlighted on this thread and have two updates to share:

  1. For customers who've been impacted by the lack of ability to see image tag information, we're working on a change that will bring back image tag information in the describe-tasks response, in the same format as was available prior to the release of version consistency (i.e. image:tag). An important thing to keep in mind here is that if you run docker ps on the host, you will see the image in the format image:tag, but docker inspect will return image:tag@digest.
  2. We're also working on adding a configuration in the container definition that will allow you to opt out of digest resolution for specific containers within your task. This should address both customers who want to completely opt out of digest resolution and customers who want to disable resolution for specific sidecar containers.

I'll be using this issue to share updates on the change to bring back image tag information in describe-tasks, and issue #2393 for the change to disable digest resolution for specific containers. We're tracking both changes at high priority.

Once again, we regret the churn this change has caused you all. While we still believe version consistency is the right behavior for the majority of applications hosted on ECS, we fully acknowledge that we could have done a better job socializing these changes and addressing these issues before, rather than after, making the change.

nitrotm commented 3 weeks ago

I can concur that this "software version consistency" change to ECS renders the concept of services totally useless for us. We may have to fall back to manually deployed tasks (without services), but then we'll lose the watchdog aspects, which we would have to re-implement ourselves.

In short, we need to guarantee a few properties on our services running background jobs:

  1. A task cannot be stopped automatically within a deterministic time-frame. Therefore, we internally flag the task to stop accepting new jobs and let it complete its currently assigned job. Only then, when the task is idle, do we stop it, and we relied on a nice property of ECS: it would automatically fetch the last container image associated with the tag (e.g. 'latest').
  2. Some of our services need to have at least N tasks up and running at all times (guaranteeing some kind of always-on property).
  3. Some services are dynamically adjusted via auto-scaling groups due to the highly variable nature of the demand.

These, combined with the new constraint that all tasks within a service need to use the same image digest, mean that we cannot roll out any update to our containers without breaking at least one property.

Tbh this feels like we may want to switch to a plain k8s solution where we can set up and manage our workloads with some degree of flexibility. Hopefully an opt-out solution will be available soon, as mentioned above, but we are stuck with our deployments atm and need a solution asap.

peterjroberts commented 3 weeks ago

The forced addition of this feature also caused a significant production incident for us. We deliberately used mutable tags as part of our deployments, and an ECR lifecycle policy to remove the old untagged images after a period.

This should absolutely have been an opt-in feature, or opt-out but disabled for existing services. I'm glad to see that's now been identified and raised, but shouldn't this feature be reverted until that option is available, to prevent everyone affected from having to redesign workflows or implement workarounds?

As has been pointed out by others, those who want consistency by container digest can already achieve it through either tag immutability or referring to the digest explicitly in the task definition.