schm closed this issue 1 year ago
Uh oh, this looks like a new bug for us. I'll look into reproducing; we're hearing from multiple customers with ECR-related problems.
In the meantime, could you share a few things? In particular, what would be useful are:
StackSet-${APPNAME}-infrastructure-${UUID}/${UUID}. This is where we actually store the ECR repositories; they're deployed once per region in the tools account (the account where you originally ran copilot app init or copilot init). This will help us greatly in debugging.
This would be the template of the stack set
I cannot share a cloudformation template in this case, as this deployment fails before even creating the template.
I think I see what's happening here. It looks like there's only one ECR repo that's being created after app upgrade. I think this comes from a bug in our code which opts out of ECR repo creation for static site patterns. But I will need to narrow things down to be sure.
In the meantime, while we work to fix it, is it possible for you to work around this by creating an ECR repo outside of copilot management, building and pushing to it manually, and specifying it in image.location in your manifests? I realize this is a lot of work and it's totally our fault that you're blocked, but I don't know if we have a good story for downgrading an app.
I will update this issue with details as I work through it.
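As a rough illustration of the manual workaround described above, the sketch below builds the URI of an image in a self-managed ECR repository and renders the manifest fragment that would point image.location at it. The helper names and all account, region, repository, and tag values are placeholders, not details from this thread; the repository itself would still need to be created and the image built and pushed separately.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build an image URI in the standard ECR form registry/repository:tag."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def manifest_image_section(image_uri: str) -> str:
    """Render a manifest fragment that sets image.location to an existing image."""
    return f"image:\n  location: {image_uri}\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    uri = ecr_image_uri("123456789012", "eu-central-1", "my-app/my-service", "v1")
    print(manifest_image_section(uri), end="")
```

With image.location set, the manifest refers to an image that already exists, so the deployment should not depend on the Copilot-managed repository for that workload.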
@bvtujo Hey, I wonder how you are handling the priority of issues. Are there any rules for that?
@schm In the meantime you can try this customer's workaround; they seem to have the same problem as you where the ECR repos got deleted improperly.
Same issue here without running app upgrade: we are currently unable to deploy new versions of our applications in production without deleting the copilot job and recreating it.
We tried with both version 1.28.0 and 1.26.0.
@bvtujo this is a major issue for us, we are really worried that this issue is 2 weeks old with no significant updates
Edit: just to clarify, the ECR repository exists when the error appears. You have to delete it manually after copilot job delete.
The workaround posted is not applicable to Scheduled jobs.
Hello @acamb.
we are really worried that this issue is 2 weeks old with no significant updates
This only happens when trying to use an older version of Copilot to update a Copilot application that was last updated by a newer version of Copilot. We are working on an enhancement that prevents users from doing that, which avoids any template downgrade. Sorry again for the inconvenience and please let us know if you are still worried about this.
Hi @iamhopaul123, I'm not sure this happens only if you use an older version of Copilot: I use v1.28 and I haven't switched back to any older version (I used 1.26 only to test it while reporting the issue). Maybe the issue is triggered also by deploying a new version when the older one was created with an older version of Copilot?
Please let me know if there is a better workaround than deleting the job with copilot delete.
Thanks, Andrea
Maybe the issue is triggered also by deploying a new version when the older one was created with an older version of Copilot?
I tested creating and deploying something with v1.26, then switched to v1.28 to create and deploy a new job, and then did job run, but there doesn't seem to be any backward-incompatible issue.
Please let me know if there is a better workaround than deleting the job with copilot delete.
One workaround, I think, would be:
1. copilot job init again to add the job to the application again (this should recreate the ECR repo)
2. copilot job deploy/run
@iamhopaul123 it's strange because I'm using v1.28 and today I got this problem 3 times (without touching the manifest).
In one case the deploy failed and in the other two the state machine failed with the error "failed to normalize image reference ..." when running the job (issue #5032). They were jobs that I hadn't touched for weeks/months, and the previous task version was likely deployed with an older Copilot version.
On the jobs where I did copilot job delete etc., the following deploys and runs are going fine.
Monday morning I will try the workaround you suggested.
@iamhopaul123 the workaround only avoids the error while running copilot job init after copilot job delete, but after deploying and running the ECS task I'm still getting the error:
InternalError: failed to create container model: failed to normalize image reference [...]
If I instead manually delete the ECR repository, the job runs fine after another cycle of delete-init-deploy.
@iamhopaul123 I can confirm that the issue is happening also with scheduled jobs created with Copilot v1.28.0 and re-deployed with the same version.
@acamb Hello! I'm sorry that you are still facing the issues :(
it's strange because I'm using v1.28 and today I got this problem 3 times (without touching the manifest). In one case the deploy failed and in the other two the state machine failed with the error "failed to normalize image reference ..." when running the job (issue https://github.com/aws/copilot-cli/issues/5032).
You mentioned that "in one case the deploy failed". Do you happen to know what the error message was? I think you could still find the record in the CloudFormation console (or the AWS CLI, whichever you prefer): go to the stack's "Events" tab and locate the event with an UPDATE_FAILED state. I'm hoping to get more clues by knowing this error message.
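For anyone who prefers to script this lookup rather than click through the console, here is a minimal sketch that filters a list of stack events down to the failed ones. It assumes the events are dictionaries shaped like the StackEvents entries returned by the CloudFormation DescribeStackEvents API (for example via boto3's describe_stack_events); the function name and the sample events are illustrative, not real output from this thread.

```python
from typing import Any


def failed_update_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the events whose ResourceStatus is UPDATE_FAILED."""
    return [e for e in events if e.get("ResourceStatus") == "UPDATE_FAILED"]


if __name__ == "__main__":
    # Made-up sample events for illustration only.
    sample = [
        {"LogicalResourceId": "TaskDefinition",
         "ResourceStatus": "UPDATE_FAILED",
         "ResourceStatusReason": "example failure reason"},
        {"LogicalResourceId": "Service",
         "ResourceStatus": "UPDATE_COMPLETE"},
    ]
    for event in failed_update_events(sample):
        # ResourceStatusReason carries the error message when one is present.
        print(event["LogicalResourceId"], event.get("ResourceStatusReason", ""))
```

The ResourceStatusReason field of the matching event is where the error message being asked for would normally appear.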
In addition, can you confirm the value of the ContainerImage parameter in your job's stack? Is it something like ": fae9f246" instead of "
Hello @Lou1415926. When the deploy fails we see an error like:
- Updating the infrastructure for stack tech-staging-oreo-dl [update rollback complete] [15.3s]
The following resource(s) failed to update: [TaskDefinition].
- An ECS service to run and maintain your tasks in the environment cluster [not started]
- An ECS task definition to group your containers and run them on ECS [delete complete] [0.0s]
Resource handler returned message: "Invalid request provided: Create TaskDefinition: Container.image repository should not be null or empty. (Service: AmazonECS; Status Code: 400; Error Code: ClientException; Request ID: abc12f74-a49a-42f9-ac85-418debf2f7b2; Proxy: null)" (RequestToken: 1d5e884d-bb98-74b6-fddb-8f2bc2265329, HandlerErrorCode: InvalidRequest)
For the other case (issue #5032) I can confirm that the ContainerImage in the task definition is in the format ":xxxx" without the ".dkr.ecr...." prefix.
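The symptom described here, an image reference that is only a tag with no repository in front of it, can be checked mechanically. The sketch below is a simplified check written for illustration; the function names are hypothetical, it is not how Copilot or ECS parse image references, and it does not handle digest references specially.

```python
def split_image_reference(ref: str) -> tuple[str, str]:
    """Split an image reference into (repository, tag).

    The tag separator is taken to be a colon that appears after the last
    slash, so a registry port such as localhost:5000/img is not a tag.
    """
    ref = ref.strip()
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        return ref[:colon], ref[colon + 1:]
    return ref, ""


def has_repository(ref: str) -> bool:
    """Return True when the reference has a non-empty repository part."""
    repository, _tag = split_image_reference(ref)
    return bool(repository.strip())


if __name__ == "__main__":
    # A tag-only value like the one reported in this thread has no repository.
    print(has_repository(":fae9f246"))
    # Placeholder account, region, and repository values for comparison.
    print(has_repository("123456789012.dkr.ecr.us-east-1.amazonaws.com/app/job:fae9f246"))
```

A value that fails this check matches the "repository should not be null or empty" error shown earlier in the thread.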
The enhancement that prevents version downgrades has been released in v1.29.0! https://github.com/aws/copilot-cli/releases/tag/v1.29.0
@huanjani Thanks for the update.
Is this a server-side check or is this built into the CLI? I.e., will this now block clients < 1.29 from interacting with my updated app? Or will this check only work in the future for all clients >= 1.29 (e.g. blocking a 1.29 client from accessing a 1.30 app)?
Hello @schm.
Is this a server-side check or is this built into the CLI?
It is built into the CLI.
will this now block clients < 1.29 from interacting with my updated app? Or will this check only work in the future for all clients >= 1.29 (e.g. blocking a 1.29 client from accessing a 1.30 app)?
I think "blocking a 1.29 client from accessing a 1.30 app" is the correct statement (if by "client" you meant the Copilot CLI), so your 1.29 client won't be able to accidentally downgrade your 1.30 app (however, this can be overridden by passing the --allow-downgrade flag).
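To make the behaviour described above concrete, here is a minimal sketch of a client-side downgrade guard. It is an illustration under stated assumptions, not Copilot's actual implementation: the function names are hypothetical, versions are assumed to be plain dotted numbers with an optional leading "v", and the allow_downgrade argument stands in for the --allow-downgrade flag mentioned in this thread.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a string such as 'v1.29.0' into a comparable tuple (1, 29, 0)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def would_downgrade(cli_version: str, app_version: str) -> bool:
    """True when the CLI is older than the version that last updated the app."""
    return parse_version(cli_version) < parse_version(app_version)


def may_deploy(cli_version: str, app_version: str, allow_downgrade: bool = False) -> bool:
    """Block a deploy that would downgrade the app unless explicitly allowed."""
    return allow_downgrade or not would_downgrade(cli_version, app_version)


if __name__ == "__main__":
    print(may_deploy("v1.29.0", "v1.30.0"))                        # older CLI, newer app
    print(may_deploy("v1.29.0", "v1.30.0", allow_downgrade=True))  # explicit override
    print(may_deploy("v1.30.0", "v1.29.0"))                        # newer CLI, older app
```

Because the comparison runs in the client, a CLI released before the check existed would not perform it, which is consistent with the answer that the check is built into the CLI.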
That's very good to know. Thanks for addressing this issue.
For me this issue is resolved right now, as we know what was causing the problems and how we can avoid them in the future. Therefore I'm going to close it, even though we didn't find a good way to fix services affected by this problem other than completely deleting and recreating them.
Thanks again for your support. That's much appreciated.
Hey,
last week we upgraded our main copilot app by running app upgrade. Since then we're running into strange issues when redeploying any kind of service in the same app.
We created the app with copilot 1.22.0. We've been using the latest release of the copilot-cli for each deployment, and we only triggered app upgrade last week in order to use static sites.
For App Runner based services we see the following error:
For ECS based services we see this error:
So both errors seem to be related to ECR.
What we found out already:
app upgrade. but for others the next deployment directly failed.
I would love to get any feedback on how we can further debug this issue as this is blocking our teams. I'll happily provide more information, if you tell me which.