codedeploy-agent gets stuck on the wrong deployment ID, incorrectly reports missing scripts

gh-andre commented 1 year ago

CodeDeploy agent keeps trying to access an invalid application revision and fails to find scripts to run in application events. The original configuration where this occurred is a CodePipeline instance, but now when I create a deployment just for this application and deployment group, it keeps selecting the same bad deployment revision.

It's happening on Ubuntu 18.04. Not sure how to obtain codedeploy-agent version, as it doesn't implement --version. The closest I can see is in the logs:

 Version file found in /opt/codedeploy-agent/.version with agent version OFFICIAL_1.4.1-2244_deb
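Since the agent has no --version flag, one workaround is to read the version file named in that log line directly. A minimal sketch, assuming the file is plain text at the path shown in the log; `read_agent_version` is a hypothetical helper, not part of the agent:

```python
from pathlib import Path

# Path taken from the agent log line above; adjust if your install differs.
VERSION_FILE = Path("/opt/codedeploy-agent/.version")


def read_agent_version(version_file: Path = VERSION_FILE) -> str:
    """Return the raw contents of the agent's version file, stripped of whitespace."""
    return version_file.read_text(encoding="utf-8").strip()


if __name__ == "__main__":
    try:
        print(read_agent_version())
    except OSError as err:
        # The file may be missing or unreadable without elevated permissions.
        print(f"could not read {VERSION_FILE}: {err}")
```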

The error I'm getting is this:

[Screenshot: aws-codedeploy-bad-script-ref]

Notice that the deployment ID following the deployment group ID is different from the actual deployment shown in details. This is the problem - that other deployment ID isn't even for this application, but another application that was deployed to the same VM at a different time.

Checking this deployment, it is properly picking up correct build artifacts, which do contain the script that it reports to be missing:

[Screenshot: aws-codedeploy-bad-script-ref-2]

The revision location shown in details does contain the script that is reported as missing, so codedeploy-agent is definitely trying to refer to a bogus deployment ID when it is trying to locate the application revision.

Looking at the codedeploy-agent logs, it just reports this error that is visible in UI and no other useful information. In verbose mode, it references the correct application revision and, on the next log line after this, is trying to open appspec.yml for the bogus deployment ID reported in the error.

I also noticed that, for manually created deployments, the deployment ID referenced in the revision details in the screenshot above can differ from the current deployment ID shown in the same screenshot. It seems to refer to the deployment that was active when CodePipeline was running and received the notification from the CodeBuild stage. I mention this because it seems plausible that this is where the original deployment ID mix-up happened.

t0shiii commented 1 year ago

Can you provide a clear set of steps for reproduction?

gh-andre commented 1 year ago

It happened for me with a large project, so it's hard for me to reproduce this in a smaller setup. I do realize that it doesn't make it easier for you and will understand if you want to close this issue as unreproducible.

Having said that, I think the problem is in how codedeploy-agent handles concurrent deployments on the same VM. Let me describe the setup in case you want to try it out locally.

I have 3 source repositories used by a CodePipeline instance, say A, B and C. All 3 source repositories are pulled into source/build stages as parallel stage actions, so they are being built concurrently.

Thinking that codedeploy-agent would be able to deploy one set of artifacts at a time from each of these build projects, I configured the deployment stage with 3 parallel actions as well for A, B and C, all against the same deployment target (3 deployment apps/groups select the same VM). Deployment scripts for each of these do not intersect in any way and can run simultaneously on the same VM.

This is what I think broke codedeploy-agent - it's as if it uses some shared intermediate directory for deployments and the component that picks up artifacts appears to do it while another deployment is in progress. I do understand the directory structure under deployment-root and think there's some other shared directory.

Once I realized that CodeDeploy wasn't meant to work in this configuration, I restructured the deployment stage as a series of sequential action groups, but it appears that something was damaged under /opt/codedeploy-agent for one of the deployment groups in that B was getting artifacts for A.

I had to delete all three deployment applications, along with their deployment groups, and create new ones to make sequential action groups work in that deployment stage. This step fixed the broken deployment stage with sequential action groups, and the pipeline now works as intended.

fleaz commented 1 year ago

Hey Andre,

I suspect what you are seeing is not a bug in the codedeploy-agent but rather the (unexpected) default behavior of CodeDeploy itself: the "ApplicationStop" lifecycle hook is always run from the previous deployment. That's why you are seeing two different deployment IDs in your screenshot.

The reason for this is that ApplicationStop runs before DownloadBundle. The stop script from the previous deployment is therefore used to stop the application; then the new artifact is downloaded and the actual new deployment begins.
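To make the ordering concrete, here is a small sketch of the in-place lifecycle event order and which revision is on disk when each event runs. The function name `revision_used_for` is made up for illustration; the agent does not expose anything like it.

```python
# Lifecycle events for an in-place deployment, in the order they run.
# DownloadBundle and Install are performed by the agent itself; the others
# are hooks that can run scripts listed in appspec.yml.
LIFECYCLE_EVENTS = [
    "ApplicationStop",
    "DownloadBundle",
    "BeforeInstall",
    "Install",
    "AfterInstall",
    "ApplicationStart",
    "ValidateService",
]


def revision_used_for(event: str) -> str:
    """Return which revision is available on the instance when `event` runs."""
    if event not in LIFECYCLE_EVENTS:
        raise ValueError(f"unknown lifecycle event: {event}")
    download = LIFECYCLE_EVENTS.index("DownloadBundle")
    # Anything that runs before DownloadBundle can only see the previous revision.
    return "previous" if LIFECYCLE_EVENTS.index(event) < download else "new"


if __name__ == "__main__":
    for name in LIFECYCLE_EVENTS:
        print(f"{name}: {revision_used_for(name)} revision")
```

Only ApplicationStop falls before the download, which is why it is the one hook that can end up pointing at a different deployment ID.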

EDIT: Just saw the "that other deployment ID isn't even for this application, but another application that was deployed to the same VM at a different time" part. I'm not sure if the agent even has an understanding of "applications" or just has a list of deployment IDs for this instance and then just blindly runs the application stop from the previous deployment. We once also had a setup with multiple CodeDeploy deployments to the same VM but got rid of it because this caused so many headaches.

gh-andre commented 1 year ago

@fleaz Thank you for the insights about stopping the application. It does make sense to use the script from the previous deployment, not only because it's done before the next one is downloaded, but even more so because the stopping script in the incoming app may have changed and won't work for stopping the previous app.

> I'm not sure if the agent even has an understanding of "applications" or just has a list of deployment IDs for this instance and then just blindly runs the application stop from the previous deployment.

It appears that CodeDeploy is aware of applications. I am now deploying applications A, B and C one at a time instead of in parallel, as I initially set it up. If CodeDeploy weren't aware of applications, it would run the stop script for C the next time A was deployed, but it appears to be running the correct script for C.

From what I was observing, it is as if CodeDeploy uses some shared area for all deployments and when multiple deployments ran at the same time, one deployment stomped on another.

Perhaps folks testing CodeDeploy could set up 3 applications to run in parallel and see if it works. Not saying it should, but CodePipeline docs should probably highlight that parallel actions against the same VM should not be set up.

fleaz commented 1 year ago

> From what I was observing, it is as if CodeDeploy uses some shared area for all deployments

That indeed sounds like your problem. They all get saved to /opt/codedeploy-agent/deployment-root/<random-id>/.... There is a folder for every deployment on this machine.
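If you want to see what is currently stored there, something like the following sketch lists the deployment directories per top-level folder. It assumes a layout of deployment-root/<id>/<deployment-id>/ and that deployment directories are named after deployment IDs (starting with "d-"); `deployments_by_group` is a hypothetical helper.

```python
from pathlib import Path

DEPLOYMENT_ROOT = Path("/opt/codedeploy-agent/deployment-root")


def deployments_by_group(root: Path = DEPLOYMENT_ROOT) -> dict[str, list[str]]:
    """Map each top-level directory to the deployment-like subdirectories inside it."""
    result: dict[str, list[str]] = {}
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumption: deployment directories are named after deployment IDs ("d-...").
        ids = sorted(
            child.name
            for child in group_dir.iterdir()
            if child.is_dir() and child.name.startswith("d-")
        )
        if ids:
            result[group_dir.name] = ids
    return result


if __name__ == "__main__":
    if DEPLOYMENT_ROOT.is_dir():
        for group, ids in deployments_by_group().items():
            print(group, ids)
    else:
        print(f"{DEPLOYMENT_ROOT} not found")
```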

Your CodeDeploy agent config (probably saved at /etc/codedeploy-agent/conf/codedeployagent.yml) has a :max_revisions: setting (see the docs). The default is just five revisions. So if you deploy three different applications to the same host, keeping just five revisions won't retain both the previous and the latest revision for all three applications (3 apps * 2 revisions = 6), which would cause the problem you describe. And that is the best-case scenario, with all of them being deployed at the same time. Deploy C five times, and all your old revisions for A and B are gone...
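The arithmetic can be checked with a toy simulation. This sketch models the scenario described above, a single pool shared by all applications that keeps only the newest revisions; that shared-pool model is an assumption about the agent, not confirmed behavior, and `surviving_revisions` is a made-up name.

```python
from collections import deque


def surviving_revisions(deploy_order, max_revisions=5):
    """Simulate one shared pool that keeps only the newest `max_revisions` revisions.

    `deploy_order` is a sequence of application names, one entry per deployment.
    Returns {app: number of revisions still on disk}.
    """
    pool = deque(maxlen=max_revisions)  # oldest entries fall off the left
    for app in deploy_order:
        pool.append(app)
    counts = {}
    for app in pool:
        counts[app] = counts.get(app, 0) + 1
    return counts


if __name__ == "__main__":
    # Two rounds of A, B, C: six revisions compete for five slots.
    print(surviving_revisions(["A", "B", "C", "A", "B", "C"]))
    # One round, then C four more times: A and B are pushed out entirely.
    print(surviving_revisions(["A", "B", "C", "C", "C", "C", "C"]))
```

In the first case A is left with a single revision, so its previous one is gone; in the second only C's revisions remain.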

I would try to increase the :max_revisions:, if you have enough storage on the machine. If your problem is gone, we have found your problem :)
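Before changing it, you can check what the setting currently is. A rough sketch that pulls the value out of the config text with a regular expression, assuming the key is written on its own line as `:max_revisions: <number>`; `read_max_revisions` is a hypothetical helper, and a missing key is reported as None rather than guessed.

```python
import re
from pathlib import Path

# Path as mentioned above; it may differ on your system.
CONFIG_FILE = Path("/etc/codedeploy-agent/conf/codedeployagent.yml")


def read_max_revisions(config_text: str):
    """Return the integer after ':max_revisions:' or None if the key is absent."""
    match = re.search(
        r"^\s*:max_revisions:\s*(\d+)\s*$", config_text, flags=re.MULTILINE
    )
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    try:
        print(read_max_revisions(CONFIG_FILE.read_text(encoding="utf-8")))
    except OSError as err:
        print(f"could not read {CONFIG_FILE}: {err}")
```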

gh-andre commented 1 year ago

@fleaz

> The default is just five revisions.

Thanks for the heads-up. My app A doesn't have a stop script, which is probably why I didn't notice the limit of 5 kicking in. Very useful to know because I was just about to introduce a stopping script to that app.

> If your problem is gone, we have found your problem

The problem was actually running deployments for A, B and C as parallel actions within their deployment stage, which I remedied by restructuring the stage to run actions sequentially (I had to delete and recreate all apps; CodeDeploy kept the state, and just changing the stage didn't work).

In other words, these are two separate issues: multiple CodeDeploy actions running in parallel vs. the limit on the maximum number of deployments being tracked. The former messed things up so much that I had to recreate a bunch of configuration to get it working again. The latter is something that can be managed via configuration, as you pointed out.

Thanks again for your insights. Much appreciated.

aws / aws-codedeploy-agent

codedeploy-agent gets stuck on the wrong deployment ID, incorrectly reports missing scripts #348