Closed by ldraney 5 months ago
Investigated this and was unable to find any evidence the API containers were actually exiting. There are no logs showing gunicorn restarting. There are no logs in ECS I could find showing that the containers were restarting. The exit status that Datadog is reporting for API containers comes and goes.
The Datadog agent is definitely having issues, though, and does need investigation. Possibly an upgrade to ddtrace.
Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @EvanParish @k-macmillan @kalbfled @ldraney @MackHalliday @mchlwellman
@ldraney to reach out to DD, update ticket with response info...will inform next steps re: working ticket.
Sent ddog a message; we'll see how they respond! I'll then update this ticket comment, and maybe add additional info, based on that response.
Got a response from DDog team:
Hi Lucas,
Can you confirm if your Agent deployment configuration has the following?
- Name: DD_PROCESS_AGENT_ENABLED Value: "true"
If the above is set, you will need to set the pidMode parameter to task in the task definition for us to properly execute according to our doc here.
Sincerely, Spencer Phillips | Solutions Engineer | Datadog | Business Hours: 10 a.m. to 10 p.m. ET, Monday to Friday
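For context, DD_PROCESS_AGENT_ENABLED lives in the Datadog container's definition inside our task definition. A sketch of what that looks like (the container name "datadog-agent" is an assumption, not copied from our actual task definition):

```json
{
    "containerDefinitions": [
        {
            "name": "datadog-agent",
            "environment": [
                {
                    "name": "DD_PROCESS_AGENT_ENABLED",
                    "value": "true"
                }
            ]
        }
    ]
}
```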
From the AWS documentation for PID mode:
If task is specified, all containers within the specified task share the same process namespace.
Which would need to be set up via the task definition here. Note that pidMode is a task-level parameter, so it belongs at the top level of the task definition, not inside containerDefinitions:

```json
{
    "pidMode": "task",
    "containerDefinitions": [
        {
            "name": "notification-api",
            "image": "{will-be-replaced-by-ci}",
            ...
        }
    ]
}
```
Okay, updated a branch and preparing to deploy. But first I need to do a little setup in Datadog so I can verify whether the containers are still failing before/after my deploy to dev.
I can't tell how Kyle got the "exited at" column in the above screenshot. Compared to what I see:
But the logs still clearly show exiting containers, which should be enough for now:
The errors kept coming, even after a successful deployment. They may have been from the other tasks, so I will try pidMode on those as well.
But first I want to make sure this is related to the DD_PROCESS_AGENT_ENABLED variable, as support suggested.
I'm deleting that variable for now and redeploying to see if the errors stop.
Then I'll try pidMode one more time. Getting visibility into our system processes in Fargate seems like a good idea, so I'll try to get that working while I'm here (not outside of scope; this should have been the original functionality). But if it doesn't work, we'll just consider the bug fixed for now, I suppose. Still thinking about it.
Successful deploy between 1:27-1:29 MST (celery finished deploying last), seeing outcome...
The logs did stop! Confirmed bug is related to DD_PROCESS_AGENT_ENABLED.
Next step: try re-adding this with pidMode: task. If that doesn't work, this is an appropriate time to continue the conversation with Datadog.
Successful deploy with pidMode: task, and the logs seem promising! You can see the errors have stopped for the last two deploys so far:
I'm going to give it a little time, and also see if we are getting process information in Datadog as the ddog documentation indicates we should.
We are indeed getting new process information:
Things are looking good! I'm going to adjust the other environments, check on these pages again to confirm containers have stopped exiting, and open the PR with appropriate links in the "tests" section
Having a conversation with Kyle about the risks of this feature. Perhaps more notably, I noticed that the Datadog container is marked as "essential", when it should actually be the notification-api container that is essential! That should be addressed ASAP, so that Datadog failures don't affect notification-api containers and tasks.
The bug seems to be resolved, but there is some mystery in these settings; Kyle wants a more robust investigation of the related configurations and their risks (pidMode, essential, and how our usage of PID may affect these settings), all related to application stability.
I'm gathering documentation and preparing for effective conversation with AWS and maybe Datadog support.
I've sent tickets to both Datadog support and AWS support.
I've also collected the most relevant documentation I could find.
I'll synthesize findings and report soon.
Another thought: testing a ddog container failure is also a potential option. I can look into that if the above investigation doesn't turn out to be sufficient.
This page has been the most useful
A useful and relevant TL;DR section, "Caveats of sharing a process ID namespace," is at the end: While a process ID namespace can now be shared by containers in the same ECS task on AWS Fargate, there are a few things to be aware of. In the context of the application-plus-sidecar ECS task defined previously: a process in the sidecar container can observe, stop, or restart a process in the application container; and when sharing a process namespace in an ECS task, a new pause process runs as PID 1 for the whole task.
So, that explains the pause task we see now. It also means that, as long as the Datadog container has no functionality that can stop the notification-api container (I would be greatly surprised if it did), notification-api should run smoothly regardless of the status of the Datadog container. To address your earlier point that things have crashed due to Datadog: that is more likely attributable to our ddtrace setup, and I'm in communication with the Datadog team on this. Fargate here behaves much like docker-compose: each container has its own isolated resources, so even if one container crashes, it will not affect the other container unless there is a dependency relationship (e.g. notification-api depending on datadog, which it doesn't).
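For illustration, an explicit dependency like that would be declared with dependsOn in the task definition. A sketch of what we deliberately do not have (container names are assumptions):

```json
{
    "containerDefinitions": [
        {
            "name": "notification-api",
            "essential": true,
            "dependsOn": [
                {
                    "containerName": "datadog-agent",
                    "condition": "START"
                }
            ]
        },
        {
            "name": "datadog-agent"
        }
    ]
}
```

With no such dependsOn entry, a crash in the datadog-agent container has no ordering or lifecycle effect on notification-api.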
I'm also asking Datadog about their guidance on essential, as their own guide declares essential on both containers in the task, which doesn't make sense to me.
From AWS customer support:
Setting pidMode to task would not cause all containers to fail by default. If, for example, you had a task with 3 containers with pidMode set to task where 2 of the tasks did something small like download a file and then exit. Having those containers exit would not automatically cause the whole task to fail even if they exited with a non-zero exit code.
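AWS's scenario can be sketched like this (a hypothetical task definition for illustration, not ours): the two helper containers download a file and exit, and because they are not essential, their exit does not fail the task even under pidMode: task.

```json
{
    "pidMode": "task",
    "containerDefinitions": [
        {
            "name": "app",
            "essential": true
        },
        {
            "name": "downloader-1",
            "essential": false
        },
        {
            "name": "downloader-2",
            "essential": false
        }
    ]
}
```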
AWS support is asking for more information on how we use PID, though they assume we are using it for graceful shutdown with ECS. Confirming this with Kyle. We may need to document our situation in this regard in our own team repo.
I see related PID settings in gunicorn_config.py
Most useful documentation so far: Announcing additional Linux controls for Amazon ECS tasks on AWS Fargate, which addresses some of our main concerns.
Notes from my latest convo with Kyle:
- pidMode with exec: will it be fine? In our task definition we want run_celery.sh. What is the impact of using exec with pidMode?
- Ask AWS support a second question: if I use pidMode, does my application still receive SIGTERM? We have found SIGTERM hasn't made it to our containers in the past.
- Move essential over to the API container; fix it in the PR.
Update: AWS support is running some tests related to PID and SIGTERM (related to my current ticket), and should get back to us early tomorrow.
I mentioned ddtrace, and so he plans to include that in some additional tests! (Currently we are not utilizing APM with Celery. I'm in discussion with Kyle and Michael on this, as Michael could use that data for his DynamoDB work.)
Also, Datadog responded to me about essential: TLDR: we should have it on both containers, as their guide suggests:
Hi Lucas,
We mention specifying both containers as essential as the application container could still start without the Agent but this would cause the loss of monitoring. Marking the Agent as essential within the same task definition guarantees that the Agent is also started as a sidecar to the application. AWS mentions that at least one container should be marked as essential but this does not preclude additional containers within the task from being marked as such.
Sincerely, Spencer Phillips | Solutions Engineer | Datadog
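Per that guidance, both containers would carry the essential flag in the task definition. A sketch (container names are assumptions):

```json
{
    "containerDefinitions": [
        {
            "name": "notification-api",
            "essential": true
        },
        {
            "name": "datadog-agent",
            "essential": true
        }
    ]
}
```

The trade-off: with both marked essential, an Agent failure stops the whole task, which is exactly the risk Kyle and I discussed above, so this remains a judgment call rather than a settled answer.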
I continue to wait for AWS's next response. Meanwhile, I'm engaging in some DevOps study.
I have been helping Mack for the last little bit. AWS support still hasn't responded. Sent them a message:
This is great! We're awaiting your response. I appreciate the details you can provide for our own documentation on this situation!
AWS responded. I've been reading through it for understanding, and have since started a deployment with my branch.
Preparing for a sync with Kyle on this
A final test to ensure Celery tasks get a graceful shutdown is to have them log that shutdown. Ran this test with Kyle by deploying pidMode on top of the previous pidMode deployment, and things are looking good:
I'm going to remove essential from my PR and open this tomorrow morning for review
Merged! Moving this to the QA lane.
I found the logs most helpful for development. The above log search query with "pool" and "shutdown" proves that pidMode didn't interfere with graceful shutdown. Searching for "exit" in the same dev-notification-api-log-group and dev-notification-api-datadog-log-group, with a timespan of a week, will show these errors coming and going as I experimented with deployments, and ultimately going away since today's deploys.
Also, the containers are clearly not exiting anymore on the containers page. If the containers were exiting, this would be apparent in more recent start times for some failing containers; but since they have all been the same since the last deployment, the fix appears to be working:
Value statement:
As a VA Notify Engineer
I want to stop our containers from constantly restarting
So that our system remains stable

As a Product Stakeholder
I want our service to be stable for our clients
So that we consistently provide the best service possible
Description
Our Datadog and notification-api containers are constantly exiting (see images).
Additional Info and Resources
This may be due to a task definition configuration or an issue with how the forwarder is configured. I would suggest looking into the container configuration and the latest image pushed to ECR first, before looking at the forwarder configuration.
Engineering Checklist
Steps to Reproduce
This is an active problem that can be seen here:
Impact/Urgency
This should be considered medium priority. There seems to be no impact to our Datadog data currently, but because our containers are actively exiting and this is a critical tool for our SRE work, it should be addressed sooner rather than later.
Expected Behavior
Additional Info & Resources