department-of-veterans-affairs / notification-api


BUG: datadog/notification-api Containers constantly exiting #1699

Closed ldraney closed 5 months ago

ldraney commented 5 months ago

Value statement: As a VA Notify Engineer, I want to stop our containers from constantly restarting, so that our system remains stable.

As a Product Stakeholder, I want our service to be stable for our clients, so that we consistently provide the best service possible.

Description

Our Datadog and notification-api containers are constantly exiting (see images).

Additional Info and Resources

This may be due to the task definition configuration or an issue with how the forwarder is configured. I would suggest looking into the container configuration and the latest image pushed to ECR before looking at the forwarder configuration.

Steps to Reproduce

This is an active problem that can be seen here:

image (14).png image (16).png image (15).png

Impact/Urgency

This should be considered medium priority. There seems to be no impact to our Datadog data currently, but because our containers are actively exiting and this is a critical tool for our SREs, it should be addressed sooner rather than later.


k-macmillan commented 5 months ago

Investigated this and was unable to find any evidence the API containers were actually exiting. There are no logs showing gunicorn restarting. There are no logs in ECS I could find showing that the containers were restarting. The exit status that Datadog is reporting for API containers comes and goes.

The Datadog agent is 100% having issues though, and does need investigation. Possibly an upgrade to ddtrace.

mjones-oddball commented 5 months ago

Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @EvanParish @k-macmillan @kalbfled @ldraney @MackHalliday @mchlwellman

npmartin-oddball commented 5 months ago

@ldraney to reach out to DD, update ticket with response info...will inform next steps re: working ticket.

ldraney commented 5 months ago

Sent ddog a message; we'll see how they respond! I'll then update this ticket with a comment, and maybe the Additional Info section, based on that response.

ldraney commented 5 months ago

Got a response from the DDog team:

Hi Lucas,

Can you confirm if your Agent deployment configuration has the following?

  • Name: DD_PROCESS_AGENT_ENABLED Value: "true"

If the above is set, you will need to set the pidMode parameter to task in the task definition for us to properly execute according to our doc here.

Sincerely, Spencer Phillips | Solutions Engineer | Datadog | Business Hours: 10 a.m. to 10 p.m. ET, Monday to Friday
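For context, the variable they're referring to lives in the Datadog agent container's environment in our task definition. A rough sketch of what that looks like (the container name, image tag, and surrounding structure here are illustrative, not copied from our actual task definition):

  "containerDefinitions": [
    {
      "name": "datadog-agent",
      "image": "datadog/agent:latest",
      "environment": [
        { "name": "DD_PROCESS_AGENT_ENABLED", "value": "true" }
      ]
    }
  ]

With that variable set, the agent tries to collect process data, which is why (per the email above) the task also needs pidMode set.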

ldraney commented 5 months ago

From the AWS documentation for PID mode:

If task is specified, all containers within the specified task share the same process namespace.

ldraney commented 5 months ago

Which would need to be set up via the task definition here:

  "containerDefinitions": [
    {
      "name": "notification-api",
      "image": "{will-be-replaced-by-ci}",
      "pidMode": "task",

ldraney commented 5 months ago

Okay, updated a branch and preparing to deploy. Before that, I need to do a little setup in Datadog to verify the containers are still failing BEFORE/AFTER my deploy to dev.

ldraney commented 5 months ago

I can't tell how Kyle got the "exited at" column in the above screenshot. Compared to what I see:

image.png
ldraney commented 5 months ago

But the logs still clearly show exiting containers, which should be enough for now:

image.png
ldraney commented 5 months ago

deploying to dev

ldraney commented 5 months ago

Had to fix the task definition based on the AWS docs.
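Per the AWS docs, pidMode is a task-level parameter rather than a container-level one, so it belongs at the top of the task definition instead of inside containerDefinitions. Roughly something like this (the family name and the agent container fields are illustrative placeholders, not our actual values):

  {
    "family": "dev-notification-api-task",
    "pidMode": "task",
    "containerDefinitions": [
      {
        "name": "notification-api",
        "image": "{will-be-replaced-by-ci}"
      },
      {
        "name": "datadog-agent",
        "image": "datadog/agent:latest"
      }
    ]
  }

Note that on Fargate the only supported value is task, and it requires platform version 1.4.0 or later.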

ldraney commented 5 months ago

The errors kept coming, even after a successful deployment. They may have been coming from the other tasks, so I will try pidMode on those as well.

But first I want to make sure this is related to the DD_PROCESS_AGENT_ENABLED variable, as support suggested.

I'm deleting that variable for now and redeploying to see if the errors stop.

Then I'll try pidMode one more time. Getting our system processes into Datadog on Fargate seems like a good idea, so I'll try to get that working while I'm here (not outside of scope; it should have been part of the original functionality). If it doesn't work, we'll just consider the bug fixed for now, I suppose. Still thinking about it.

ldraney commented 5 months ago

Successful deploy between 1:27-1:29 MST (celery finished deploying last). Watching the outcome...

ldraney commented 5 months ago

The error logs did stop! Confirmed the bug is related to DD_PROCESS_AGENT_ENABLED.

image.png

Next step: try re-adding this with pidMode: task. If that doesn't work, it's an appropriate time to continue the conversation with Datadog.

ldraney commented 5 months ago

Successful deploy with pidMode: task, and the logs seem promising! You can see the errors have stopped for the last two deploys so far:

image.png



I'm going to give it a little time, and also see if we are getting process information in Datadog as the ddog documentation indicates we should.

ldraney commented 5 months ago

We are indeed getting new process information:

Before

image.png

After

image.png
ldraney commented 5 months ago

Things are looking good! I'm going to adjust the other environments, check on these pages again to confirm the containers have stopped exiting, and open the PR with appropriate links in the "tests" section.

ldraney commented 5 months ago

Having a conversation with Kyle related to the risks of this feature. Perhaps more notably, I noticed that the datadog container is marked as "essential", where it should actually be the notification-api container that is essential! That should be addressed asap, so that datadog failures don't affect notification-api containers and tasks.

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#ContainerDefinition-essential
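Roughly, the essential change I have in mind would look like this in the container definitions (trimmed to just the relevant fields; values are illustrative, not our actual task definition):

  "containerDefinitions": [
    {
      "name": "notification-api",
      "essential": true
    },
    {
      "name": "datadog-agent",
      "essential": false
    }
  ]

Per the AWS doc above, when a container marked essential stops or fails, all other containers in the task are stopped, which is why having the agent marked essential can take the API down with it.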

ldraney commented 5 months ago

The bug seems to be resolved, but there is some mystery in these settings. Kyle wants a more robust investigation into the related configurations and their risks -- pidMode, essential, and how our usage of PIDs may affect these settings -- all related to application stability.

I'm gathering documentation and preparing for effective conversation with AWS and maybe Datadog support.

ldraney commented 5 months ago

I've sent both a Datadog support ticket and an AWS support ticket.

I've also collected the most relevant documentation I could find.

I'll synthesize findings and report soon.

Another thought: testing a ddog container failure is also a potential option. I can look into that if the above investigation doesn't turn out to be sufficient.

ldraney commented 5 months ago

This page has been the most useful.

A useful and relevant TL;DR section, "Caveats of sharing a process ID namespace," is at the end: "While a process ID namespace can now be shared by containers in the same ECS task on AWS Fargate, there are a few things to be aware of. We'll walk through those caveats in the context of the application and sidecar ECS task defined previously: A process in the sidecar container can observe, stop, or restart a process in the application container. When sharing a process ID namespace in an ECS task, a new pause process runs as PID 1 for the whole task."

So, that explains the pause process we see now. It also means that, as long as the Datadog container has no functionality that can stop the notification-api container (I would be greatly surprised if it did), notification-api should run smoothly regardless of the status of the agent container. To address the earlier point that things have crashed due to Datadog: that should be attributed more to our ddtrace setup, and I'm in communication with the Datadog team on this. Fargate here is essentially running something like a docker-compose setup: each container has its own isolated resources, so even if one container crashes, it will not affect the other container unless there is a dependency relationship (e.g. notification-api depending upon the Datadog agent, which it doesn't).
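For clarity on what a "dependency relationship" would look like: in ECS it has to be declared explicitly on a container definition with dependsOn, roughly like this hypothetical example (we don't have anything like this in our task definition):

  "containerDefinitions": [
    {
      "name": "notification-api",
      "dependsOn": [
        { "containerName": "datadog-agent", "condition": "START" }
      ]
    }
  ]

Since nothing like this is declared, an agent crash shouldn't take the API container down with it.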

I'm also asking Datadog about their guidance on essential, as their own guide declares essential on both containers in the task, which doesn't make sense to me.

ldraney commented 5 months ago

From AWS customer support:

Setting pidMode to task would not cause all containers to fail by default. If, for example, you had a task with 3 containers with pidMode set to task, where 2 of the containers did something small like download a file and then exit, having those containers exit would not automatically cause the whole task to fail, even if they exited with a non-zero exit code.

ldraney commented 5 months ago

AWS support is asking for more information on how we use PIDs, though they assume we are using them for graceful shutdown with ECS. Confirming this with Kyle. We may need to document our situation in this regard in our own team repo.

ldraney commented 5 months ago

I see related PID settings in gunicorn_config.py

ldraney commented 5 months ago

Most useful documentation so far: Announcing additional Linux controls for Amazon ECS tasks on AWS Fargate, which addresses some of our main concerns.

ldraney commented 5 months ago

Notes from my latest convo with Kyle:

  • pidMode with exec: will it be fine? In our task definition we want run_celery.sh; what is the impact of using exec with pidMode?
  • Ask AWS support a second question: if I use pidMode, does my application still receive the SIGTERM? We have found SIGTERM hasn't made it to our containers in the past.
  • Move essential over to the API container; fix it in the PR.

ldraney commented 5 months ago

Update: AWS support is running some tests related to PIDs and SIGTERM (related to my current ticket), and should get back to us early tomorrow.

I mentioned ddtrace, so he plans to include that in some additional tests! (Currently we are not utilizing APM with Celery. I'm in discussion with Kyle and Michael on this, as Michael could use that data for his DynamoDB work.)

ldraney commented 5 months ago

Also, Datadog responded to me about essential. TL;DR: we should have it on both containers, as their guide suggests:

Hi Lucas,

We mention specifying both containers as essential as the application container could still start without the Agent but this would cause the loss of monitoring. Marking the Agent as essential within the same task definition guarantees that the Agent is also started as a sidecar to the application. AWS mentions that at least one container should be marked as essential but this does not preclude additional containers within the task from being marked as such.

Sincerely, Spencer Phillips | Solutions Engineer | Datadog

ldraney commented 5 months ago

I continue to wait for AWS's next response. Meanwhile, I'm engaging in some DevOps study.

ldraney commented 5 months ago

I have been helping Mack for the last little bit. AWS support still hasn't responded. Sent them a message:

This is great! We're awaiting your response. I appreciate the details you can provide for our own documentation on this situation!

ldraney commented 5 months ago

AWS responded. I've been reading through their response for understanding, and have since started a deployment with my branch.

ldraney commented 5 months ago

successful deployment as of 2:25 MST. Checking logs

ldraney commented 5 months ago

so far looking good

image.png

Preparing for a sync with Kyle on this

ldraney commented 5 months ago

A final test to ensure Celery tasks are getting a graceful shutdown is to have them log that shutdown. Ran this test with Kyle by deploying pidMode on top of the previous pidMode deployment, and things are looking good:

image.png
ldraney commented 5 months ago

I'm going to remove essential from my PR and open this tomorrow morning for review

ldraney commented 5 months ago

Merged! Moving this to the QA lane.

ldraney commented 5 months ago

I found the logs most helpful for development. The above log search query for pool and shutdown proves that pidMode didn't interfere with graceful shutdown. Searching for exit in the same dev-notification-api-log-group and dev-notification-api-datadog-log-group, with a timespan of a week, will show these errors coming and going as I messed with deployments, and ultimately going away since today's deploys.

Also, the containers are clearly not exiting anymore on the containers page. If the containers were exiting, it would be apparent in more recent start times for some failing containers, but since the start times are all unchanged since the last deployment, the fix appears to be working:

image.png