cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Investigate Evicted Celery Pods in Staging #145

Open ben851 opened 1 year ago

ben851 commented 1 year ago

Describe the bug

In staging, Celery pods are being evicted without any stated reason. Looking at node health, the nodes appear to be heavily overcommitted on memory requests. We need to investigate the cause of the evictions and determine the best course of action.

The pods are being evicted due to disk pressure. I suspect this is related to Docker image caching on the OS disk of the node.

Message: Pod The node had condition: [DiskPressure].
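For reference, one way to see which nodes are currently reporting the condition is to filter a `kubectl describe nodes` dump. A minimal sketch, assuming the two-column `DiskPressure  True` layout of kubectl's Conditions table; it reads the dump on stdin so the filter itself can be exercised without a cluster:

```shell
# disk_pressure_nodes: scan a `kubectl describe nodes` dump on stdin and
# keep the Conditions rows where DiskPressure is True. The column layout
# ("DiskPressure" then its status) is an assumption about kubectl's output.
disk_pressure_nodes() {
  awk '$1 == "DiskPressure" && $2 == "True"'
}

# Usage against the cluster:
#   kubectl describe nodes | disk_pressure_nodes
```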

Bug Severity

SEV-2 Major

To Reproduce

Need to verify this wasn't a one-time occurrence. `kubectl get pods -n notification-canada-ca` in staging should not show any evicted pods.
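That check can be scripted. A minimal sketch that reads the pod listing on stdin, assuming kubectl's default column order (NAME READY STATUS RESTARTS AGE):

```shell
# evicted_pods: filter a `kubectl get pods` listing (read on stdin) down to
# the names of pods whose STATUS column reads "Evicted". Skips the header
# row; assumes the STATUS column is the third field.
evicted_pods() {
  awk 'NR > 1 && $3 == "Evicted" { print $1 }'
}

# Usage in staging:
#   kubectl get pods -n notification-canada-ca | evicted_pods
```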

Expected behavior

There should be no evicted pods

Impact

This clutters our K8s namespace and could also introduce performance issues.

QA

Monitor production and staging environments to make sure there are no more evicted pods for 1 week since Sept 7th.

jimleroyer commented 1 year ago

We have to do a bunch of 💩 in staging.

sastels commented 12 months ago

Nothing evicted lately, waiting so that we can get some data

ben851 commented 12 months ago

Reviewed the last week's worth of data in Grafana staging and it did not occur. Moving this to blocked for now; possibly back to the icebox?

ben851 commented 12 months ago

@ben851 to create an alert when pods are evicted.

ben851 commented 12 months ago

New PR opened to alert on evicted pod status
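The PR itself isn't shown in this thread, but the shape of such a check might look like the sketch below. This is a hypothetical helper, not the code from the PR; the real alert may well live in CloudWatch or Prometheus instead:

```shell
# check_evicted: read a `kubectl get pods --no-headers` listing on stdin and
# return non-zero when any pod is Evicted, so a cron job or monitoring agent
# can alert on the exit code. Hypothetical sketch, not the PR's actual code.
check_evicted() {
  local count
  count=$(awk '$3 == "Evicted" { n++ } END { print n + 0 }')
  if [ "$count" -gt 0 ]; then
    echo "ALERT: $count evicted pod(s)"
    return 1
  fi
  echo "OK: no evicted pods"
}

# Usage:
#   kubectl get pods -n notification-canada-ca --no-headers | check_evicted
```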

jimleroyer commented 11 months ago

Moving the task to icebox until we get notified with the new alert setup.

ben851 commented 11 months ago

Alarms triggered in production today. The pods were evicted due to node disk pressure.

We verified that the Fluent Bit logs are not taking up a huge amount of space, but the disks are getting full.

While we can't confirm it, we are fairly sure the image cache is taking up too much space and not clearing enough room.

We are going to increase the node disk size.
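To quantify the pressure on a node before and after the resize, a threshold check over `df -P` output can help. A sketch; the 85% figure is an illustrative threshold, not the kubelet's actual nodefs/imagefs eviction setting:

```shell
# disk_pressure_check: read `df -P <mount>` output on stdin and fail when
# disk usage crosses a threshold percentage. The kubelet evicts pods under
# DiskPressure based on its own eviction thresholds; 85% here is only an
# illustrative default.
disk_pressure_check() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR == 2 {
    use = $5 + 0  # strip the trailing "%" from the Capacity column
    if (use >= t) { print "DiskPressure risk: " use "% used"; exit 1 }
    print "OK: " use "% used"
  }'
}

# Usage on a node (e.g. via SSM or a debug pod):
#   df -P / | disk_pressure_check 85
```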

jimleroyer commented 11 months ago

Disk size is upgraded on the nodes but Terraform didn't recognize it, so the nodes will need to be recreated. Steve to drop a link here to the script for recreating the nodes. We need to test in the staging environment, especially to catch the 502 errors.

sastels commented 11 months ago

https://github.com/cds-snc/notification-terraform/pull/869

jimleroyer commented 11 months ago

Steve and Ben to test the new eviction script in staging environment.

sastels commented 10 months ago

need to do a release to get the secondary nodes in prod

sastels commented 10 months ago

secondary nodes now in prod

sastels commented 10 months ago

Application moved to secondary in prod this morning. Primary upgrade and move back scheduled for today.

ben851 commented 10 months ago

Primary nodes upgraded, need to delete secondary nodes

ben851 commented 10 months ago

Will release config to delete secondary nodes tomorrow.

ben851 commented 10 months ago

We will wait one more week to determine if any evicted pods show up.

jimleroyer commented 9 months ago

Checked production and there are no evicted pods so far.

ben851 commented 9 months ago

If no evicted pods by tomorrow, will move to done.