cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Investigate Evicted Celery Pods in Staging #145

Open ben851 opened 1 year ago

ben851 commented 1 year ago

Describe the bug

In staging, Celery pods are being evicted without any stated reason. Looking at node health, the nodes appear to be heavily overcommitted on memory requests. We need to investigate the cause of the evictions and determine the best course of action.

The pods are being evicted due to disk pressure. I suspect this is related to Docker image caching on the OS disk of the node.

Message: Pod The node had condition: [DiskPressure].
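For reference, one way to see which nodes are currently reporting the condition is to filter a `kubectl describe nodes` dump. A minimal sketch, assuming the two-column `DiskPressure  True` layout of kubectl's Conditions table; it reads the dump on stdin so the filter itself can be exercised without a cluster:

```shell
# disk_pressure_nodes: scan a `kubectl describe nodes` dump on stdin and
# keep the Conditions rows where DiskPressure is True. The column layout
# ("DiskPressure" then its status) is an assumption about kubectl's output.
disk_pressure_nodes() {
  awk '$1 == "DiskPressure" && $2 == "True"'
}

# Usage against the cluster:
#   kubectl describe nodes | disk_pressure_nodes
```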

Bug Severity

SEV-2 Major

To Reproduce

Need to verify this wasn't a one-time occurrence. `kubectl get pods -n notification-canada-ca` in staging should not show any evicted pods.
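That check can be scripted. A minimal sketch that reads the pod listing on stdin, assuming kubectl's default column order (NAME READY STATUS RESTARTS AGE):

```shell
# evicted_pods: filter a `kubectl get pods` listing (read on stdin) down to
# the names of pods whose STATUS column reads "Evicted". Skips the header
# row; assumes the STATUS column is the third field.
evicted_pods() {
  awk 'NR > 1 && $3 == "Evicted" { print $1 }'
}

# Usage in staging:
#   kubectl get pods -n notification-canada-ca | evicted_pods
```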

Expected behavior

There should be no evicted pods

Impact

This clutters our K8s namespace and could also introduce performance issues.

QA

Monitor production and staging environments to make sure there are no more evicted pods for 1 week since Sept 7th.

jimleroyer commented 1 year ago

We have to do a bunch of 💩 in staging.

sastels commented 12 months ago

Nothing evicted lately, waiting so that we can get some data

ben851 commented 12 months ago

Reviewed the last week's worth of data in Grafana staging and it did not occur. Moving this to blocked for now; possibly back to the icebox?

ben851 commented 12 months ago

@ben851 to create an alert when pods are evicted.

ben851 commented 12 months ago

New PR opened to alert on evicted pod status
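The PR itself isn't shown in this thread, but the shape of such a check might look like the sketch below. This is a hypothetical helper, not the code from the PR; the real alert may well live in CloudWatch or Prometheus instead:

```shell
# check_evicted: read a `kubectl get pods --no-headers` listing on stdin and
# return non-zero when any pod is Evicted, so a cron job or monitoring agent
# can alert on the exit code. Hypothetical sketch, not the PR's actual code.
check_evicted() {
  local count
  count=$(awk '$3 == "Evicted" { n++ } END { print n + 0 }')
  if [ "$count" -gt 0 ]; then
    echo "ALERT: $count evicted pod(s)"
    return 1
  fi
  echo "OK: no evicted pods"
}

# Usage:
#   kubectl get pods -n notification-canada-ca --no-headers | check_evicted
```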

jimleroyer commented 11 months ago

Moving the task to icebox until we get notified with the new alert setup.

ben851 commented 11 months ago

Alarms triggered in production today. The pods were evicted due to node disk pressure.

We verified that the Fluent Bit logs are not taking up a huge amount of space, but the disks are getting full.

While we can't confirm it, we are fairly sure the image cache is taking up too much space and not clearing enough room.

We are going to increase the node disk size.
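To quantify the pressure on a node before and after the resize, a threshold check over `df -P` output can help. A sketch; the 85% figure is an illustrative threshold, not the kubelet's actual nodefs/imagefs eviction setting:

```shell
# disk_pressure_check: read `df -P <mount>` output on stdin and fail when
# disk usage crosses a threshold percentage. The kubelet evicts pods under
# DiskPressure based on its own eviction thresholds; 85% here is only an
# illustrative default.
disk_pressure_check() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR == 2 {
    use = $5 + 0  # strip the trailing "%" from the Capacity column
    if (use >= t) { print "DiskPressure risk: " use "% used"; exit 1 }
    print "OK: " use "% used"
  }'
}

# Usage on a node (e.g. via SSM or a debug pod):
#   df -P / | disk_pressure_check 85
```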

jimleroyer commented 11 months ago

Disk size is upgraded on the nodes but Terraform didn't recognize it, so the nodes will need to be recreated. Steve to drop a link here to the script for recreating the nodes. We need to test in the staging environment, especially to catch the 502 errors.

sastels commented 11 months ago

https://github.com/cds-snc/notification-terraform/pull/869

jimleroyer commented 11 months ago

Steve and Ben to test the new eviction script in staging environment.

sastels commented 10 months ago

need to do a release to get the secondary nodes in prod

sastels commented 10 months ago

secondary nodes now in prod

sastels commented 10 months ago

Application moved to secondary in prod this morning. Primary upgrade and move back scheduled for today.

ben851 commented 10 months ago

Primary nodes upgraded, need to delete secondary nodes

ben851 commented 10 months ago

Will release config to delete secondary nodes tomorrow.

ben851 commented 10 months ago

We will wait one more week to determine if any evicted pods show up.

jimleroyer commented 9 months ago

Checked production and there are no evicted pods so far.

ben851 commented 9 months ago

If no evicted pods by tomorrow, will move to done.