Open ben851 opened 1 year ago
We have to do a bunch of 💩 in staging.
Nothing has been evicted lately; waiting so that we can gather some data.
Reviewed the last week's worth of data in Grafana staging and it did not occur. Moving this to blocked for now, possibly back to icebox?
@ben851 to create an alert when pods are evicted.
New PR opened to alert on evicted pod status
Moving the task to icebox until we get notified with the new alert setup.
Alarms triggered in production today. The pods were evicted due to node disk pressure.
We verified that fluent bit logs are not taking up a huge amount of space, but disks are getting full.
While we can't confirm, we are pretty sure that the image cache is taking up too much space, and not clearing enough room.
We are going to increase the node disk size
Disk size is upgraded on the nodes but Terraform didn't recognize it. This will require recreating the nodes. Steve to drop a link here for the script to recreate the nodes. We need to test in the staging environment, especially to catch the 502 errors.
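One way to catch 502 errors while nodes are being recreated is to poll a status endpoint and record every non-200 response. The sketch below is only an illustration, not the script referred to above: the function names, the one-second interval, and the timeout are assumptions, and the URL is the staging status page mentioned in this thread.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

# Staging status page from this thread; swap in another endpoint as needed.
STATUS_URL = "https://staging.notification.cdssandbox.xyz/_status"


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 502) are raised as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def poll(
    url: str,
    attempts: int,
    interval: float = 1.0,  # assumed polling interval, in seconds
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, int]]:
    """Poll url `attempts` times; return (attempt_index, status) for each non-200."""
    failures = []
    for i in range(attempts):
        status = fetch(url)
        if status != 200:
            failures.append((i, status))
        if i < attempts - 1:
            sleep(interval)
    return failures
```

Calling `poll(STATUS_URL, attempts=300)` while the eviction script runs would return the list of failed attempts; an empty list means every request came back 200.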
Steve and Ben to test the new eviction script in staging environment.
https://staging.notification.cdssandbox.xyz/_status every second
Need to do a release to get the secondary nodes in prod
secondary nodes now in prod
application moved to secondary in prod this morning. primary upgrade and move back scheduled for today
Primary nodes upgraded, need to delete secondary nodes
Will release config to delete secondary nodes tomorrow.
We will wait one more week to determine if any evicted pods show up.
Checked production and there are no evicted pods so far.
If no evicted pods show up by tomorrow, will move to done.
Describe the bug
In staging, Celery pods are being evicted without providing any reason. Looking at the node health, it appears as though they are heavily overcommitted with memory requests. We need to investigate the reason for the eviction and determine the best course of action.
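To put a number on how committed a node's memory is, the container memory requests of the pods on that node can be summed and compared with the node's allocatable memory. This is a rough sketch with hypothetical helper names; it handles only the common Kubernetes quantity suffixes and expects pod objects shaped like the items in `kubectl get pods -o json`.

```python
from typing import Dict, List

# Binary and decimal suffixes commonly seen in memory quantities.
_SUFFIXES: Dict[str, int] = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "512Mi" or "2G" to bytes.

    Only the suffixes above and plain byte counts are supported;
    other forms (exponent notation, milli-units) are out of scope.
    """
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def memory_request_ratio(pods: List[dict], allocatable: str) -> float:
    """Return summed container memory requests divided by allocatable memory."""
    total = 0
    for pod in pods:
        for container in pod.get("spec", {}).get("containers", []):
            request = container.get("resources", {}).get("requests", {}).get("memory")
            if request:
                total += parse_memory(request)
    return total / parse_memory(allocatable)
```

The same function can be pointed at `limits` instead of `requests` with a one-word change, which is usually where a ratio above 1.0 shows up.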
The pods are being evicted due to disk pressure. I suspect that this is related to docker image caching on the OS disk of the node.
Message: Pod The node had condition: [DiskPressure].
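To see which nodes are currently reporting this condition, the node list can be filtered on its `status.conditions` entries. The sketch below assumes its input is the JSON printed by `kubectl get nodes -o json`; the function name is hypothetical.

```python
import json
from typing import List


def nodes_with_condition(nodes_json: str, condition: str = "DiskPressure") -> List[str]:
    """Return names of nodes whose given condition currently has status "True"."""
    doc = json.loads(nodes_json)
    flagged = []
    for node in doc.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == condition and cond.get("status") == "True":
                flagged.append(node["metadata"]["name"])
                break  # one match per node is enough
    return flagged
```

Passing `condition="MemoryPressure"` applies the same check to the memory theory from the original report.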
Bug Severity
SEV-2 Major
To Reproduce
Need to verify this wasn't a one-time occurrence. kubectl get pods -n notification-canada-ca in staging should not show any evicted pods.
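For a repeatable version of this check, the pod list can be filtered for evictions programmatically. An evicted pod is reported with `status.phase` set to `Failed` and `status.reason` set to `Evicted`. The sketch below assumes its input is the JSON printed by `kubectl get pods -n notification-canada-ca -o json`; the function name is hypothetical.

```python
import json
from typing import List, Tuple


def evicted_pods(pods_json: str) -> List[Tuple[str, str]]:
    """Return (pod_name, eviction_message) for every evicted pod in the list."""
    doc = json.loads(pods_json)
    evicted = []
    for pod in doc.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            evicted.append((pod["metadata"]["name"], status.get("message", "")))
    return evicted
```

An empty result matches the expected behavior; any entries carry the eviction message, which is where the disk pressure condition shows up.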
Expected behavior
There should be no evicted pods
Impact
This clutters our K8s namespace and could also introduce performance issues.
QA
Monitor production and staging environments to make sure there are no more evicted pods for 1 week since Sept 7th.