Open jimleroyer opened 8 months ago
This was fixed (IMDS filter version) last week. There have been no related errors since. We might want to wait another week...
This is now working. There is a similar issue happening when cwagent crashes due to memory read/write errors. We will open a separate card for this.
Describe the bug
The Celery pods fail to write metrics to the socket. We see an out of memory error on the CWAgent pods where this occurs.
The pods recover from this OOM error, but Celery does not seem to send logs properly even afterward.
Bug Severity
See examples in the documentation
Level: SEV-2 Major
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Impact
Impact on Notify users: None
Impact on Recipients: None
Impact on Notify team: Support time
Additional context
Incident
An incident was called because this triggered support alarms. The incident channel is #incident-2023-11-02-celery-pods-errors.
Investigation
The CloudWatch agent running on spot instances might have a mismatch between our on-demand and spot instances. More specifically, a filter configured in FluentBit is looking for their version, and both are wrong. It seems that FluentBit cannot carry metrics info because the IMDS cannot be properly shared:
[2023/11/02 14:40:49] [error] [filter:aws:aws.5] Could not retrieve ec2 metadata from IMDS
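
To check whether IMDS is reachable from a given pod or node, a minimal Python sketch along these lines might help. It follows the standard IMDSv2 flow: request a session token with a PUT, then read a metadata path with that token. The function names (`build_token_request`, `build_metadata_request`, `check_imds`) and the `opener` parameter are hypothetical helpers made up for this illustration. They are not part of FluentBit, the CloudWatch agent, or our codebase, and this check only approximates what the `aws` filter does internally.

```python
import urllib.error
import urllib.request

# Link-local address of the EC2 Instance Metadata Service (IMDS).
IMDS_BASE = "http://169.254.169.254"


def build_token_request(ttl_seconds=60):
    """Build the IMDSv2 PUT request that asks for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(token, path="instance-id"):
    """Build the GET request for one metadata path, using the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def check_imds(timeout=2.0, opener=urllib.request.urlopen):
    """Return (True, instance_id) if IMDS answers, else (False, error text).

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without network access.
    """
    try:
        with opener(build_token_request(), timeout=timeout) as resp:
            token = resp.read().decode()
        with opener(build_metadata_request(token), timeout=timeout) as resp:
            return True, resp.read().decode()
    except (urllib.error.URLError, OSError) as exc:
        return False, str(exc)


if __name__ == "__main__":
    ok, detail = check_imds()
    print("IMDS reachable:" if ok else "IMDS unreachable:", detail)
```

Running this from inside an affected pod and from the underlying node, on both a spot and an on-demand instance, could help narrow down where the metadata lookup fails.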