cds-snc / notification-planning-core

Project planning for GC Notify Core Team

The Celery pods fail to write metrics to the socket #219

Open jimleroyer opened 8 months ago

jimleroyer commented 8 months ago

Describe the bug

The Celery pods fail to write metrics to the socket. We see an out of memory error on the CWAgent pods where this occurs.

The pods recover from this OOM error, but Celery does not seem to send logs properly even afterward.

Bug Severity

See examples in the documentation

Level: SEV-2 Major

To Reproduce

Steps to reproduce the behavior:

  1. Enable on-demand and spot instances in our Karpenter node selector configuration.
  2. Put the environment under performance load so that new nodes come up, which in turn starts new CWAgent daemonset pods.

Expected behavior

Impact

Impact on Notify users: None

Impact on Recipients: None

Impact on Notify team: Support time

Additional context

Incident

An incident was called because this triggered support alarms. The incident channel is #incident-2023-11-02-celery-pods-errors.

Investigation

The CloudWatch agent running on spot instances might have a mismatch between our on-demand and spot instances. More specifically, a configured FluentBit filter is looking for their version, and both are wrong. It seems that FluentBit cannot carry metrics info because the IMDS cannot be properly shared:

[2023/11/02 14:40:49] [error] [filter:aws:aws.5] Could not retrieve ec2 metadata from IMDS
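
To check whether IMDS is reachable from a given pod, a small probe along these lines could help. This is only a sketch and not what was run during the incident. It assumes the standard EC2 link-local IMDS endpoint and the documented IMDSv2 token headers. The function names (`token_request`, `metadata_request`, `probe_imds`) are made up for illustration.

```python
import urllib.error
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"  # link-local IMDS address on EC2


def token_request(ttl_seconds=60):
    """Build the IMDSv2 session-token request (a PUT with a TTL header)."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path="meta-data/instance-id", token=None):
    """Build a metadata GET; without a token this is an IMDSv1-style call."""
    headers = {"X-aws-ec2-metadata-token": token} if token else {}
    return urllib.request.Request(f"{IMDS_BASE}/{path}", headers=headers)


def _fetch(request, opener, timeout):
    """Return the response body as text, or None on any network/HTTP error."""
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode().strip()
    except (urllib.error.URLError, OSError):
        return None


def probe_imds(opener=urllib.request.urlopen, timeout=2):
    """Report which IMDS versions answer from the current network namespace."""
    # IMDSv1: plain GET with no session token.
    v1 = _fetch(metadata_request(), opener, timeout) is not None
    # IMDSv2: fetch a session token first, then send it with the GET.
    token = _fetch(token_request(), opener, timeout)
    v2 = bool(token) and _fetch(metadata_request(token=token), opener, timeout) is not None
    return {"v1": v1, "v2": v2}
```

Running `print(probe_imds())` inside a pod on an on-demand node and again on a spot node would show whether the two node types answer differently. If both values come back `False`, that would be consistent with the FluentBit error above.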

sastels commented 8 months ago

This was fixed (IMDS filter version) last week. No related errors since. Might want to wait another week...

ben851 commented 7 months ago

This is now working. There is a similar issue happening when cwagent crashes due to memory read/write errors. We will open a separate card for this.