cds-snc / notification-planning-core

Project planning for GC Notify Core Team

CloudWatch Agent is crashing due to memory conflicts #222

Open ben851 opened 7 months ago

ben851 commented 7 months ago

Describe the bug

The CloudWatch agent occasionally crashes with a Go runtime stack trace reporting memory read/write conflicts.

When this happens, Kubernetes successfully restarts the pod, but the interruption causes Celery errors with metric timeouts. Celery appears to recover on its own (to be verified).

Bug Severity

SEV-2

To Reproduce

Unable to reproduce consistently. The crash may be specific to EC2 spot instances (to be confirmed).

Expected behavior

The CloudWatch agent should not crash.

Impact

Impact on Notify users: None

Impact on Recipients: None

Impact on Notify team: Potential gaps in Celery metrics, and the potential to page whoever is on call in the middle of the night.

Additional context

Since the nodes get recycled, I don't have logs to paste here yet. The next time this occurs, I will add the logs.
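
To make it easier to paste the relevant part of the logs once they are captured, here is a minimal Python sketch that pulls the Go crash section out of a saved log. The marker patterns are assumptions based on common Go runtime crash output (`fatal error:`, `panic:`, `goroutine N [`); they have not been confirmed against this incident. The function name `extract_go_crash` and the 40-line window are hypothetical choices for illustration.

```python
import re

# Assumption: generic Go runtime crash markers, not confirmed for this incident.
GO_CRASH_MARKERS = re.compile(
    r"(fatal error: |panic: |unexpected fault address|goroutine \d+ \[)"
)


def extract_go_crash(log_text, context_lines=40):
    """Return the first crash marker line plus the lines that follow it.

    Returns an empty list when no marker is found.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if GO_CRASH_MARKERS.search(line):
            # Keep the marker line and up to context_lines - 1 lines after it.
            return lines[i:i + context_lines]
    return []


if __name__ == "__main__":
    import sys

    # Usage: python extract_crash.py < saved-agent.log
    print("\n".join(extract_go_crash(sys.stdin.read())))
```

If the pod has restarted but its node has not yet been recycled, the log of the crashed container can usually still be saved with `kubectl logs <pod> --previous` (pod name is a placeholder) and piped into the script.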