Describe the bug
The CloudWatch agent occasionally crashes with a Go runtime stack trace about memory read/write conflicts.
When this happens, Kubernetes successfully restarts the pod, but the hiccup causes Celery errors with metric timeouts. Celery appears to recover on its own (to be verified).
Bug Severity
SEV-2
To Reproduce
We have been unable to reproduce this consistently; it may be specific to EC2 spot instances (to be confirmed).
Expected behavior
The CloudWatch agent should not crash.
Impact
Impact on Notify users:
None
Impact on Recipients:
None
Impact on Notify team:
Potential gaps in Celery metrics.
Potential to page whoever is on call in the middle of the night.
Additional context
Since the nodes get recycled, I don't have logs to paste here yet; I will add them the next time this occurs.
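Until a live crash can be caught, the previously terminated container's log may still hold the Go stack trace. A minimal sketch for saving it before the node is recycled — the namespace and label selector here are assumptions, not the actual deployment's values:

```shell
#!/bin/sh
# Sketch: save the logs of the previously terminated (crashed) container
# for each CloudWatch agent pod, before the node gets recycled.
# NS and the label selector are assumed names; adjust to the real deployment.
NS=amazon-cloudwatch
for pod in $(kubectl get pods -n "$NS" -l app=cloudwatch-agent -o name); do
  # --previous reads the log of the last terminated container instance,
  # which is where the Go runtime stack trace would appear.
  kubectl logs -n "$NS" "$pod" --previous > "${pod##*/}-previous.log" 2>/dev/null || true
done
```

Running this from a cron job (or triggering it from a pod-restart alert) would give us the stack trace to attach here even when the node is gone.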