cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Fluent Bit Emitters are paused during high load #221

Open ben851 opened 10 months ago

ben851 commented 10 months ago

Describe the bug

When running under high load, fluent-bit emitters for celery are being paused due to memory buffer limits, leading to log loss.

To Reproduce

This is difficult to reproduce, but it seems to occur when Karpenter has provisioned a single large node to run all celery instances at once. The volume of logs coming through the pipeline then overloads Fluent Bit.

Expected behavior

Fluent Bit should continue to process logs even under high load.

Impact

Potential log loss during critical periods

ben851 commented 10 months ago

From slack: The second issue is that under high load we're seeing the fluentbit emitter pausing due to memory buffer limits. I put in higher memory buffer limits in staging, and need to run some more performance tests this morning to see if that resolved the issue. (This one is more difficult to replicate).

ben851 commented 10 months ago

An indirect fix would be to ensure celery is split across multiple nodes using a pod topology spread constraint.
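
For illustration, a minimal sketch of what such a constraint could look like on the celery Deployment. The Deployment name, the `app: celery` label, the replica count, and the image are all assumptions, not taken from the actual manifests:

```yaml
# Hypothetical Deployment fragment; names, labels, replicas and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery
spec:
  replicas: 6
  selector:
    matchLabels:
      app: celery
  template:
    metadata:
      labels:
        app: celery
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                           # allow at most 1 pod difference between nodes
          topologyKey: kubernetes.io/hostname  # spread per node
          whenUnsatisfiable: ScheduleAnyway    # soft constraint; DoNotSchedule would make it hard
          labelSelector:
            matchLabels:
              app: celery
      containers:
        - name: celery
          image: example/celery:latest         # placeholder image
```

With `ScheduleAnyway` the scheduler only prefers an even spread. `DoNotSchedule` would enforce it, which may cause additional nodes to be provisioned rather than packing every celery pod onto one large node.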

ben851 commented 10 months ago

Upped the memory buffer limits of the celery pipeline to 150 MB. Seems to have fixed the issue in staging.
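
As a rough sketch of where such a limit lives, assuming the emitter comes from a `rewrite_tag` filter and that Fluent Bit is configured in its YAML format (the thread confirms neither). The match pattern, rule, and emitter name below are hypothetical:

```yaml
# Hypothetical Fluent Bit pipeline fragment; match, rule and emitter_name are placeholders.
pipeline:
  filters:
    - name: rewrite_tag
      match: kube.*
      # Rule format: KEY REGEX NEW_TAG KEEP
      rule: $kubernetes['labels']['app'] ^(celery)$ celery.$TAG false
      emitter_name: celery_emitter
      # Raise the emitter's in-memory buffer; the emitter is paused when this limit is reached.
      emitter_mem_buf_limit: 150M
```

If the emitter belongs to a different filter, for example `multiline`, the equivalent setting would go on that filter instead.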

ben851 commented 10 months ago

Ready to merge to production, will do so today.

ben851 commented 10 months ago

Deployed to production, will evaluate over the coming days.

ben851 commented 10 months ago

No paused emitters over last two days

sastels commented 10 months ago

All still looking good after a week.