Open ben851 opened 10 months ago
From Slack: The second issue is that under high load we're seeing the fluent-bit emitter pausing due to memory buffer limits. I put higher memory buffer limits in place in staging and need to run some more performance tests this morning to see whether that resolved the issue. (This one is more difficult to replicate.)
An indirect fix would be to ensure celery is spread across multiple nodes using a pod topology spread constraint.
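As a rough sketch (the Deployment name and labels below are assumptions, not taken from this repo), such a constraint on the celery workers could look like:

```yaml
# Hypothetical excerpt of a celery worker Deployment; names and labels are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread pods across nodes
          whenUnsatisfiable: DoNotSchedule      # force scheduling onto additional nodes
          labelSelector:
            matchLabels:
              app: celery-worker
```

With `whenUnsatisfiable: DoNotSchedule` and a hostname topology key, Karpenter would have to provision additional nodes rather than packing every celery pod onto one large node, which would also spread the log volume across multiple fluent-bit daemonset pods.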
Upped the memory buffer limit of the celery pipeline to 150MB. Seems to have fixed the issue in staging.
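For context, the "paused (mem buf overlimit)" behaviour is governed by fluent-bit's per-input memory limits; for the internal emitter created by a `rewrite_tag` filter it is `Emitter_Mem_Buf_Limit`. A minimal sketch of what a 150MB change could look like in classic fluent-bit config syntax (the tag, path, and rewrite rule are assumptions about how this pipeline is set up, not the actual config):

```
[INPUT]
    Name              tail
    Path              /var/log/containers/celery-*.log   # assumed log path
    Tag               kube.celery.*
    Mem_Buf_Limit     150MB

[FILTER]
    Name                   rewrite_tag
    Match                  kube.celery.*
    Rule                   $kubernetes['labels']['app'] ^celery$ celery.$TAG false
    Emitter_Name           celery_emitter
    Emitter_Mem_Buf_Limit  150MB    # the in-memory emitter gets its own buffer limit
```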
Ready to merge to production, will do so today.
Deployed to production, will evaluate over the coming days.
No paused emitters over the last two days.
All still looking good after a week.
Describe the bug
When running under high load, fluent-bit emitters for celery are being paused due to memory buffer limits, leading to log loss.
To Reproduce
This is difficult to reproduce, but it seems to occur when Karpenter has provisioned a single large node that runs all celery instances at once. The volume of logs coming through the pipeline then overloads fluent-bit.
Expected behavior
Fluent-bit should continue to process logs even under high load, rather than pausing inputs and dropping records.
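One documented way to get closer to that behaviour (not something tried in this thread; the paths below are assumptions) is to let fluent-bit spill chunks to filesystem storage instead of relying only on in-memory buffers:

```
[SERVICE]
    storage.path               /var/fluent-bit/state/   # assumed hostPath for buffered chunks
    storage.max_chunks_up      128
    storage.backlog.mem_limit  50MB

[INPUT]
    Name          tail
    Path          /var/log/containers/celery-*.log      # assumed log path
    Tag           kube.celery.*
    storage.type  filesystem                            # overflow chunks go to disk instead of pausing the input
```

With filesystem buffering enabled on an input, memory pressure causes new chunks to be written to disk rather than the input being paused, at the cost of extra disk I/O on the node.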
Impact
Potential log loss during critical periods