vignesh-codes opened 12 months ago
Another log example from when the OTEL pod reaches 95% memory utilisation:
2023-10-09T14:41:22.020Z info memorylimiterprocessor@v0.85.0/memorylimiter.go:266 Memory usage after GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6408}
2023-10-09T14:41:22.020Z warn memorylimiterprocessor@v0.85.0/memorylimiter.go:276 Memory usage is above hard limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6408}
After a few hours, the pod got OOM killed.
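For reference, the soft and hard limits in these messages come from the memory_limiter processor configuration: the hard limit is limit_mib, and the soft limit is limit_mib minus spike_limit_mib. Above the soft limit the collector refuses new data; above the hard limit it forces GCs, as in the logs above. A minimal sketch with illustrative values (not the chart's defaults):

processors:
  memory_limiter:
    check_interval: 1s      # how often memory usage is checked
    limit_mib: 6144         # hard limit; a GC is forced when usage exceeds this
    spike_limit_mib: 1228   # soft limit = limit_mib - spike_limit_mib; new data is refused above it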
Thank you for reporting this @vignesh-codes! We were able to fix a bug in our k8s_tagger processor, and have likely found one in the upstream k8sattributes processor as well.
I am afraid I can't say with any confidence that this bug is the root of your troubles, though. The nature of the bug suggests to me that the collector had been in trouble for some time.
It is a reality that telemetry data can be sent to a collector faster than it can be processed and forwarded on to Sumo Logic (or any other service). Our default pipeline configurations include the memory limiter processor to attempt to apply back pressure in these situations, refusing to accept new telemetry data. It looks like you've already discovered that this tactic is not sufficient on its own. The collector needs to be scaled.
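For back pressure to work, memory_limiter must be the first processor in each pipeline, so it can refuse data before anything downstream buffers it. A minimal sketch of that ordering, reusing the pipeline name from the logs above (the receiver, processor, and exporter names are placeholders for whatever the actual config uses):

service:
  pipelines:
    logs/otlp/auditlogs:
      receivers: [otlp]
      # memory_limiter must come first so back pressure reaches the receivers
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [sumologic]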
If you are using our Helm chart, we do have a solution for autoscaling; see the relevant autoscaling docs for the stable release.
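For setups that cannot use the chart's built-in autoscaling, a plain HorizontalPodAutoscaler achieves a similar effect. A minimal sketch, assuming the collector runs as a Deployment named otel-collector (the workload name and thresholds here are hypothetical, not taken from the chart):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector   # hypothetical workload name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          # scale out well before the memory_limiter hard limit is reached
          averageUtilization: 70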
Hey team,
I saw a sudden rise in memory in my pod, and GC was not reducing memory consumption. Image: sumologic-otel-collector:0.85.0-sumo-0
The pod's memory went beyond 90% of the defined limit and did not return to normal. After I manually restarted the pod, memory usage was fine again.
Upon checking the logs, we found that memory usage went above the hard limit but did not come down after GC. We also found a couple of error logs pointing to the k8s_tagger processor.
We used the default values.
In the previous logs of restarted pods across multiple clusters, I also see that memory went above the hard limit, did not come down after GC, and produced lots of errors of the same type.
These logs appear frequently, and memory often remains elevated.
We also see an increased error count in the log collector pod that forwards data to this memory-bound OTEL pod.
I want to know the following:
Expected behavior: