envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

FluentdAccessLoggerImpl::flush memory leak #35573

Closed: triplewy closed this issue 1 month ago

triplewy commented 2 months ago

Description: I think we are seeing a memory leak when using the Fluentd access log. I have attached the heap profile to the issue.

I am unsure if this is reasonable for an Envoy instance that's running with 8 cores and 16G of memory and handling 50K active connections at a rate of around 2k new connections per second.

[screenshot: screenshot-20240802-154253]

I am not too familiar with how Slice is used here but it seems to be created from this piece of code: https://github.com/envoyproxy/envoy/blob/6c645a13d190a75cbd957e208e469ee5078eb4c9/source/extensions/access_loggers/fluentd/fluentd_access_log_impl.cc#L103-L106
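
For reference, the linked lines essentially copy the packed MessagePack payload into a Buffer::OwnedImpl and hand it to the async TCP client. A paraphrased sketch (not the exact upstream source) looks roughly like this:

```cpp
// Paraphrased sketch of the linked flush() lines; not a verbatim copy of the file.
// `buffer` is the serialized Fluentd Forward event built from the queued entries.
Buffer::OwnedImpl data(buffer.data(), buffer.size()); // copies the bytes into buffer slices
client_->write(data, false);                          // hands the data to the AsyncTcpClient
```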

Does Buffer::OwnedImpl data need to be explicitly freed here? It appears that …

Profile: profile002.pdf

Edit: After accidentally turning off the Fluentd access log filter, we found that memory usage stabilized. Once we enabled the Fluentd access log again, memory usage kept increasing linearly over time. This makes me believe the issue is in the way Buffer::OwnedImpl is used here.

RyanTheOptimist commented 2 months ago

cc: @ohadvano cc: @wbpcode

ohadvano commented 2 months ago

@triplewy thanks for reporting this. However, I'm not sure that Buffer::OwnedImpl would require an explicit free here. I think that when we call write(data), the AsyncTcpClient calls write() on the ClientConnection, which should consume the data in the buffer. @botengyao please correct me if I am wrong.
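
To illustrate the point, here is a minimal sketch assuming Envoy's Buffer API (this is not the actual logger or connection code): writing ultimately move()s the bytes out of the caller's buffer, so the local OwnedImpl ends up empty and is released by RAII when it goes out of scope.

```cpp
#include "source/common/buffer/buffer_impl.h"

// Minimal sketch of the draining semantics described above; this is not the
// logger or connection code, just an illustration of Buffer ownership.
void writeLikeAConnection(Envoy::Buffer::Instance& source, Envoy::Buffer::Instance& wire) {
  wire.move(source); // transfers the slices out of `source`, leaving it empty
}

void example() {
  Envoy::Buffer::OwnedImpl data("serialized msgpack payload");
  Envoy::Buffer::OwnedImpl wire;
  writeLikeAConnection(data, wire);
  // data.length() == 0 here; nothing needs to be freed explicitly, RAII cleans up.
}
```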

Can you elaborate on what exactly we see in the diagram? I'm not sure how to read it. Given that we're also using this extension in production and we don't see a memory increase (we observe the server.memory_allocated and server.memory_heap_size stats), can you please confirm that when you use the extension, the upstream Fluentd server indeed receives the events?

triplewy commented 2 months ago

I see. So your theory is that the upstream Fluentd server is not actually receiving anything, and that causes memory to pile up?

From what I can tell, the entries lost counter does not increase, so I assumed that meant the upstream is receiving the logs. But I will check today.

I can post our Grafana screenshots with the Fluentd log enabled and disabled to give a better idea of how it affects memory.

ohadvano commented 2 months ago

The entries-lost counter would only increase if a new log is written while the buffer is already full. However, the buffer has a capacity threshold, and unless the events are successfully sent to the remote server, the buffered data will keep growing until it reaches that threshold (by default 16KB).

But what's weird is that if we get to the write() call in flush(), then clearBuffer() should also be called afterwards. So I'm a bit confused.
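
To make the expected flow concrete, here is an illustrative sketch of the behavior described above; the member, stat, and helper names are placeholders rather than the actual FluentdAccessLoggerImpl identifiers.

```cpp
// Illustrative sketch only; names are placeholders, not the real implementation.
void log(EntryPtr entry) {
  if (approximate_buffered_bytes_ >= max_buffer_size_bytes_) { // default threshold: 16KB
    entries_lost_.inc(); // only increments once the buffer is already full
    return;
  }
  approximate_buffered_bytes_ += entry->record_.size();
  entries_.push_back(std::move(entry));
}

void flush() {
  // ... serialize entries_ and write() the payload to the upstream ...
  clearBuffer(); // resets entries_ and the byte counter, so memory should not keep growing
}
```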

Again, I'm not sure if I'm reading the diagram correctly; if you can provide a summary, that would be great. Are the blocks supposed to represent the stack trace? What does it mean that the Slice block shows 4516.60MB (54.54%) with a bigger box and font? Why are there no other blocks after it other than access_allocated:hot?

Thanks

triplewy commented 2 months ago

@ohadvano I'm not too sure how to read a heap profile either, but I think it represents how much memory gets allocated by each function. I'm not sure if it also tracks frees. Right now our biggest clue is that turning the Fluentd access log filter on and off results in a significant difference in memory usage. I'll send the graphs for that later today.

triplewy commented 2 months ago

@ohadvano Here is a graph showing heap usage, avg(envoy_server_memory_heap_size{cluster='pbc1a', job='dsa-prod'}) / (1024 * 1024), with the Fluentd access log enabled. It appears that the Fluentd access log adds 8G of memory for 50k active downstream connections, which doesn't seem reasonable.

[image]

ohadvano commented 2 months ago

Does the graph show an aggregation for a cluster, or a single instance? From what you wrote, it seems like a cluster. Can you please provide the same for a specific instance?

Also, I still didn't understand whether the upstream receives the logs in your case.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.