fluent / fluent-plugin-mongo

MongoDB input and output plugin for Fluentd
https://docs.fluentd.org/output/mongo

Insertions to MongoDB suddenly stop: BufferOverflowError #135

Open FraLarghi opened 4 years ago

FraLarghi commented 4 years ago

Hi everyone,

I have a problem with insertions into MongoDB from my fluentd instances. I use a self-built fluentd container based on the latest fluentd:1.7 image (I've also tried previous versions), with just a few extra plugins installed.

My config is quite simple; I only need high throughput. My fluentd instance systematically starts to fill up RAM for no apparent reason and eventually stops flushing chunks into MongoDB. Of course, when that happens the buffer grows indefinitely until I restart the container. Maybe the rate is too high? But the machines dedicated to this are powerful, so that seems odd to me. When the container is restarted, all the logs are correctly inserted within a few seconds, and the same situation recurs after a few hours or days. The strange part is that the CPU is not working at full load at all; it only does so when I restart the container, at which point it flushes all the logs correctly into Mongo in a couple of seconds.

graphs

Is there a way to speed up the insertion rate by putting more load on the CPU? I've already configured flush_thread_count to the maximum (8 or 16) and workers to 8, but they only use a small percentage of the cores' power while they keep filling RAM with the buffer.
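For reference, the relevant part of my config looks roughly like this (a minimal sketch: the tag pattern, the MongoDB host/database and the size limits are placeholder values, not my real ones):

<system>
  workers 8
</system>

# hypothetical tag pattern matching the UDP source
<match fortinet.**>
  @type mongo
  host mongo-host              # placeholder MongoDB host
  port 27017
  database logs                # placeholder database
  collection fortinet
  <buffer>
    @type memory
    flush_thread_count 8       # flush chunks with 8 parallel threads
    flush_interval 1s
    chunk_limit_size 8m        # placeholder chunk size
    total_limit_size 512m      # placeholder cap; overflow errors appear once this fills up
  </buffer>
</match>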

These are the logs, printed by the thousands per second, when it suddenly stops flushing the buffer. Before that, everything in the logs is clean and OK; once this starts, no plugin is able to use the buffer and it stops:

2019-12-12 09:11:51 +0000 [error]: #0 [in_fortinet] unexpected error in processing UDP data error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data"
2019-12-12 09:11:51 +0000 [error]: #0 suppressed same stacktrace
2019-12-12 09:11:51 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2019-12-12 09:11:51 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.7.4/lib/fluent/plugin/buffer.rb:265:in `write'" tag="fluent.warn"
2019-12-12 09:11:51 +0000 [warn]: #0 suppressed same stacktrace
2019-12-12 09:11:51 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.7.4/lib/fluent/plugin/buffer.rb:265:in `write'" tag="fluent.warn"
2019-12-12 09:11:51 +0000 [warn]: #0 suppressed same stacktrace
2019-12-12 09:11:51 +0000 [error]: #0 failed to emit fluentd's log event tag="fluent.warn" event={"action"=>:throw_exception, "message"=>"failed to write data into buffer by buffer overflow action=:throw_exception"} error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data"
2019-12-12 09:11:51 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2019-12-12 09:11:51 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.7.4/lib/fluent/plugin/buffer.rb:265:in `write'" tag="fluent.warn"
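
From what I understand, the action=:throw_exception in those messages corresponds to the buffer's overflow_action parameter (throw_exception is the default). Switching it to one of the other documented values only changes what happens once the buffer is full, and it wouldn't explain why flushing stops in the first place, but for completeness here is a sketch with an example limit value:

<buffer>
  # block makes the input wait instead of raising BufferOverflowError;
  # drop_oldest_chunk discards the oldest buffered chunk instead
  overflow_action block
  total_limit_size 1g          # example value; raising it only delays the overflow
</buffer>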

I would at least like to understand where the bottleneck is. I'm struggling to make this solid for a large deployment ASAP. Thank you all for your support!

repeatedly commented 4 years ago

Your metrics are very strange. The MongoDB operations metric shows no operations, but your network usage hasn't decreased. Are there no warning or error logs at 8:40 - 8:50? If there were a bottleneck, the Fluentd buffer length would increase gradually. But its value increases suddenly...

FraLarghi commented 4 years ago

Hi! Thank you so much for your reply.

The network usage graph refers to the incoming log traffic to the machine. The weird part (apart from this particular fast growth) is that the RAM usage of the fluentd container always keeps growing slowly while the CPU usage stays very low (I have 32 cores on that machine and workers set to 16); I would like to use the CPU more and keep the RAM low instead.

I thought it was the same problem raised in https://github.com/fluent/fluentd/issues/2580, so I disabled the flush_thread_count setting, but it doesn't seem to affect the RAM growth at all.

This is the graph of the last few hours, showing the constant growth on that machine.

graphs3

(the peaks are probably related to the sources' actual traffic, which peaks every 2 hours)

On other, smaller machines I have a crontab entry that restarts the container every 2 hours to force a flush and empty the buffer (and therefore the RAM), to avoid any overflow problem.

I have now disabled the restart on all my machines; later I will post the resulting graphs and logs to investigate this strange behavior further. Thank you again.

7c commented 4 years ago

I have a similar problem, probably the same. I have 30 log-processing servers, each with Mongo on it as the output for incoming logs. I increased the number of servers to 60, which did not help much. Once I restart td-agent it keeps working for about 10 minutes, then it stops.

image

After a certain time, td-agent STOPS flushing on one tag but keeps operating on another tag. The stopped tag has very high throughput, and the one that did not stop has much lower throughput (roughly 1/50th).

image

Memory behavior seems similar to yours; it starts filling up. Restarting td-agent is for now the only workaround I have, but I probably lose a lot of data. Any recommendations are welcome.

Here are the fluentd monitoring metrics of a single server:

image

I am not sure if I understand the 'buffer queue size' metric correctly, but after a few minutes it starts filling this buffer instead of flushing.
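
For context, those numbers come from the monitor_agent input plugin; a minimal sketch of how it is exposed (the bind address and port here are just the usual defaults, not necessarily what I use):

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
# per-plugin metrics such as buffer_queue_length and buffer_total_queued_size
# are then served at http://localhost:24220/api/plugins.json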