fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0
12.86k stars 1.34k forks source link

Ruby uplift 2.5 -> 3.3 makes Fluentd startup 60x slower #4545

Closed dinatamas closed 1 month ago

dinatamas commented 3 months ago

Describe the bug

I am running fluentd in a docker container, and noticed that when I uplift ruby from version 2.5 (provided by the OS) to version 3.3 (built by me), then the startup time of fluentd increases significantly (from <1 second to multiple minutes).

The main bottleneck is the CPU:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 19 normali+  20   0  479604  20616   8088 R 99.45 0.031   0:17.68 ruby /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf

and then later:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
156 normali+  20   0  487152  27128   8436 R 99.67 0.041   0:43.26 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf --under-supervisor
154 normali+  20   0  487232  27504   8552 R 98.34 0.042   0:43.53 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf --under-supervisor

The high CPU load only stops once the fluentd worker is now running worker messages appear.

I ran an strace, and found that calls like the following happen hundreds of times each second:

2024-06-27T15:09:47.844482218Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49017062}) = 0
2024-06-27T15:09:47.844783480Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49024759}) = 0
2024-06-27T15:09:47.845002115Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49170943}) = 0
2024-06-27T15:09:47.845227095Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49231414}) = 0

To Reproduce

I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins. But for me it happens very reliably, each time I start the container. I have the full strace output, but it's 260'000 lines long, and 240'000 of it are just those clock_gettime() calls.

Your Environment

Note: Uplifting fluentd to v1.17.0 does not help.

Gemfile:

gem 'fluentd', '1.15.3'
gem 'fluent-plugin-rewrite-tag-filter', '2.4.0'
gem 'fluent-plugin-grok-parser', '2.6.2'
gem 'fluent-plugin-kafka', '0.15.3'
gem 'fluent-plugin-record-modifier', '2.1.0'
gem 'fluent-plugin-prometheus', '2.0.2'
gem 'fluent-plugin-opensearch', '1.0.4'
gem 'fluent-plugin-flatten-hash', '0.5.1'
gem 'oj', '3.12.2'

Your Configuration

I am deliberately running a simple - basically empty - config, and even with this the startup takes a very long time. It's dumped in the following error log. It also takes ~1 minute to simply execute fluentd --version.

Find the fluentd trace logs of the startup below. There is almost an entire minute between the first line (the fluentd command being issued) and the first log message from fluentd:

2024-06-27T16:09:52.030112275Z + fluentd -vv -c /fluentd/fluent.conf
2024-06-27T16:10:52.076792835Z 2024-06-27 16:10:52 +0000 [info]: fluent/log.rb:330:info: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:10:52.214539070Z 2024-06-27 16:10:52 +0000 [info]: fluent/log.rb:330:info: parsing config file is succeeded path="/fluentd/fluent.conf"
2024-06-27T16:10:57.101598357Z 2024-06-27 16:10:57 +0000 [trace]: fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:10:58.726205882Z 2024-06-27 16:10:58 +0000 [trace]: fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:10:58.739464606Z 2024-06-27 16:10:58 +0000 [debug]: fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:10:58.752734175Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: using configuration file: <ROOT>
2024-06-27T16:10:58.753023723Z   <system>
2024-06-27T16:10:58.753036875Z     workers 2
2024-06-27T16:10:58.753040772Z     ignore_repeated_log_interval 60s
2024-06-27T16:10:58.753128642Z     ignore_same_log_interval 60s
2024-06-27T16:10:58.753519418Z   </system>
2024-06-27T16:10:58.753529798Z   <source>
2024-06-27T16:10:58.753534386Z     @type monitor_agent
2024-06-27T16:10:58.753538031Z     bind "0.0.0.0"
2024-06-27T16:10:58.753542108Z     port 24220
2024-06-27T16:10:58.753644256Z   </source>
2024-06-27T16:10:58.755709184Z </ROOT>
2024-06-27T16:10:58.759459327Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: starting fluentd-1.15.3 pid=19 ruby="3.3.1"
2024-06-27T16:10:58.777075542Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: spawn command to main:  cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bin/fluentd", "-vv", "-c", "/fluentd/fluent.conf", "--under-supervisor"]
2024-06-27T16:10:58.778966486Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:00.121158112Z 2024-06-27 16:12:00 +0000 [info]: #0 fluent/log.rb:330:info: init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:00.768508406Z 2024-06-27 16:12:00 +0000 [info]: fluent/log.rb:330:info: adding source type="monitor_agent"
2024-06-27T16:12:01.271596642Z 2024-06-27 16:12:01 +0000 [info]: #1 fluent/log.rb:330:info: init workers logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:07.967666275Z 2024-06-27 16:12:07 +0000 [trace]: #0 fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:12:08.765010604Z 2024-06-27 16:12:08 +0000 [trace]: #1 fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:12:14.205541864Z 2024-06-27 16:12:14 +0000 [trace]: #0 fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:12:14.246520817Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:12:14.246938159Z 2024-06-27 16:12:14 +0000 [info]: #0 fluent/log.rb:330:info: starting fluentd worker pid=154 ppid=19 worker=0
2024-06-27T16:12:14.247165212Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: listening monitoring http server on http://0.0.0.0:24220/api/plugins for worker0
2024-06-27T16:12:14.249364282Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: Start webrick HTTP server listening
2024-06-27T16:12:14.261697759Z 2024-06-27 16:12:14 +0000 [info]: #0 fluent/log.rb:330:info: fluentd worker is now running worker=0
2024-06-27T16:12:15.227378404Z 2024-06-27 16:12:15 +0000 [trace]: #1 fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:12:15.234665606Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:12:15.234926686Z 2024-06-27 16:12:15 +0000 [info]: #1 fluent/log.rb:330:info: starting fluentd worker pid=156 ppid=19 worker=1
2024-06-27T16:12:15.235206925Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: listening monitoring http server on http://0.0.0.0:24221/api/plugins for worker1
2024-06-27T16:12:15.241959806Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: Start webrick HTTP server listening
2024-06-27T16:12:15.247309553Z 2024-06-27 16:12:15 +0000 [info]: #1 fluent/log.rb:330:info: fluentd worker is now running worker=1

Your Error Log

Usually I don't get an error log, because things eventually start working correctly, it just takes a very long time...

When I hit Ctrl+C during fluentd startup (when it's taking 1 minute to load), I usually get a stacktrace that's very similar to this:

<internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require': Interrupt
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/lib/fluent/supervisor.rb:18:in `<top (required)>'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/lib/fluent/command/fluentd.rb:19:in `<top (required)>'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/bin/fluentd:15:in `<top (required)>'
        from /usr/local/bin/fluentd:14:in `load'
        from /usr/local/bin/fluentd:14:in `<main>'

Additional context

Interestingly, the first log line from fluentd always appears almost exactly 1 minute after the command was issued, it might not be a coincidence.

But even after that, the load still takes very long and the CPU usage is ~100%. For example it takes 5-10 minutes to load my actual configuration, which has a lot of rules and uses multiple plugins like Prometheus, Kafka, OpenSearch. This took only 1-2 seconds before the ruby uplift.

kenhys commented 3 months ago

I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins.

Are there somewhere publicly available Dockerfile recipe to reproduce it with Ruby 3.3?

dinatamas commented 2 months ago

@kenhys I am sorry, but I was unable to create such a Dockerfile. I run fluentd on company stack, and even though I tried to recreate the same environment, there must be some small difference I was unable to indentify, which causes this bug... But I still have full access to the problematic environment, so I can run any sort of debug commands you'd find useful.

In the meantime, I have compared the "good" and the "bad" strace outputs. The differences begin mostly with the large number of clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=307321677}) = 0 lines appearing when the gemspec files like /usr/local/lib/ruby/gems/3.3.0/specifications/default/abbrev-0.1.2.gemspec are being stat()'d.

In the "bad" strace:

[...]
2024-06-27T15:09:47.812048633Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40666906}) = 0
2024-06-27T15:09:47.812229082Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40783975}) = 0
2024-06-27T15:09:47.812461326Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40792810}) = 0
2024-06-27T15:09:47.812621680Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40934372}) = 0
2024-06-27T15:09:47.812842620Z stat("/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", {st_mode=S_IFREG|0644, st_size=907, ...}) = 0
2024-06-27T15:09:47.813056889Z openat(AT_FDCWD, "/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", O_RDONLY|O_CLOEXEC) = 6
2024-06-27T15:09:47.813226660Z ioctl(6, TCGETS, 0x7ffc72a51bd0)        = -1 ENOTTY (Inappropriate ioctl for device)
2024-06-27T15:09:47.813404374Z fstat(6, {st_mode=S_IFREG|0644, st_size=907, ...}) = 0
2024-06-27T15:09:47.813552395Z lseek(6, 0, SEEK_CUR)                   = 0
2024-06-27T15:09:47.813718359Z read(6, "# -*- encoding: utf-8 -*-\n# stub"..., 907) = 907
2024-06-27T15:09:47.813896775Z read(6, "", 8192)                       = 0
2024-06-27T15:09:47.814060434Z close(6)
2024-06-27T15:09:47.814346873Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41082553}) = 0
2024-06-27T15:09:47.815203255Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41183032}) = 0
2024-06-27T15:09:47.815207883Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41200884}) = 0
2024-06-27T15:09:47.815210588Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41309077}) = 0
[...]

versus the "good" strace:

2024-07-08T14:34:22.374316663Z newfstatat(AT_FDCWD, "/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", {st_mode=S_IFREG|0644, st_size=907, ...}, 0) = 0
2024-07-08T14:34:22.374409857Z openat(AT_FDCWD, "/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", O_RDONLY|O_CLOEXEC) = 6
2024-07-08T14:34:22.374452593Z ioctl(6, TCGETS, 0x7ffdc639b600)        = -1 ENOTTY (Inappropriate ioctl for device)
2024-07-08T14:34:22.374537885Z fstat(6, {st_mode=S_IFREG|0644, st_size=907, ...}) = 0
2024-07-08T14:34:22.374583757Z lseek(6, 0, SEEK_CUR)                   = 0
2024-07-08T14:34:22.374654208Z read(6, "# -*- encoding: utf-8 -*-\n# stub"..., 907) = 907
2024-07-08T14:34:22.374713226Z read(6, "", 8192)                       = 0
2024-07-08T14:34:22.374776488Z close(6)

In the good case each gemspec file is examined immediately after each other. In the bad case a lot of clock_gettime()'s separate the stat() calls. Also, for some reason the bad uses stat() and the good uses newfstatat(). Could this cause the slowness?

dinatamas commented 2 months ago

I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins.

Are there somewhere publicly available Dockerfile recipe to reproduce it with Ruby 3.3?

I managed to reproduce the issue with a Dockerfile: https://github.com/dinatamas/fluentd-4545

There are two Dockerfiles, the "good" which is ruby 2.5.9 and fluentd loads immediately, and the "bad" which is ruby 3.3.1 and fluentd startup takes a long time. I managed to reproduce the issue in multiple independent environments, so hopefully it will work.

dinatamas commented 2 months ago

This is just a comment to avoid flagging this issue as stale.

@kenhys I would be happy to try to investigate further, but I don't have ruby experience, do you perhaps have some tool suggestions or tips for debugging / what to look for?

kenhys commented 1 month ago

Thanks @dinatamas , I can reproduce it.

I'm not sure why, but it seems that ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 is not suitable parameter for ruby 3.3.1. Please disable it and rebuild image.

#ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9
kenhys commented 1 month ago

Checked some more environment how many delay observed?:

As for recent version of Ruby, it seems that RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 may be a bit radical parameter to force full GC.

kenhys commented 1 month ago

I've added this issue in official documentation.

https://github.com/fluent/fluentd-docs-gitbook/pull/513