Closed dinatamas closed 1 month ago
I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins.
Is there a publicly available Dockerfile recipe somewhere to reproduce it with Ruby 3.3?
@kenhys I am sorry, but I was unable to create such a Dockerfile. I run fluentd on a company stack, and even though I tried to recreate the same environment, there must be some small difference I was unable to identify, which causes this bug... But I still have full access to the problematic environment, so I can run any sort of debug commands you'd find useful.
In the meantime, I have compared the "good" and the "bad" strace outputs. The differences begin mostly with the large number of clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=307321677}) = 0 lines appearing when the gemspec files like /usr/local/lib/ruby/gems/3.3.0/specifications/default/abbrev-0.1.2.gemspec are being stat()'d.
In the "bad" strace:
[...]
2024-06-27T15:09:47.812048633Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40666906}) = 0
2024-06-27T15:09:47.812229082Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40783975}) = 0
2024-06-27T15:09:47.812461326Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40792810}) = 0
2024-06-27T15:09:47.812621680Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40934372}) = 0
2024-06-27T15:09:47.812842620Z stat("/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", {st_mode=S_IFREG|0644, st_size=907, ...}) = 0
2024-06-27T15:09:47.813056889Z openat(AT_FDCWD, "/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", O_RDONLY|O_CLOEXEC) = 6
2024-06-27T15:09:47.813226660Z ioctl(6, TCGETS, 0x7ffc72a51bd0) = -1 ENOTTY (Inappropriate ioctl for device)
2024-06-27T15:09:47.813404374Z fstat(6, {st_mode=S_IFREG|0644, st_size=907, ...}) = 0
2024-06-27T15:09:47.813552395Z lseek(6, 0, SEEK_CUR) = 0
2024-06-27T15:09:47.813718359Z read(6, "# -*- encoding: utf-8 -*-\n# stub"..., 907) = 907
2024-06-27T15:09:47.813896775Z read(6, "", 8192) = 0
2024-06-27T15:09:47.814060434Z close(6)
2024-06-27T15:09:47.814346873Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41082553}) = 0
2024-06-27T15:09:47.815203255Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41183032}) = 0
2024-06-27T15:09:47.815207883Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41200884}) = 0
2024-06-27T15:09:47.815210588Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41309077}) = 0
[...]
versus the "good" strace:
2024-07-08T14:34:22.374316663Z newfstatat(AT_FDCWD, "/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", {st_mode=S_IFREG|0644, st_size=907, ...}, 0) = 0
2024-07-08T14:34:22.374409857Z openat(AT_FDCWD, "/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", O_RDONLY|O_CLOEXEC) = 6
2024-07-08T14:34:22.374452593Z ioctl(6, TCGETS, 0x7ffdc639b600) = -1 ENOTTY (Inappropriate ioctl for device)
2024-07-08T14:34:22.374537885Z fstat(6, {st_mode=S_IFREG|0644, st_size=907, ...}) = 0
2024-07-08T14:34:22.374583757Z lseek(6, 0, SEEK_CUR) = 0
2024-07-08T14:34:22.374654208Z read(6, "# -*- encoding: utf-8 -*-\n# stub"..., 907) = 907
2024-07-08T14:34:22.374713226Z read(6, "", 8192) = 0
2024-07-08T14:34:22.374776488Z close(6)
In the good case the gemspec files are examined one immediately after another. In the bad case a lot of clock_gettime() calls separate the stat() calls.
Also, for some reason the bad case uses stat() and the good case uses newfstatat(). Could this cause the slowness?
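One way to quantify the difference between the two traces is to tally syscalls by name. The following is a minimal Ruby sketch, assuming every line has the shape shown in the excerpts above (an optional timestamp, then name(args) = result). The helper name count_syscalls and the sample array are illustrative, not part of any existing tool.

```ruby
# Tally syscall names in strace output.
# Assumes each line looks like "<timestamp> name(args) = result", as in the
# excerpts above; lines that do not match (e.g. "[...]") are skipped.
SYSCALL_LINE = /\A(?:\S+\s+)?([a-z_][a-z0-9_]*)\(/

def count_syscalls(lines)
  counts = Hash.new(0)
  lines.each do |line|
    match = SYSCALL_LINE.match(line)
    counts[match[1]] += 1 if match
  end
  counts
end

# A few lines copied from the "bad" trace, used here as sample input.
sample = [
  '2024-06-27T15:09:47.812621680Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=40934372}) = 0',
  '2024-06-27T15:09:47.812842620Z stat("/usr/local/lib/ruby/gems/3.3.0/specifications/default/benchmark-0.3.0.gemspec", {st_mode=S_IFREG|0644, st_size=907, ...}) = 0',
  '2024-06-27T15:09:47.814346873Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=41082553}) = 0',
  '[...]'
]

# Print the most frequent syscalls first.
count_syscalls(sample).sort_by { |_name, n| -n }.each do |name, n|
  puts format('%-16s %d', name, n)
end
```

To run it against a full trace, pass File.foreach with the path of the saved strace output (for example a hypothetical bad.strace) instead of the sample array, and compare the tallies of the two runs.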
I managed to reproduce the issue with a Dockerfile: https://github.com/dinatamas/fluentd-4545
There are two Dockerfiles: the "good" one uses ruby 2.5.9, and fluentd loads immediately; the "bad" one uses ruby 3.3.1, and fluentd startup takes a long time. I managed to reproduce the issue in multiple independent environments, so hopefully it will work.
This is just a comment to avoid flagging this issue as stale.
@kenhys I would be happy to try to investigate further, but I don't have ruby experience, do you perhaps have some tool suggestions or tips for debugging / what to look for?
Thanks @dinatamas , I can reproduce it.
I'm not sure why, but it seems that ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 is not a suitable parameter for ruby 3.3.1.
Please disable it and rebuild the image:
#ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9
I checked how much delay is observed in some more environments:
As for recent versions of Ruby, it seems that RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 may be too aggressive a setting, because it forces full GC.
I've added this issue to the official documentation.
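To check whether a GC setting such as the one discussed here leads to extra full collections, one can compare GC counters before and after running some code. This is a minimal sketch using Ruby's built-in GC.stat; the helper name gc_delta and the workload inside the block are illustrative assumptions and are not taken from fluentd.

```ruby
# Report how many minor and major (full) GC runs happen while a block executes.
def gc_delta
  before = GC.stat
  yield
  after = GC.stat
  {
    minor: after[:minor_gc_count] - before[:minor_gc_count],
    major: after[:major_gc_count] - before[:major_gc_count]
  }
end

# Illustrative workload (an assumption): load a standard library and
# allocate some short-lived strings.
delta = gc_delta do
  require 'json'
  100_000.times.map { |i| "item-#{i}" }
end

# Show the setting in effect (nil when the variable is not set).
puts "RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=#{ENV['RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR'].inspect}"
puts "minor GCs: #{delta[:minor]}, major GCs: #{delta[:major]}"
```

Running the script once with the variable set to 0.9 and once without it, under the same Ruby version, gives a rough indication of whether the number of major GCs differs between the two settings.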
Describe the bug
I am running fluentd in a docker container, and noticed that when I uplift ruby from version 2.5 (provided by the OS) to version 3.3 (built by me), then the startup time of fluentd increases significantly (from <1 second to multiple minutes).
The main bottleneck is the CPU:
and then later:
The high CPU load only stops once the "fluentd worker is now running worker" messages appear.
I ran an strace, and found that calls like the following happen hundreds of times each second:
To Reproduce
I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins. But for me it happens very reliably, each time I start the container. I have the full strace output, but it's 260'000 lines long, and 240'000 of them are just those clock_gettime() calls.
Your Environment
Note: Uplifting fluentd to v1.17.0 does not help.
Gemfile:
Your Configuration
I am deliberately running a simple - basically empty - config, and even with this the startup takes a very long time. It's dumped in the following error log. It also takes ~1 minute to simply execute fluentd --version.
Find the fluentd trace logs of the startup below. There is almost an entire minute between the first line (the fluentd command being issued) and the first log message from fluentd:
Your Error Log
Usually I don't get an error log, because things eventually start working correctly, it just takes a very long time...
When I hit Ctrl+C during fluentd startup (when it's taking 1 minute to load), I usually get a stacktrace that's very similar to this:
Additional context
Interestingly, the first log line from fluentd always appears almost exactly 1 minute after the command was issued, which might not be a coincidence.
But even after that, the load still takes very long and the CPU usage is ~100%. For example it takes 5-10 minutes to load my actual configuration, which has a lot of rules and uses multiple plugins like Prometheus, Kafka, OpenSearch. This took only 1-2 seconds before the ruby uplift.