fluent / fluent-operator

Operate Fluent Bit and Fluentd in the Kubernetes way - Previously known as FluentBit Operator
Apache License 2.0

bug: fluentd unable to start with error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory" #1187

Open joshuabaird opened 1 month ago

joshuabaird commented 1 month ago

Describe the issue

It seems a recent image was pushed to the kubesphere/fluentd:1.15.3 tag (docker.io/kubesphere/fluentd@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5) which breaks fluentd:

level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"

Pinning to a previous SHA fixes the issue -- kubesphere/fluentd:v1.15.3@sha256:794311919658aee8eb9829836cd6c3437dffd9c7112556d5dc2f01ca3fcb826b.
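As an illustration of the workaround above, here is a minimal Go sketch that checks whether an image reference is pinned to a sha256 digest instead of relying only on a mutable tag. The helper name isDigestPinned is hypothetical and not part of fluent-operator, and the check is a simple string test rather than a full image reference parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sha256Digest matches a well-formed digest: "sha256:" followed by 64 hex characters.
var sha256Digest = regexp.MustCompile(`^sha256:[a-f0-9]{64}$`)

// isDigestPinned reports whether ref carries an "@sha256:<64 hex chars>" suffix.
// Hypothetical helper, for illustration only.
func isDigestPinned(ref string) bool {
	_, digest, found := strings.Cut(ref, "@")
	return found && sha256Digest.MatchString(digest)
}

func main() {
	refs := []string{
		// Pinned reference quoted in this issue.
		"kubesphere/fluentd:v1.15.3@sha256:794311919658aee8eb9829836cd6c3437dffd9c7112556d5dc2f01ca3fcb826b",
		// Tag only: the content behind it can change when the tag is re-pushed.
		"kubesphere/fluentd:1.15.3",
	}
	for _, ref := range refs {
		fmt.Println(isDigestPinned(ref), ref)
	}
}
```

A check like this could run in CI to flag manifests that reference a tag without a digest, which is the situation that let the rebuilt image break running deployments.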

To Reproduce

Re-pull the latest SHA of the kubesphere/fluentd:1.15.3 tag.

Expected behavior

Fluentd should start.

Your Environment

N/A

How did you install fluent operator?

No response

Additional context

No response

joshuabaird commented 1 month ago

@benjaminhuo @wenchajun Can someone please review this?

m-gavrilyuk commented 1 month ago

same error on fluent/fluent-operator/fluentd:v2.8.0

rurus9 commented 4 weeks ago

I think that in general each new image should have a new tag (does not apply to floating tags, like "latest").

joshuabaird commented 3 weeks ago

I would agree. Folks rely on versioned tags for stability and they should be immutable. If these images are going to be rebuilt for whatever reason, perhaps an internal "patch" version should be added (eg, v1.15.3.x).

benjaminhuo commented 3 weeks ago

This might be related to the CI changes we made recently, cc @sarathchandra24

https://github.com/fluent/fluent-operator/pull/1183 https://github.com/fluent/fluent-operator/pull/1079

I also remember there is a PR for a similar issue from @sarathchandra24 https://github.com/fluent/fluent-operator/pull/1093

Would you help to take a look? @sarathchandra24

Thanks

benjaminhuo commented 3 weeks ago

I've built a fluentd v1.17.0 image.

image

joshuabaird commented 3 weeks ago

@benjaminhuo It looks like the 1.17.0 image has the same bug for x86_64 images. Is this expected?

sarathchandra24 commented 3 weeks ago

Sorry for the late response everyone, I realized the problem after running it locally.

The root cause is defaultBinPath at main.go#L22: for amd64 it is "/usr/bin/fluentd", and for arm64 it is "/usr/local/bundle/bin/fluentd".

Creating a PR for logic to choose path based on arch.
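A minimal sketch of what architecture-based path selection could look like in Go, assuming only the two paths quoted above. The function names binPathForArch and firstExisting are hypothetical; the actual PR may be structured differently.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// binPathForArch returns the fluentd binary path reported in this thread for
// the given architecture. Treating every non-arm64 architecture like amd64 is
// an assumption made for this sketch.
func binPathForArch(arch string) string {
	if arch == "arm64" {
		return "/usr/local/bundle/bin/fluentd"
	}
	return "/usr/bin/fluentd"
}

// firstExisting returns the first candidate path accepted by exists.
// It is a hypothetical fallback for images whose layout does not match the
// architecture default.
func firstExisting(candidates []string, exists func(string) bool) (string, bool) {
	for _, p := range candidates {
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

func main() {
	preferred := binPathForArch(runtime.GOARCH)
	fmt.Printf("arch=%s preferred=%s\n", runtime.GOARCH, preferred)

	// Probe the filesystem, trying the architecture default first.
	onDisk := func(p string) bool {
		_, err := os.Stat(p)
		return err == nil
	}
	candidates := []string{preferred, "/usr/bin/fluentd", "/usr/local/bundle/bin/fluentd"}
	if p, ok := firstExisting(candidates, onDisk); ok {
		fmt.Println("found fluentd at", p)
	} else {
		fmt.Println("fluentd binary not found at any known path")
	}
}
```

Probing the candidate paths instead of trusting the architecture alone is a design choice worth considering, because the binary location depends on how the image was built and not strictly on the CPU architecture.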

benjaminhuo commented 3 weeks ago

Thank you very much @sarathchandra24

benjaminhuo commented 3 weeks ago

Both 1.15.3 and 1.17.0 have been updated. Would you try again, @joshuabaird?

image

joshuabaird commented 3 weeks ago

@benjaminhuo @sarathchandra24 The bug is still present in kubesphere/fluentd:1.17.0@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5:

fluentd-1 fluentd level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"
fluentd-1 fluentd level=info msg=backoff delay=4s

It looks like the v1.15.3 image is still broken as well.

sarathchandra24 commented 3 weeks ago

@joshuabaird May I ask which OS you are using?

Also, I think there is something wrong with the builds or build system.

image

You see the message

level=info msg="Current architecture" arch=amd64

Also for docker run sarathchandra24/fluentd-arm:local-arm1

image

You see the message

level=info msg="Current architecture" arch=arm64

But this is not the case while running docker run kubesphere/fluentd:1.17.0@sha256:bc06e880c224e76e659bf59250e5302ad159ee6b5474a2c5ee45f3a0969644c5

image

sarathchandra24 commented 3 weeks ago

@joshuabaird Can you please run

docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:095572fbf94ee3bbd01c0597b7b8a113c647e64ad2c53457c9c561432207f99d

and

docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:baac1724e2277baf50817d2612f06f0bf3b9050a77e1f7b78d351386b84541b7

To check if GitHub images are working

After inspecting images on GitHub

running: docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:095572fbf94ee3bbd01c0597b7b8a113c647e64ad2c53457c9c561432207f99d

image

We can see the message level=info msg="Current architecture" arch=amd64

running: docker run ghcr.io/fluent/fluent-operator/fluentd:v1.17@sha256:baac1724e2277baf50817d2612f06f0bf3b9050a77e1f7b78d351386b84541b7

image

We can see the message level=info msg="Current architecture" arch=arm64

joshuabaird commented 3 weeks ago

@sarathchandra24 Yeah - I'm not seeing the log statements on the images in Docker Hub. The images on GitHub do appear to be working as expected (I see the log statements).

We may have a CI problem with copying from GitHub to Docker Hub. I'll take a look at the CI runs and see if I can spot anything.

joshuabaird commented 3 weeks ago

@benjaminhuo Also, I just noticed that the fluentbit images aren't available on GitHub (ghcr.io) -- so we probably need to manually run the CI job that pushes them.

joshuabaird commented 3 weeks ago

It also looks like the v1.17.0 linux/amd64 image on GHCR is actually 1.15.3:

❯ docker run --platform linux/amd64 ghcr.io/fluent/fluent-operator/fluentd:v1.17.0
level=info msg="Current architecture" arch=amd64
level=info msg="Fluentd started"

2024-06-05 16:03:02 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-05 16:03:02 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2024-06-05 16:03:02 +0000 [info]: gem 'fluentd' version '1.15.3'
...
2024-06-05 16:03:02 +0000 [info]: starting fluentd-1.15.3 pid=13 ruby="3.2.4"
2024-06-05 16:03:02 +0000 [info]: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"]
2024-06-05 16:03:02 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil

The linux/arm64 image however is actually v1.17.0:


❯ docker run --platform linux/arm64 ghcr.io/fluent/fluent-operator/fluentd:v1.17.0
Unable to find image 'ghcr.io/fluent/fluent-operator/fluentd:v1.17.0' locally
v1.17.0: Pulling from fluent/fluent-operator/fluentd
Digest: sha256:4651f4340241b53534c5b481422082d9e785e4f9e86cd2d027a51f61e521fe2e
Status: Downloaded newer image for ghcr.io/fluent/fluent-operator/fluentd:v1.17.0
level=info msg="Current architecture" arch=arm64
level=info msg="Fluentd started"
2024-06-05 16:04:25 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-05 16:04:25 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2024-06-05 16:04:25 +0000 [info]: gem 'fluentd' version '1.17.0'
...
2024-06-05 16:04:25 +0000 [info]: starting fluentd-1.17.0 pid=14 ruby="3.3.2"
2024-06-05 16:04:25 +0000 [info]: spawn command to main:  cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bundle/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"]
2024-06-05 16:04:25 +0000 [info]: #0 init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-06-05 16:04:25 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.*" type="null"
2024-06-05 16:04:25 +0000 [info]: #0 starting fluentd worker pid=23 ppid=14 worker=0
2024-06-05 16:04:25 +0000 [info]: #0 fluentd worker is now running worker=0
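
The comparison above was done by reading the startup logs by eye. As a sketch of how the same check might be automated, the Go snippet below extracts the version from the "gem 'fluentd' version" log line and compares it with the image tag. The function names are hypothetical, and the regular expression assumes the log format shown in this comment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// gemVersionRe matches the startup line that fluentd prints in the logs above,
// for example: gem 'fluentd' version '1.15.3'
var gemVersionRe = regexp.MustCompile(`gem 'fluentd' version '([^']+)'`)

// fluentdVersionFromLog returns the fluentd version found in log, if any.
// Hypothetical helper, for illustration only.
func fluentdVersionFromLog(log string) (string, bool) {
	m := gemVersionRe.FindStringSubmatch(log)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// versionMatchesTag compares a tag such as "v1.17.0" with a version such as
// "1.17.0", ignoring one optional leading "v" on the tag.
func versionMatchesTag(tag, version string) bool {
	return strings.TrimPrefix(tag, "v") == version
}

func main() {
	// Log line taken from the linux/amd64 run of the v1.17.0 tag above.
	line := `2024-06-05 16:03:02 +0000 [info]: gem 'fluentd' version '1.15.3'`
	version, ok := fluentdVersionFromLog(line)
	if !ok {
		fmt.Println("no fluentd version line found")
		return
	}
	fmt.Printf("tag=v1.17.0 actual=%s match=%t\n", version, versionMatchesTag("v1.17.0", version))
}
```

Running such a check per platform after each image build would surface a tag whose contents do not match its name before the image is copied to another registry.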

benjaminhuo commented 3 weeks ago

@joshuabaird I've added you as a maintainer, and you can trigger the image build here:

image

joshuabaird commented 3 weeks ago

@benjaminhuo @sarathchandra24 Is the intention to build and maintain both v1.15.3 and v1.17.0 fluentd images? Even if you pass 1.17.0 to the workflow, the Dockerfile still installs v1.15.3 here:

https://github.com/fluent/fluent-operator/blob/2de2b934e4d9f4475e278da7d9db74d89ef9a037/cmd/fluent-watcher/fluentd/Dockerfile.amd64#L30

So, if the intention is to build/maintain both v1.15.3 and 1.17.0, the Dockerfile will need to be modified.

benjaminhuo commented 2 weeks ago

@joshuabaird You're right, the fluentd version is hardcoded in the Dockerfile. We need to change that to use the new version of fluentd.

joshuabaird commented 2 weeks ago

@benjaminhuo But do we want to continue to support v1.15.3 or just modify the Dockerfiles to use 1.17.0?

benjaminhuo commented 2 weeks ago

We already have a 1.15.3 image built that meets some people's requirements, so I think we can move on to the latest version of fluentd. Anyone who needs an older version can switch the image back to it.

joshuabaird commented 2 weeks ago

@benjaminhuo https://github.com/fluent/fluent-operator/pull/1198

benjaminhuo commented 2 weeks ago

@joshuabaird The new fluentd image for 1.17 has been rebuilt after your PR. Would you give it a try?

image

joshuabaird commented 2 weeks ago

Things are looking good. I'm going to open a PR to update fluentbit and then we'll rebuild the fluentbit images so they get pushed to GHCR.

joshuabaird commented 1 week ago

@benjaminhuo Any idea why fluentd:v2.8.0 and fluent-bit:v2.8.0 tags exist?

This is confusing, because it's the operator tag, not the fluentd/fluent-bit tag. This is causing dependency update apps (like Dependabot/Renovate) to try and update these images.

Should we delete them?

image

benjaminhuo commented 1 week ago

@joshuabaird I can delete them; they were created by the wrong CI workflow.

benjaminhuo commented 1 week ago

The 2.8.0 images have all been deleted.

joshuabaird commented 1 week ago

@benjaminhuo Great, thank you!

vajgi90 commented 1 week ago

We use Fluent-operator version 2.7.0, which uses Fluentd v1.15.3. Unfortunately, we now get the same error:

level=info msg="backoff timer done" actual=16.013265218s expected=16s
level=error msg="start Fluentd error" error="fork/exec /usr/local/bundle/bin/fluentd: no such file or directory"
level=info msg=backoff delay=32s

This happens with the old image as well, which previously worked fine. What can I do to get Fluentd to start?

joshuabaird commented 1 week ago

@vajgi90 It looks like the amd64 image on Dockerhub for v1.15.3 has the bug. We'll try to get this fixed. Until then, you have two options:

vajgi90 commented 1 week ago

Great, thank you so much for the quick response!