Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/
1.93k stars 293 forks source link

[BUG] stack overflow : fluentbit sidecar container in fluxagent and fluxcontroller restarting continuously #4358

Open Galileo1 opened 3 weeks ago

Galileo1 commented 3 weeks ago

Describe the bug fluentbit the sidecar container in fluxagent and fluxcontroller is restarting quite a lot.

To Reproduce Steps to reproduce the behavior:

  1. Create a AKS cluster version 1.28.9 and install the microsoft.flux extension
  2. See error

Environment : Please note that right now we don't have the capability to configure flux-controller or fluentbit sidercar running as they are out of the box installed by the AKS extension.

Additional context Error:

runtime: morestack on g0, stack [0x7f0409c95640 0x7f0409c9da40], sp=0x7f0409c957b8, called from
runtime.exitsyscallfast.func1()
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xcd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc000230740, {0x7f0457156920, 0x14})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x3e
github.com/go-logr/logr.Logger.WithName({{0x7f045781aeb0, 0xc000230740}, 0x0}, {0x7f0457156920?, 0xc000492460?})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/github.com/go-logr/logr/logr.go:336 +0x36
sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc000554b48, {0x0, 0xc000492460, {0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:115 +0xa5
sigs.k8s.io/controller-runtime/pkg/client.New(0x7f04571534c3?, {0x0, 0xc000492460, {0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:101 +0x7d
github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer.(*KubeAPIServerQueries).initializeClientInternal(0xc0001d0750, {0x0?, 0x7f04577fe6e8?}, {0x7f0457816b00?, 0xc000558120?}, {0x0?, 0x0?}, {0x0?, 0x0?}, {0x0, ...}, ...)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer/APIServerQueries.go:239 +0x80d
github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer.(*KubeAPIServerQueries).InitializeClient(...)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer/APIServerQueries.go:114
main.FLBPluginFlushCtx(0x7f0455ea78f2?, 0x7f0409cb8a80, 0x514, 0x7f0458634f01?)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/http_mdm.go:124 +0x734
philip-bradshaw commented 2 weeks ago

We are also seeing this issue as well and it is causing a lot of noise in our monitoring, which makes it harder to identify actual issues.

stefanhenseler commented 1 week ago

Seeing the same issue. Fluent bit crashing in fluxconfig- pods almost every 10 minutes. We are running 1.29.2 and 1.29.4 AKS clusters and Flux Extension 1.10.0. The fluent bit version Fluent Bit v2.1.8:

MICROSOFT SOFTWARE LICENSE TERMS
MICROSOFT Azure Arc for Kubernetes
__________________________________

Fluent Bit v2.1.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2024/06/30 13:55:25] [ info] [fluent bit] version=2.1.8, commit=1d83649441, pid=1
{"Message":"2024/06/30 13:55:25 [http_mdm] id = \"http_mdm_plugin\"","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"<redacted>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.10.0","AgentTimestamp":"2024/06/30 13:55:25.045"}
[2024/06/30 13:55:25] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/06/30 13:55:25] [ info] [cmetrics] version=0.6.3
[2024/06/30 13:55:25] [ info] [ctraces ] version=0.3.1
[2024/06/30 13:55:25] [ info] [input:tail:tail.0] initializing
[2024/06/30 13:55:25] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/06/30 13:55:25] [ info] [sp] stream processor started
[2024/06/30 13:55:25] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2236522 watch_fd=1 name=/var/log/containers/fluxconfig-agent-8c696c474-l59vk_flux-system_fluxconfig-agent-3d589d77a49384913a1f9181847397faaa473a86d0864fc1f80d32c7491996c8.log
[2024/06/30 13:55:25] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2236586 watch_fd=2 name=/var/log/containers/fluxconfig-agent-8c696c474-l59vk_flux-system_fluent-bit-8eb27afc9151d92111997e56a96eba61dafe8ac09533170ca23de0cf194e1463.log
{"Message":"2024/06/30 14:00:24 [http_mdm] Flush called for id: http_mdm_plugin","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"<redacted>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.10.0","AgentTimestamp":"2024/06/30 14:00:24.742"}
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 34 [running, locked to thread]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xcd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc0002b0280, {0x7a5641756920, 0x14})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x3e
github.com/go-logr/logr.Logger.WithName({{0x7a5641e1aeb0, 0xc0002b0280}, 0x0}, {0x7a5641756920?, 0xc00033d570?})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/github.com/go-logr/logr/logr.go:336 +0x36
sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc00052a488, {0x0, 0xc00033d570, {0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:115 +0xa5
sigs.k8s.io/controller-runtime/pkg/client.New(0x7a56417534c3?, {0x0, 0xc00033d570, {0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0})
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:101 +0x7d
github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer.(*KubeAPIServerQueries).initializeClientInternal(0xc0002af050, {0x0?, 0x7a5641dfe6e8?}, {0x7a5641e16b00?, 0xc0002db620?}, {0x0?, 0x0?}, {0x0?, 0x0?}, {0x0, ...}, ...)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer/APIServerQueries.go:239 +0x80d
github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer.(*KubeAPIServerQueries).InitializeClient(...)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/vendor/github.com/azure-core/ClusterConfigurationAgent/Utils/KuberenetesAPIServer/APIServerQueries.go:114
main.FLBPluginFlushCtx(0x7a56404a78f2?, 0x7a55f48b89c0, 0x800, 0x7a5642c34f01?)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/http_mdm.go:124 +0x734
{"Message":"Using the extension bearer token provider","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"<redacted>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.10.0","AgentTimestamp":"2024/06/30 14:00:24.745"}
{"Message":"2024/06/30 14:00:26 [http_mdm] Flush called for id: http_mdm_plugin","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"<redacted>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.10.0","AgentTimestamp":"2024/06/30 14:00:26.342"}
{"Message":"2024/06/30 14:05:24 [http_mdm] Flush called for id: http_mdm_plugin","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"<redacted>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.10.0","AgentTimestamp":"2024/06/30 14:05:24.742"}
{"Message":"2024/06/30 14:05:25 [http_mdm] Flush called for id: http_mdm_plugin","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"<redacted>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.10.0","AgentTimestamp":"2024/06/30 14:05:25.042"}
runtime: morestack on g0, stack [0x7a55f489a1c0 0x7a55f48a25c0], sp=0x7a55f489a678, called from
runtime.newstack()
        /usr/local/go/src/runtime/stack.go:962 +0xbed fp=0x7a55f489a680 sp=0x7a55f489a678 pc=0x7a56404f604d
runtime.morestack()
        /usr/local/go/src/runtime/asm_amd64.s:616 +0x77 fp=0x7a55f489a688 sp=0x7a55f489a680 pc=0x7a564050e237

fatal error: morestack on g0

runtime stack:
runtime.throw({0x7a564174f1e3?, 0x0?})
        /usr/local/go/src/runtime/panic.go:1023 +0x5e fp=0xc000059f88 sp=0xc000059f58 pc=0x7a56404da4be
runtime.badmorestackg0.func1()
        /usr/local/go/src/runtime/proc.go:533 +0xe5 fp=0xc000059fe0 sp=0xc000059f88 pc=0x7a56404ddd45
runtime.switchToCrashStack0()
        /usr/local/go/src/runtime/asm_amd64.s:559 +0x34 fp=0xc000059ff0 sp=0xc000059fe0 pc=0x7a564050e1b4

goroutine 34 gp=0xc000190380 m=7 mp=0xc00018e008 [running, locked to thread]:
runtime.(*lockTimer).end(0xc000375858)
        /usr/local/go/src/runtime/mprof.go:601 +0xbc fp=0xc000375820 sp=0xc000375818 pc=0x7a56404d23fc
runtime.lock2(0x7a5642b91f18)
        /usr/local/go/src/runtime/lock_futex.go:91 +0x1c5 fp=0xc000375888 sp=0xc000375820 pc=0x7a56404ad425
runtime.lockWithRank(...)
        /usr/local/go/src/runtime/lockrank_off.go:24
runtime.lock(...)
        /usr/local/go/src/runtime/lock_futex.go:52
runtime.GOMAXPROCS(0x0)
        /usr/local/go/src/runtime/debug.go:21 +0x25 fp=0xc0003758b0 sp=0xc000375888 pc=0x7a56404a8785
sync.(*Pool).pinSlow(0x7a5642b05780)
        /usr/local/go/src/sync/pool.go:237 +0x165 fp=0xc000375948 sp=0xc0003758b0 pc=0x7a5640517025
sync.(*Pool).pin(0x7a5642b05780)
        /usr/local/go/src/sync/pool.go:217 +0x46 fp=0xc000375968 sp=0xc000375948 pc=0x7a5640516e86
sync.(*Pool).Get(0x7a5642b05780)
        /usr/local/go/src/sync/pool.go:132 +0x1c fp=0xc0003759a0 sp=0xc000375968 pc=0x7a5640516bdc
log.getBuffer(...)
        /usr/local/go/src/log/log.go:169
log.(*Logger).output(0xc000126690, 0x0, 0x2, 0xc000375de0)
        /usr/local/go/src/log/log.go:235 +0x297 fp=0xc000375b60 sp=0xc0003759a0 pc=0x7a564072f337
log.Printf(...)
        /usr/local/go/src/log/log.go:397
main.FLBPluginFlushCtx(0x7a56404a78f2?, 0x7a55f48b8080, 0x4b4, 0x7a5642c34f01?)
        /go/src/github.com/azure-core/ClusterConfigurationAgent/FluentBit/http_mdm.go:88 +0x145 fp=0xc000375e28 sp=0xc000375b60 pc=0x7a5641736a85
_cgoexp_6eaf2cdcfedc_FLBPluginFlushCtx(0x7a55f489a700)
        _cgo_gotypes.go:79 +0x29 fp=0xc000375e58 sp=0xc000375e28 pc=0x7a56417398a9
runtime.cgocallbackg1(0x7a5641739880, 0x7a55f489a700, 0x0)
        /usr/local/go/src/runtime/cgocall.go:403 +0x2a5 fp=0xc000375f18 sp=0xc000375e58 pc=0x7a56404a5e05
runtime.cgocallbackg(0x7a5641739880, 0x7a55f489a700, 0x0)
        /usr/local/go/src/runtime/cgocall.go:322 +0x138 fp=0xc000375f90 sp=0xc000375f18 pc=0x7a56404a5ab8
runtime.cgocallbackg(0x7a5641739880, 0x7a55f489a700, 0x0)
        <autogenerated>:1 +0x2b fp=0xc000375fb8 sp=0xc000375f90 pc=0x7a564051278b
runtime.cgocallback(0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/asm_amd64.s:1079 +0xcd fp=0xc000375fe0 sp=0xc000375fb8 pc=0x7a564050fead
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000375fe8 sp=0xc000375fe0 pc=0x7a5640510101

goroutine 17 gp=0xc000006700 m=1 mp=0xc000074008 [syscall, 14 minutes, locked to thread]:
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x7a5640510101

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle), 2 minutes]:
runtime.gopark(0x16a869b8f87459?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc00006efa8 sp=0xc00006ef88 pc=0x7a56404dd4ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:408
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:326 +0xb8 fp=0xc00006efe0 sp=0xc00006efa8 pc=0x7a56404dd378
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x7a5640510101
created by runtime.init.6 in goroutine 1
        /usr/local/go/src/runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc00006f780 sp=0xc00006f760 pc=0x7a56404dd4ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:408
runtime.bgsweep(0xc00007e000)
        /usr/local/go/src/runtime/mgcsweep.go:318 +0xdf fp=0xc00006f7c8 sp=0xc00006f780 pc=0x7a56404c799f
runtime.gcenable.gowrap1()
        /usr/local/go/src/runtime/mgc.go:203 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x7a56404bc265
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x7a5640510101
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [runnable]:
runtime.gopark(0x10000?, 0xce5fa?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc00006ff78 sp=0xc00006ff58 pc=0x7a56404dd4ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:408
runtime.(*scavengerState).park(0x7a5642b8d6a0)
        /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00006ffa8 sp=0xc00006ff78 pc=0x7a56404c5329
runtime.bgscavenge(0xc00007e000)
        /usr/local/go/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc00006ffc8 sp=0xc00006ffa8 pc=0x7a56404c58d9
runtime.gcenable.gowrap2()
        /usr/local/go/src/runtime/mgc.go:204 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x7a56404bc205
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x7a5640510101
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc000102700 m=nil [finalizer wait, 2 minutes]:
runtime.gopark(0x0?, 0xc0000123a8?, 0x50?, 0x60?, 0x1000000010?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000086e20 sp=0xc000086e00 pc=0x7a56404dd4ee
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:194 +0x107 fp=0xc000086fe0 sp=0xc000086e20 pc=0x7a56404bb227
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x7a5640510101
created by runtime.createfing in goroutine 1
        /usr/local/go/src/runtime/mfinal.go:164 +0x3d

goroutine 35 gp=0xc00051e000 m=nil [GC worker (idle), 9 minutes]:
runtime.gopark(0x16a7e55ad07540?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000524750 sp=0xc000524730 pc=0x7a56404dd4ee
runtime.gcBgMarkWorker()
        /usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0005247e0 sp=0xc000524750 pc=0x7a56404be365
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0005247e8 sp=0xc0005247e0 pc=0x7a5640510101
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 5 gp=0xc0000e61c0 m=nil [GC worker (idle), 4 minutes]:
runtime.gopark(0x16a82ac903a842?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000070750 sp=0xc000070730 pc=0x7a56404dd4ee
runtime.gcBgMarkWorker()
        /usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000707e0 sp=0xc000070750 pc=0x7a56404be365
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000707e8 sp=0xc0000707e0 pc=0x7a5640510101
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 6 gp=0xc0000e6380 m=nil [GC worker (idle), 2 minutes]:
runtime.gopark(0x16a869b91816cf?, 0x1?, 0x89?, 0x1?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000070f50 sp=0xc000070f30 pc=0x7a56404dd4ee
runtime.gcBgMarkWorker()
        /usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc000070fe0 sp=0xc000070f50 pc=0x7a56404be365
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000070fe8 sp=0xc000070fe0 pc=0x7a5640510101
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /usr/local/go/src/runtime/mgc.go:1234 +0x1c

goroutine 7 gp=0xc0000e6540 m=nil [GC worker (idle), 4 minutes]:
runtime.gopark(0x16a82ab7114739?, 0xc00004a4e0?, 0x1a?, 0xa?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000071750 sp=0xc000071730 pc=0x7a56404dd4ee
runtime.gcBgMarkWorker()
        /usr/local/go/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000717e0 sp=0xc000071750 pc=0x7a56404be365
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000717e8 sp=0xc0000717e0 pc=0x7a5640510101
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /usr/local/go/src/runtime/mgc.go:1234 +0x1c