Open cmacknz opened 1 month ago
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Briefly looking at the logs, I can see references such as net.cgoLookupHostIP, which belongs to the cgo (C library) DNS resolver. We could opt in to the netgo resolver instead.
Edit: The crash seems to be triggered in the call to reflect.implements
https://github.com/elastic/go-ucfg/blob/4fd3937/initializer.go#L39C29-L39C39
Does the issue happen if GODEBUG=netdns=go is set?
Does the issue happen if GODEBUG=netdns=go is set?
Also wondering about this. The cgo resolver uses threads, so in high-contention scenarios the netgo resolver might perform better by leveraging goroutines.
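For reference, besides the GODEBUG setting, Go's standard library also lets a program prefer the pure Go resolver through the PreferGo field of net.Resolver. The following is a minimal sketch of that option, not how Beats configures its resolver. The newGoResolver helper name, the 5-second timeout, and the "localhost" lookup are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newGoResolver returns a resolver that prefers Go's built-in DNS client
// over the cgo (getaddrinfo) path. The helper name is hypothetical.
func newGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	// Bound the lookup so a slow DNS server cannot block forever.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := newGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```

Lookups made through this resolver run on goroutines inside the Go runtime and do not call getaddrinfo through cgo, which is the call where the crash in this issue occurs.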
Does the issue happen if GODEBUG=netdns=go is set?
Confirmed that setting GODEBUG=netdns=go stops this from happening.
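As an illustration of applying this workaround from a launcher, the sketch below starts Filebeat with netdns=go merged into GODEBUG while preserving any other GODEBUG settings. It is only a sketch under stated assumptions: the withNetDNSGo helper is hypothetical, and the binary path and flags simply mirror the ./filebeat -e -c ./filebeat.yml invocation used in this thread.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// withNetDNSGo returns a copy of env in which GODEBUG contains netdns=go.
// Other GODEBUG settings are preserved; an existing netdns value is replaced.
// The helper name is hypothetical.
func withNetDNSGo(env []string) []string {
	out := make([]string, 0, len(env)+1)
	found := false
	for _, kv := range env {
		if !strings.HasPrefix(kv, "GODEBUG=") {
			out = append(out, kv)
			continue
		}
		found = true
		var settings []string
		// GODEBUG is a comma-separated list of name=value pairs.
		for _, s := range strings.Split(strings.TrimPrefix(kv, "GODEBUG="), ",") {
			if s != "" && !strings.HasPrefix(s, "netdns=") {
				settings = append(settings, s)
			}
		}
		settings = append(settings, "netdns=go")
		out = append(out, "GODEBUG="+strings.Join(settings, ","))
	}
	if !found {
		out = append(out, "GODEBUG=netdns=go")
	}
	return out
}

func main() {
	// Assumed invocation: adjust the binary path and flags to your setup.
	cmd := exec.Command("./filebeat", "-e", "-c", "./filebeat.yml")
	cmd.Env = withNetDNSGo(os.Environ())
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "filebeat exited:", err)
	}
}
```

Setting the variable directly in the shell or in the service unit has the same effect; the helper only matters when GODEBUG may already carry other settings.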
There is a chance that this PR will fix it: https://github.com/elastic/beats/pull/41402 (it updates glibc from 2.28 to 2.31).
Sorry I didn't follow up on this topic. Did your PR fix the issue?
@pierrehilbert It still needs to be tested. It's quite hard to reproduce, but I can try. This change was not included in 8.16 due to a product decision, so the only option is to build Filebeat from source and run it in the Linux environment where the crash happens.
@weltenwort would you mind sharing your OS configuration so I can reproduce the environment? Or perhaps you'd be willing to test it yourself?
We need your Linux distribution, version, glibc version, etc.
Hi @rdner 👋
I'm running Arch with the stock kernel 6.11.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 22 Oct 2024 18:31:38 +0000 x86_64 GNU/Linux and glibc version 2.40+r16+gaa533d58ff-2, but only because I was on PTO for a few days. The kernel will certainly have been updated by now.
If you can assist me in setting up the build environment I could certainly test it on my machine.
@pierrehilbert I just had a call with @weltenwort and the issue persists despite the glibc update (2.28 to 2.31) in the Beats 8.17 binaries. It looks like the only action we can take now is to update the documentation and tell users to set GODEBUG=netdns=go if this crash occurs.
The good news is that it's a stable, deterministic crash, not flaky behavior.
OS: ArchLinux
Linux Kernel: 6.11.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 22 Oct 2024 18:31:38 +0000 x86_64 GNU/Linux
glibc: 2.40+r16+gaa533d58ff-2
Use the following filebeat.yml configuration and follow the instructions in the comments:

filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    enabled: true
    paths:
      - "/var/log/*.log" # check if you have matching files on your machine, change if necessary
path.data: "/tmp/filebeat" # so, nothing is left after the test runs
logging:
  level: debug # can be noisy but we would like to see everything
output.elasticsearch:
  # in case you build from sources, disables the compatibility check
  allow_older_versions: true
  # Create an API key, pick the Beats format (!!!) and copy it to the config file
  api_key: "<FIRST>:<SECOND>"
  # Open the deployment management and copy the Elasticsearch endpoint to the config file
  hosts: ["https://<HOSTNAME>:443"] # keep the 443 port.

Then run Filebeat, redirecting stderr to output.json:

./filebeat -e -c ./filebeat.yml 2> output.json
If the issue is present, filebeat will stop almost instantly and you will see the following stack trace at the end of output.json. The stack trace is identical between the 8.16 and 8.17 versions of Beats:
SIGSEGV: segmentation violation
PC=0x0 m=4 sigcode=1 addr=0x0
signal arrived during cgo execution

goroutine 52 gp=0xc00023c8c0 m=4 mp=0xc0000c1808 [syscall]:
runtime.cgocall(0x64f288625900, 0xc000f8b5a8)
	runtime/cgocall.go:157 +0x4b fp=0xc000f8b580 sp=0xc000f8b548 pc=0x64f2840a42cb
net._C2func_getaddrinfo(0xc000d2ad30, 0x0, 0xc00113a270, 0xc0000be848)
	_cgo_gotypes.go:105 +0x59 fp=0xc000f8b5a8 sp=0xc000f8b580 pc=0x64f28438e259
net._C_getaddrinfo.func1(0xc000d2ad30, 0x0, 0xc00113a270, 0xc0000be848)
	net/cgo_unix_cgo.go:78 +0x7a fp=0xc000f8b5f0 sp=0xc000f8b5a8 pc=0x64f28438ec5a
net._C_getaddrinfo(0xc0000141b0?, 0x9?, 0x0?, 0x0?)
	net/cgo_unix_cgo.go:78 +0x13 fp=0xc000f8b620 sp=0xc000f8b5f0 pc=0x64f28438eb93
net.cgoLookupHostIP({0x64f28862873e, 0x3}, {0xc0000141b0, 0x9})
	net/cgo_unix.go:168 +0x228 fp=0xc000f8b760 sp=0xc000f8b620 pc=0x64f284358028
net.cgoLookupIP.func1()
	net/cgo_unix.go:217 +0x25 fp=0xc000f8b790 sp=0xc000f8b760 pc=0x64f284358745
net.doBlockingWithCtx[...].func1()
	net/cgo_unix.go:56 +0x32 fp=0xc000f8b7e0 sp=0xc000f8b790 pc=0x64f28438efb2
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000f8b7e8 sp=0xc000f8b7e0 pc=0x64f28411a8e1
created by net.doBlockingWithCtx[...] in goroutine 51
	net/cgo_unix.go:54 +0xd8
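To check whether a captured output.json shows this particular crash, a small helper can look for the two markers visible in the trace above: the "signal arrived during cgo execution" line and a frame from the cgo DNS lookup path. This is only a sketch under assumptions: the isCgoResolverCrash name is hypothetical, and matching on these exact strings is a heuristic derived from this one trace, not a guaranteed signature.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isCgoResolverCrash reports whether output contains both the cgo crash
// marker and a frame from the cgo DNS lookup path seen in this thread.
// The function name and the matched strings are illustrative assumptions.
func isCgoResolverCrash(output string) bool {
	inCgo := strings.Contains(output, "signal arrived during cgo execution")
	inLookup := strings.Contains(output, "net._C_getaddrinfo") ||
		strings.Contains(output, "net.cgoLookupHostIP")
	return inCgo && inLookup
}

func main() {
	// Default to the file name used in the reproduction steps.
	path := "output.json"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read file:", err)
		return
	}
	if isCgoResolverCrash(string(data)) {
		fmt.Println("cgo resolver crash detected; try GODEBUG=netdns=go")
	} else {
		fmt.Println("no cgo resolver crash signature found")
	}
}
```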
We have an internal example of multiple Beats failing shortly after startup with a segmentation fault in CGO code. The exact path leading to this is not clear yet because the problem is in CGO, although we do have the stack trace, which is attached.
cgo_segfault.json