elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Linux] SIGSEGV: segmentation violation during cgo execution of cgoLookupIP and getaddrinfo #41398

Open cmacknz opened 1 month ago

cmacknz commented 1 month ago

We have an internal example of multiple Beats failing shortly after startup with a segmentation fault in cgo code. The exact path leading to this is not yet clear because the fault occurs inside cgo, although we do have the stack trace, which is attached.

{"log.level":"info","@timestamp":"2024-10-18T15:10:23.373Z","message":"running under elastic-agent, per-beat lockfiles disabled","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.origin":{"file.line":443,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).launch"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Starting stats endpoint","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"api","log.origin":{"file.line":69,"file.name":"api/server.go","function":"github.com/elastic/beats/v7/libbeat/api.(*Server).Start"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Syscall filter successfully installed","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"seccomp","log.origin":{"file.line":125,"file.name":"seccomp/seccomp.go","function":"github.com/elastic/beats/v7/libbeat/common/seccomp.loadFilter"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Beat info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","system_info":{"beat":{"path":{"config":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components","data":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/run/filestream-monitoring","home":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components","logs":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components/logs"},"type":"filebeat","uuid":"5a0b058b-04d4-4e07-b5cd-3a4aef38a2f7"},"ecs.version":"1.6.0"},"log.logger":"beat","log.origin":{"file.line":1385,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Build info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"beat","log.origin":{"file.line":1394,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","system_info":{"build":{"commit":"26daf71e4ec87172523af7f0e916cba9f79dc0d0","libbeat":"8.15.2","time":"2024-09-19T09:24:35.000Z","version":"8.15.2"},"ecs.version":"1.6.0"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Go runtime info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"beat","log.origin":{"file.line":1397,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","system_info":{"ecs.version":"1.6.0","go":{"arch":"amd64","max_procs":8,"os":"linux","version":"go1.22.6"}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.375Z","message":"Host info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"system_info":{"ecs.version":"1.6.0","host":{"architecture":"x86_64","boot_time":"2024-10-18T11:12:02+02:00","containerized":false,"id":"3fe2439e8486446eabcfaac351556a64","ip":["127.0.0.1","::1","10.0.0.45","fd00::9250:6d5f:2a99:b767","fe80::2078:f5bd:8159:2e29","10.0.0.47","fd00::9402:7f04:e6ae:472c","fe80::14c1:3059:f370:301a"],"kernel_version":"6.11.3-arch1-1","mac":["f8:75:a4:52:86:80","f8:75:a4:52:86:7f","24:41:8c:35:dd:51"],"name":"antiope","native_architecture":"x86_64\n","os":{"build":"rolling","family":"arch","major":0,"minor":0,"name":"Arch Linux","patch":0,"platform":"arch","type":"linux","version":""},"timezone":"CEST","timezone_offset_sec":7200}},"log.logger":"beat","log.origin":{"file.line":1403,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.375Z","message":"Process info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"beat","log.origin":{"file.line":1432,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","system_info":{"ecs.version":"1.6.0","process":{"capabilities":{"ambient":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read","perfmon","bpf","checkpoint_restore"],"effective":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read","perfmon","bpf","checkpoint_restore"],"inheritable":null,"permitted":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read","perfmon","bpf","checkpoint_restore"]},"cwd":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/run/filestream-monitoring","exe":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components/agentbeat","name":"agentbeat","pid":611948,"ppid":600393,"seccomp":{"mode":"filter","no_new_privs":true},"start_time":"2024-10-18T17:10:22.500+0200"}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.376Z","message":"Setup Beat: filebeat; Version: 8.15.2","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.origin":{"file.line":341,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).createBeater"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.376Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock (configured: unix:///opt/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock)","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"api","log.origin":{"file.line":71,"file.name":"api/server.go","function":"github.com/elastic/beats/v7/libbeat/api.(*Server).Start.func1"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.376Z","message":"Output is configured through Central Management","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.origin":{"file.line":373,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).createBeater"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.378Z","message":"Beat name: antiope","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"publisher","log.origin":{"file.line":105,"file.name":"pipeline/module.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.LoadWithSettings"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-18T15:10:23.381Z","message":"SIGSEGV: segmentation violation","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-18T15:10:23.381Z","message":"PC=0x0 m=4 sigcode=1 addr=0x0","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-18T15:10:23.381Z","message":"signal arrived during cgo execution","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0"}

cgo_segfault.json

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 1 month ago

Possibly relates to:

mauri870 commented 1 month ago

Briefly looking at the logs, I can see references such as net.cgoLookupHostIP, which belongs to the cgo-based DNS resolver (the one that calls getaddrinfo in libc). We could opt in to the netgo resolver instead.
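For illustration only, here is a minimal sketch of what opting in to the pure-Go resolver can look like from code, as opposed to the environment variable. It relies on the standard library's net.Resolver and its PreferGo field; the helper names newPureGoResolver and newDialer are hypothetical and this is not how Beats wires up its network clients.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver returns a resolver that prefers Go's built-in DNS
// client over the cgo path (getaddrinfo). The function name is made up
// for this sketch; PreferGo is a real field of net.Resolver.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

// newDialer attaches the pure-Go resolver to a dialer, so only
// connections made through this dialer are affected.
func newDialer() *net.Dialer {
	return &net.Dialer{
		Timeout:  10 * time.Second,
		Resolver: newPureGoResolver(),
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// An IP literal is returned without a DNS query, so this call
	// works even on a machine without network access.
	addrs, err := newPureGoResolver().LookupHost(ctx, "127.0.0.1")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

Unlike GODEBUG=netdns=go, which applies to the whole process, PreferGo is scoped to the resolver it is set on.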

Edit: The crash seems to be triggered in the call to reflect.implements https://github.com/elastic/go-ucfg/blob/4fd3937/initializer.go#L39C29-L39C39

rdner commented 1 month ago

Does the issue happen if GODEBUG=netdns=go is set?

mauri870 commented 1 month ago

Does the issue happen if GODEBUG=netdns=go is set?

Also wondering about this. The cgo resolver uses threads, so in high-contention scenarios the netgo resolver might perform better by leveraging goroutines.

cmacknz commented 1 month ago

Does the issue happen if GODEBUG=netdns=go is set?

Confirmed that setting GODEBUG=netdns=go stops this from happening.

rdner commented 1 month ago

There is a chance that https://github.com/elastic/beats/pull/41402 will fix it. The PR updates glibc from 2.28 to 2.31.

pierrehilbert commented 3 weeks ago

Sorry, I didn't follow up on this topic. Did your PR fix the issue?

rdner commented 3 weeks ago

@pierrehilbert it still needs to be tested. It's quite hard to reproduce, but I can try. This change was not included in 8.16 due to a product decision, so the only option is to build Filebeat from source and run it in the Linux environment where the crash happens.

@weltenwort would you mind sharing your OS configuration so I can reproduce the environment? Or perhaps you'd be willing to test it yourself?

We need your Linux distribution, version, glibc version, etc.

weltenwort commented 3 weeks ago

Hi @rdner 👋

I'm running Arch with the stock kernel 6.11.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 22 Oct 2024 18:31:38 +0000 x86_64 GNU/Linux and glibc version 2.40+r16+gaa533d58ff-2, but only because I was on PTO for a few days. The kernel will certainly have been updated by now.

If you can assist me in setting up the build environment I could certainly test it on my machine.

rdner commented 3 weeks ago

@pierrehilbert I just had a call with @weltenwort and the issue persists despite the glibc update (2.28 to 2.31) in Beats 8.17 binaries. Looks like the only action we can take now is to update the documentation and tell users to use GODEBUG=netdns=go if this crash occurs.

The good news is that it's a stable, deterministic crash, not flaky behavior.

Steps to reproduce

OS: Arch Linux
Linux kernel: 6.11.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 22 Oct 2024 18:31:38 +0000 x86_64 GNU/Linux
glibc: 2.40+r16+gaa533d58ff-2

  1. We need to test against a remote Elasticsearch cluster. The easiest way is to create a deployment on Elastic Cloud.
  2. Use this filebeat.yml configuration and follow instructions in the comments:
filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    enabled: true
    paths:
      - "/var/log/*.log" # check if you have matching files on your machine, change if necessary
path.data: "/tmp/filebeat" # so, nothing is left after the test runs

logging:
  level: debug # can be noisy but we would like to see everything

output.elasticsearch:
  # in case you build from sources, disables the compatibility check
  allow_older_versions: true
  # Create an API key, pick the Beats format (!!!) and copy it to the config file
  api_key: "<FIRST>:<SECOND>"
  # Open the deployment management and copy the Elasticsearch endpoint to the config file
  hosts: ["https://<HOSTNAME>:443"] # keep the 443 port.
  3. Run ./filebeat -e -c ./filebeat.yml 2> output.json

If the issue is present, Filebeat stops almost instantly and the following stack trace appears at the end of output.json. The stack trace is identical in the 8.16 and 8.17 versions of Beats:

SIGSEGV: segmentation violation
PC=0x0 m=4 sigcode=1 addr=0x0
signal arrived during cgo execution

goroutine 52 gp=0xc00023c8c0 m=4 mp=0xc0000c1808 [syscall]:
runtime.cgocall(0x64f288625900, 0xc000f8b5a8)
    runtime/cgocall.go:157 +0x4b fp=0xc000f8b580 sp=0xc000f8b548 pc=0x64f2840a42cb
net._C2func_getaddrinfo(0xc000d2ad30, 0x0, 0xc00113a270, 0xc0000be848)
    _cgo_gotypes.go:105 +0x59 fp=0xc000f8b5a8 sp=0xc000f8b580 pc=0x64f28438e259
net._C_getaddrinfo.func1(0xc000d2ad30, 0x0, 0xc00113a270, 0xc0000be848)
    net/cgo_unix_cgo.go:78 +0x7a fp=0xc000f8b5f0 sp=0xc000f8b5a8 pc=0x64f28438ec5a
net._C_getaddrinfo(0xc0000141b0?, 0x9?, 0x0?, 0x0?)
    net/cgo_unix_cgo.go:78 +0x13 fp=0xc000f8b620 sp=0xc000f8b5f0 pc=0x64f28438eb93
net.cgoLookupHostIP({0x64f28862873e, 0x3}, {0xc0000141b0, 0x9})
    net/cgo_unix.go:168 +0x228 fp=0xc000f8b760 sp=0xc000f8b620 pc=0x64f284358028
net.cgoLookupIP.func1()
    net/cgo_unix.go:217 +0x25 fp=0xc000f8b790 sp=0xc000f8b760 pc=0x64f284358745
net.doBlockingWithCtx[...].func1()
    net/cgo_unix.go:56 +0x32 fp=0xc000f8b7e0 sp=0xc000f8b790 pc=0x64f28438efb2
runtime.goexit({})
    runtime/asm_amd64.s:1695 +0x1 fp=0xc000f8b7e8 sp=0xc000f8b7e0 pc=0x64f28411a8e1
created by net.doBlockingWithCtx[...] in goroutine 51
    net/cgo_unix.go:54 +0xd8