elastic / otel-profiling-agent

The production-scale datacenter profiler (C/C++, Go, Rust, Python, Java, NodeJS, .NET, PHP, Ruby, Perl, ...)
Apache License 2.0
2.09k stars 229 forks source link

failed to load unwind_stop #11

Closed guolifu closed 2 months ago

guolifu commented 2 months ago

I tried to run it, but the program reported an error. I tested on three different kernels and found that this issue exists in all of them. centos7.9(5.4.219)Ubuntu(5.15-101、5.15-102)

root@ubuntu2204:/opt/github/otel-profiling-agent# ./otel-profiling-agent -disable-tls -bpf-log-level 2 -verbose -collection-agent=127.0.0.1:4317
time="2024-04-17T16:02:29.307804792Z" level=debug msg="Config:"
time="2024-04-17T16:02:29.307939433Z" level=debug msg="bpf-log-level: 2"
time="2024-04-17T16:02:29.307954760Z" level=debug msg="bpf-log-size: 65536"
time="2024-04-17T16:02:29.307963528Z" level=debug msg="cache-directory: /var/cache/otel/profiling-agent"
time="2024-04-17T16:02:29.307969970Z" level=debug msg="collection-agent: 127.0.0.1:4317"
time="2024-04-17T16:02:29.307976497Z" level=debug msg="config: /etc/otel/profiling-agent/agent.conf"
time="2024-04-17T16:02:29.307983778Z" level=debug msg="copyright: false"
time="2024-04-17T16:02:29.307989514Z" level=debug msg="disable-tls: true"
time="2024-04-17T16:02:29.307995965Z" level=debug msg="map-scale-factor: 0"
time="2024-04-17T16:02:29.308001980Z" level=debug msg="no-kernel-version-check: false"
time="2024-04-17T16:02:29.308009793Z" level=debug msg="probabilistic-interval: 1m0s"
time="2024-04-17T16:02:29.308015631Z" level=debug msg="probabilistic-threshold: 100"
time="2024-04-17T16:02:29.308021349Z" level=debug msg="project-id: 1"
time="2024-04-17T16:02:29.308027161Z" level=debug msg="secret-token: abc123"
time="2024-04-17T16:02:29.308033036Z" level=debug msg="t: all"
time="2024-04-17T16:02:29.308045268Z" level=debug msg="tags: "
time="2024-04-17T16:02:29.308053173Z" level=debug msg="tracers: all"
time="2024-04-17T16:02:29.308060595Z" level=debug msg="v: true"
time="2024-04-17T16:02:29.308068020Z" level=debug msg="verbose: true"
time="2024-04-17T16:02:29.308076287Z" level=debug msg="version: false"
time="2024-04-17T16:02:29.308088410Z" level=info msg="Starting OTEL profiling agent 1.0.0 (revision OTEL-review, build timestamp N/A)"
time="2024-04-17T16:02:29.338636218Z" level=debug msg="Validated tags: "
time="2024-04-17T16:02:29.345399048Z" level=debug msg="Traffic to 127.0.0.1 is routed from 127.0.0.1"
time="2024-04-17T16:02:29.345491693Z" level=debug msg="Reading the configuration"
time="2024-04-17T16:02:29.345521530Z" level=debug msg="Done setting configuration"
time="2024-04-17T16:02:29.345531389Z" level=debug msg="Determining tracers to include"
time="2024-04-17T16:02:29.345541098Z" level=debug msg="Tracer string: all"
time="2024-04-17T16:02:29.345547283Z" level=info msg="Interpreter tracers: perl,php,python,hotspot,ruby,v8"
time="2024-04-17T16:02:29.345553853Z" level=info msg="Automatically determining environment and machine ID ..."
time="2024-04-17T16:02:32.353551340Z" level=debug msg="Environment tester (azure) failed: failed to get azure metadata: Get \"http://169.254.169.254/metadata/instance/compute?api-version=2020-09-01&format=json\": dial tcp 169.254.169.254:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"
time="2024-04-17T16:02:32.952574146Z" level=debug msg="Environment tester (aws) failed: failed to get aws metadata: EC2MetadataRequestError: failed to get EC2 instance identity document\ncaused by: RequestError: send request failed\ncaused by: Get \"http://169.254.169.254/latest/dynamic/instance-identity/document\": dial tcp 169.254.169.254:80: connect: connection refused"
time="2024-04-17T16:02:35.486060964Z" level=debug msg="Environment tester (gcp) failed: failed to get GCP metadata: Get \"http://169.254.169.254/computeMetadata/v1/instance/id\": dial tcp 169.254.169.254:80: connect: connection refused"
time="2024-04-17T16:02:35.488877079Z" level=debug msg="Using MAC: 0x7EC699421C00"
time="2024-04-17T16:02:35.488927322Z" level=info msg="Environment: hardware, machine ID: 0xa1cd2c49e140a32f"
time="2024-04-17T16:02:35.489000264Z" level=info msg="Assigned ProjectID: 1 HostID: 1282730164693803823"
time="2024-04-17T16:02:35.492509164Z" level=debug msg="Traffic to 127.0.0.1 is routed from 127.0.0.1"
time="2024-04-17T16:02:35.661166452Z" level=debug msg="Size of eBPF map exe_id_to_10_stack_deltas: 65536"
time="2024-04-17T16:02:35.663494266Z" level=debug msg="Size of eBPF map exe_id_to_17_stack_deltas: 65536"
time="2024-04-17T16:02:35.666997122Z" level=debug msg="Size of eBPF map exe_id_to_15_stack_deltas: 65536"
time="2024-04-17T16:02:35.668244665Z" level=debug msg="Size of eBPF map exe_id_to_9_stack_deltas: 65536"
time="2024-04-17T16:02:35.669474442Z" level=debug msg="Size of eBPF map exe_id_to_16_stack_deltas: 65536"
time="2024-04-17T16:02:35.670690660Z" level=debug msg="Size of eBPF map exe_id_to_18_stack_deltas: 65536"
time="2024-04-17T16:02:35.672137478Z" level=debug msg="Size of eBPF map stack_delta_page_to_info: 65536"
time="2024-04-17T16:02:35.674369999Z" level=debug msg="Size of eBPF map exe_id_to_19_stack_deltas: 65536"
time="2024-04-17T16:02:35.676152405Z" level=debug msg="Size of eBPF map pid_page_to_mapping_info: 1048576"
time="2024-04-17T16:02:35.676185824Z" level=debug msg="Size of eBPF map exe_id_to_11_stack_deltas: 65536"
time="2024-04-17T16:02:35.677494926Z" level=debug msg="Size of eBPF map exe_id_to_13_stack_deltas: 65536"
time="2024-04-17T16:02:35.678897886Z" level=debug msg="Size of eBPF map exe_id_to_14_stack_deltas: 65536"
time="2024-04-17T16:02:35.680522577Z" level=debug msg="Size of eBPF map exe_id_to_21_stack_deltas: 65536"
time="2024-04-17T16:02:35.684883263Z" level=debug msg="Size of eBPF map exe_id_to_8_stack_deltas: 65536"
time="2024-04-17T16:02:35.686383625Z" level=debug msg="Size of eBPF map exe_id_to_12_stack_deltas: 65536"
time="2024-04-17T16:02:35.687660328Z" level=debug msg="Size of eBPF map exe_id_to_20_stack_deltas: 65536"
time="2024-04-17T16:02:35.694821056Z" level=error msg="load program: no space left on device: 174: (79) r1 = *(u64 *)(r0 +0): R0_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R6=c (truncated, 557 line(s) omitted)"
time="2024-04-17T16:02:35.694876996Z" level=error msg="Failed to load eBPF tracer: failed to load eBPF code: failed to load eBPF programs: failed to load unwind_stop"
root@ubuntu2204:/opt/github/otel-profiling-agent# ./otel-profiling-agent -disable-tls -bpf-log-level 2 -verbose -collection-agent=127.0.0.1:4317
time="2024-04-17T16:02:39.694513357Z" level=debug msg="Config:"
time="2024-04-17T16:02:39.694580570Z" level=debug msg="bpf-log-level: 2"
time="2024-04-17T16:02:39.694588005Z" level=debug msg="bpf-log-size: 65536"
time="2024-04-17T16:02:39.694593251Z" level=debug msg="cache-directory: /var/cache/otel/profiling-agent"
time="2024-04-17T16:02:39.694597447Z" level=debug msg="collection-agent: 127.0.0.1:4317"
time="2024-04-17T16:02:39.694601336Z" level=debug msg="config: /etc/otel/profiling-agent/agent.conf"
time="2024-04-17T16:02:39.694606120Z" level=debug msg="copyright: false"
time="2024-04-17T16:02:39.694609660Z" level=debug msg="disable-tls: true"
time="2024-04-17T16:02:39.694613448Z" level=debug msg="map-scale-factor: 0"
time="2024-04-17T16:02:39.694619625Z" level=debug msg="no-kernel-version-check: false"
time="2024-04-17T16:02:39.694724803Z" level=debug msg="probabilistic-interval: 1m0s"
time="2024-04-17T16:02:39.694731576Z" level=debug msg="probabilistic-threshold: 100"
time="2024-04-17T16:02:39.694735492Z" level=debug msg="project-id: 1"
time="2024-04-17T16:02:39.694739222Z" level=debug msg="secret-token: abc123"
time="2024-04-17T16:02:39.694742845Z" level=debug msg="t: all"
time="2024-04-17T16:02:39.694774500Z" level=debug msg="tags: "
time="2024-04-17T16:02:39.694779276Z" level=debug msg="tracers: all"
time="2024-04-17T16:02:39.694783254Z" level=debug msg="v: true"
time="2024-04-17T16:02:39.694786710Z" level=debug msg="verbose: true"
time="2024-04-17T16:02:39.694790439Z" level=debug msg="version: false"
time="2024-04-17T16:02:39.694795228Z" level=info msg="Starting OTEL profiling agent 1.0.0 (revision OTEL-review, build timestamp N/A)"
time="2024-04-17T16:02:39.726198380Z" level=debug msg="Validated tags: "
time="2024-04-17T16:02:39.728971090Z" level=debug msg="Traffic to 127.0.0.1 is routed from 127.0.0.1"
time="2024-04-17T16:02:39.729163735Z" level=debug msg="Reading the configuration"
time="2024-04-17T16:02:39.729240813Z" level=debug msg="Done setting configuration"
time="2024-04-17T16:02:39.729259307Z" level=debug msg="Determining tracers to include"
time="2024-04-17T16:02:39.729322476Z" level=debug msg="Tracer string: all"
time="2024-04-17T16:02:39.729352425Z" level=info msg="Interpreter tracers: perl,php,python,hotspot,ruby,v8"
time="2024-04-17T16:02:39.729362097Z" level=info msg="Automatically determining environment and machine ID ..."
time="2024-04-17T16:02:39.730364517Z" level=debug msg="Environment tester (azure) failed: failed to get azure metadata: Get \"http://169.254.169.254/metadata/instance/compute?api-version=2020-09-01&format=json\": dial tcp 169.254.169.254:80: connect: connection refused"
time="2024-04-17T16:02:40.304691298Z" level=debug msg="Environment tester (aws) failed: failed to get aws metadata: EC2MetadataRequestError: failed to get EC2 instance identity document\ncaused by: RequestError: send request failed\ncaused by: Get \"http://169.254.169.254/latest/dynamic/instance-identity/document\": dial tcp 169.254.169.254:80: connect: connection refused"
time="2024-04-17T16:02:40.672310810Z" level=debug msg="Environment tester (gcp) failed: failed to get GCP metadata: Get \"http://169.254.169.254/computeMetadata/v1/instance/id\": dial tcp 169.254.169.254:80: connect: connection refused"
time="2024-04-17T16:02:40.675907642Z" level=debug msg="Using MAC: 0x7EC699421C00"
time="2024-04-17T16:02:40.676237785Z" level=info msg="Environment: hardware, machine ID: 0xa1cd2c49e140a32f"
time="2024-04-17T16:02:40.676469358Z" level=info msg="Assigned ProjectID: 1 HostID: 1282730164693803823"
time="2024-04-17T16:02:40.679220758Z" level=debug msg="Traffic to 127.0.0.1 is routed from 127.0.0.1"
time="2024-04-17T16:02:40.857821007Z" level=debug msg="Size of eBPF map exe_id_to_14_stack_deltas: 65536"
time="2024-04-17T16:02:40.859576805Z" level=debug msg="Size of eBPF map exe_id_to_17_stack_deltas: 65536"
time="2024-04-17T16:02:40.861715648Z" level=debug msg="Size of eBPF map exe_id_to_18_stack_deltas: 65536"
time="2024-04-17T16:02:40.863167962Z" level=debug msg="Size of eBPF map exe_id_to_20_stack_deltas: 65536"
time="2024-04-17T16:02:40.865345841Z" level=debug msg="Size of eBPF map exe_id_to_9_stack_deltas: 65536"
time="2024-04-17T16:02:40.866614450Z" level=debug msg="Size of eBPF map exe_id_to_11_stack_deltas: 65536"
time="2024-04-17T16:02:40.868232931Z" level=debug msg="Size of eBPF map exe_id_to_16_stack_deltas: 65536"
time="2024-04-17T16:02:40.870837090Z" level=debug msg="Size of eBPF map exe_id_to_8_stack_deltas: 65536"
time="2024-04-17T16:02:40.872606707Z" level=debug msg="Size of eBPF map exe_id_to_15_stack_deltas: 65536"
time="2024-04-17T16:02:40.876671278Z" level=debug msg="Size of eBPF map exe_id_to_12_stack_deltas: 65536"
time="2024-04-17T16:02:40.878139667Z" level=debug msg="Size of eBPF map exe_id_to_21_stack_deltas: 65536"
time="2024-04-17T16:02:40.881467085Z" level=debug msg="Size of eBPF map pid_page_to_mapping_info: 1048576"
time="2024-04-17T16:02:40.881500935Z" level=debug msg="Size of eBPF map exe_id_to_13_stack_deltas: 65536"
time="2024-04-17T16:02:40.883343122Z" level=debug msg="Size of eBPF map exe_id_to_19_stack_deltas: 65536"
time="2024-04-17T16:02:40.885034327Z" level=debug msg="Size of eBPF map stack_delta_page_to_info: 65536"
time="2024-04-17T16:02:40.886369072Z" level=debug msg="Size of eBPF map exe_id_to_10_stack_deltas: 65536"
time="2024-04-17T16:02:40.893914976Z" level=error msg="load program: no space left on device: 174: (79) r1 = *(u64 *)(r0 +0): R0_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R6=c (truncated, 557 line(s) omitted)"
time="2024-04-17T16:02:40.893953659Z" level=error msg="Failed to load eBPF tracer: failed to load eBPF code: failed to load eBPF programs: failed to load unwind_stop"
root@ubuntu2204:/opt/github/otel-profiling-agent#

WechatIMG2934

inge4pres commented 2 months ago

hey @guolifu 😄 This line

load program: no space left on device: 174:

seems to indicate that the BPF programs cannot be loaded due to limited resources. Not sure exactly where the shortage is, pretty sure it's not on disk and the message is misleading.

I tested on three different kernels

Are you using VMs? If yes, how are they configured?

athre0z commented 2 months ago

That's weird. The bpf syscall is documented to return that status code under the following circumstances:

The eBPF program is too large or a map reached the
max_entries limit (maximum number of elements).

All kernel versions that you listed should have a limit of 1M instructions whereas we are trying to stay below 4K instructions (for compat with old kernels), so this is probably not it. At the same time, the map sizes that are being printed in your verbose output also don't look outrageously large.

What architecture is this? How many cores does the machine have? How did you build, make agent (uses Docker) or just plain make (uses host LLVM toolchain)?

guolifu commented 2 months ago

That's weird. The bpf syscall is documented to return that status code under the following circumstances:

The eBPF program is too large or a map reached the
max_entries limit (maximum number of elements).

All kernel versions that you listed should have a limit of 1M instructions whereas we are trying to stay below 4K instructions (for compat with old kernels), so this is probably not it. At the same time, the map sizes that are being printed in your verbose output also don't look outrageously large.

What architecture is this? How many cores does the machine have? How did you build, make agent (uses Docker) or just plain make (uses host LLVM toolchain)?

One of them is this

root@ubuntu2204:/opt/github/euspace# uname -a
Linux ubuntu2204 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cpu cores : 2
root@ubuntu2204:/opt/github/euspace# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 7373
max locked memory           (kbytes, -l) 250420
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 7373
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

make agent (uses Docker)

  1. make docker-image
  2. make agent
rockdaboot commented 2 months ago

@guolifu I could reproduce the issue in a fresh Ubuntu22.04 VM. The work-around is to drop the -bpf-log-level=2 option. We have to investigate what exactly happens when the log level is set.

guolifu commented 2 months ago

@guolifu我可以在新的 Ubuntu22.04 VM 中重现该问题。 解决方法是删除该-bpf-log-level=2选项。 我们必须调查设置日志级别时到底会发生什么。

You are right. Setting -bpf-log-level to either 1 or 2 will cause problems, while 0 will not cause any problems. It's strange.😂

rockdaboot commented 2 months ago

You have to set the -bpf-log-size values high enough (possibly depends on kernel version).

Works here on 6.7.9-amd64: -bpf-log-level=1 -bpf-log-size=524288 and -bpf-log-level=2 -bpf-log-size=8388608

guolifu commented 2 months ago

You have to set the -bpf-log-size values high enough (possibly depends on kernel version).

Works here on 6.7.9-amd64: -bpf-log-level=1 -bpf-log-size=524288 and -bpf-log-level=2 -bpf-log-size=8388608

Your help was very much appreciated,I'll give it a try👍