SIGSEGV/incorrect coverage when running kcov on rust proc-macro crates on x86_64-unknown-linux-musl target

roypat commented 1 year ago

Hello,

We recently started investigating some CI failures on a pull request to firecracker. Particularly, kcov seems to exit with kcov: Process exited with signal 11 (SIGSEGV).

We believe we managed to track down the issue to two culprits, the x86_64-unknown-linux-musl target that we build for, and the presence of procedural macros (there's the outliner of kcov also segfaulting inside the docker container we run our CI tests in, even with the toolchain specified as ...-gnu). Particularly, we managed to reproduce the segmentation fault on a different crate of ours which also provides procedural macros. On a fresh ubuntu 22.04.1 install that only has rust (with the musl target) and kcov installed, running the following commands

git clone https://github.com/firecracker-microvm/versionize_derive
cd versionize_derive
cargo check
cargo kcov --target=x86_64-unknown-linux-musl -- --verify

yields

kcov: 316 invalid breakpoints skipped in /home/ec2-user/versionize_derive/target/debug/deps/versionize_derive-b5dcc0d1e2c6e9c4

running 1 test
test common::tests::test_exists_at ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

kcov: Process exited with signal 11 (SIGSEGV) at 0xffffdeefffffffff
error: failed to get coverage
caused by: exit status: 255

Upon testing different procedural macro crates (the two in the firecracker PR, a minimal crashing example here and rocket) on the musl target, we observed that kcov doesn't always segfault, but it seems to always produce the invalid breakpoints, so these two things seem to be the common denominator here.

Running with --debug=4 yields a bunch of statements of the form kcov: Address 0x19ab is not at an instruction boundary, skipping, and running with --debug=15 yields output that ends with

0x56403af151e0 REPORT hit at 0x7ffff7abae78
PT continuing 20072 with signal 0
PT stopped PID 20072 0x00000b7f
PT signal 11 at 0x55555555532f for 20072
PT continuing 20072 with signal 11
PT stopped PID 20072 0x0000008b
PT terminating signal 11 at 0x55555555532f for 20072
PT continuing 20072 with signal 0
PT error for 20072: -1
Returning error

similar to #212 and #153. While the there proposed solution of limiting kcov to only run on the source directory, e.g.

cargo kcov --all --target x86_64-unknown-linux-musl -- --verify --include-pattern=$(pwd)

prevents the segmentation fault from occuring, it does not fix the invalid breakpoints. The output becomes

kcov: 316 invalid breakpoints skipped in /home/ec2-user/versionize_derive/target/debug/deps/versionize_derive-b5dcc0d1e2c6e9c4

running 1 test
test common::tests::test_exists_at ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

This is an issue for us, since these invalid breakpoint cause the generation of incorrect coverage reports. Specifically, running the above command in the srcdirectory of versionize_derive results in a coverage report that states 100% coverage, but only lists 17 lines as instrumented, even though the project is multiple hundred lines long.

SimonKagstrom commented 1 year ago

Thanks for the report, and I really appreciate the work you put behind reproducing it! Unfortunately, I think this is really more of a compiler bug than for kcov, although kcov tries to work around it. Since this case has so many invalid instructions, perhaps there is some systematic issue with the dwarf generation that triggers it? Perhaps it would be a good idea to report it to the toolchain people then.

It would be slightly interesting to see the disassembly of the addresses where the crash occurs. kcov (via binutils libbfd) disassembles the binaries to try to avoid this case, but apparently it misses these places.

I haven't seen so many invalid breakpoints in so few lines before, so it's not strange in this case that coverage collection also breaks down as a consequence.

roypat commented 1 year ago

Thanks for the quick response! Yeah, I was afraid it would be a rustc issue, you're probably right that something goes fundamentally wrong with the generation of the debug symbols for crates linking against proc-macro (we have another crate where kcov reports thousands of invalid breakpoints). I'll try to investigate the generated debug symbols and open an issue on the rust-lang repository then!

I'm using the minimal binary produced by https://github.com/roypat/break-kcov for the following debug info:

Running kcov through gdb yields

``` (gdb) run --verify . /home/ec2-user/break-kcov/target/debug/deps/break_kcov-051db159f39ecb65 Starting program: /home/ec2-user/.cargo/bin/kcov --verify . /home/ec2-user/break-kcov/target/debug/deps/break_kcov-051db159f39ecb65 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". [New Thread 0x7ffff2071700 (LWP 37895)] Detaching after fork from child process 37896. kcov: 287 invalid breakpoints skipped in /home/ec2-user/break-kcov/target/debug/deps/break_kcov-051db159f39ecb65 kcov: 3 invalid breakpoints skipped in /lib64/libc.so.6 kcov: 1 invalid breakpoints skipped in /lib64/libdl.so.2 kcov: 3 invalid breakpoints skipped in /lib64/libpthread.so.0 kcov: 1 invalid breakpoints skipped in /lib64/libm.so.6 kcov: 2 invalid breakpoints skipped in /lib64/librt.so.1 running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s kcov: Process exited with signal 11 (SIGSEGV) at 0x555555400a9b Thread 2 "kcov" received signal SIGTERM, Terminated. [Switching to Thread 0x7ffff2071700 (LWP 37895)] 0x00007ffff7849f1b in __libc_open64 (file=0xda0e20 ".//break_kcov-051db159f39ecb65.bd4880ab41b0eae9/kcov-solib.pipe", oflag=oflag@entry=0) at ../sysdeps/unix/sysv/linux/open64.c:46 46 return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS, (gdb) bt #0 0x00007ffff7849f1b in __libc_open64 (file=0xda0e20 ".//break_kcov-051db159f39ecb65.bd4880ab41b0eae9/kcov-solib.pipe", oflag=oflag@entry=0) at ../sysdeps/unix/sysv/linux/open64.c:46 #1 0x000000000045c65f in SolibHandler::solibThreadParse (this=0xd344e0) at /home/ec2-user/break-kcov/kcov-v36/src/solib-handler.cc:153 #2 0x000000000045c7f2 in SolibHandler::solibThreadMain (this=0xd344e0) at /home/ec2-user/break-kcov/kcov-v36/src/solib-handler.cc:194 #3 SolibHandler::threadStatic (pThis=0xd344e0) at /home/ec2-user/break-kcov/kcov-v36/src/solib-handler.cc:203 #4 0x00007ffff784044b in start_thread (arg=0x7ffff2071700) at pthread_create.c:465 #5 0x00007ffff650c56f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ```

Analysing the coredump the SIGSEGV produces with gdb yields

``` Dwarf Error: Cannot handle DW_FORM_ in DWARF reader [in module /home/ec2-user/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/libstd-5670385a2fe8b60b.so] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/home/ec2-user/break-kcov/target/debug/deps/break_kcov-051db159f39ecb65'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000555555400a9c in _fini () warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts of file /home/ec2-user/break-kcov/target/debug/deps/break_kcov-051db159f39ecb65. Use `info auto-load python-scripts [REGEXP]' to list them. (gdb) bt #0 0x0000555555400a9c in _fini () Backtrace stopped: Cannot access memory at address 0xffffe148 ```

Running objdump -d on the binary in question yields

0000000000000a94 <_fini>:
 a94:   48 83 ec 08             sub    $0x8,%rsp
 a98:   48 83 c4 08             add    $0x8,%rsp
 a9c:   c3                      retq

My best guess would be that the breakpoint somehow gets placed in the data part of either the add or the sub instruction, which corrupts the stack by moving the stack pointer too much, leading to retq to pop some nonsense address from the stack to jump back to?

roypat commented 1 year ago

Okay, should anyone else also run into this issue: when invoking rustc through cargo during cross-compilation, it ignores the value of the RUSTFLAGS environment variable. This variable is used to pass the -Clink-dead-code codegen option to rustc, which tells it to not eliminate dead code (due to how proc-macros work in rust, this would happen quite extensively otherewise). This is why the coverage reports only include so few instrumented lines.

Furthermore, without this option rust seems to generate invalid DWARF debug data in the sense that there's way to much of it, which is the reason for the kcov warnings.

I am closing this issue, since it's not a kcov bug, but rather an unfortunate interaction between a cargo feature and seemingly a rustc bug.

SimonKagstrom commented 1 year ago

Thanks a lot for the detailed analysis! I'm not a rust developer myself, but apparently many Rust users have these issues with kcov. I'll refer to your findings for hints on how to work around them - it sounds like this can be one of the root causes for many of these issues.

SimonKagstrom / kcov

SIGSEGV/incorrect coverage when running kcov on rust proc-macro crates on x86_64-unknown-linux-musl target #386