SIGSEGV running Rust tests under kcov

mthebridge commented 7 years ago

I'm hitting an issue where running a Rust test binary under kcov segfaults, but running the binary natively is fine:

$ LD_LIBRARY_PATH=/opt/rust/.multirust/toolchains/stable-x86_64-unknown-linux-gnu/lib/ kcov --exclude-pattern=/opt/rust/.cargo --verify /data/mlt/rustproj/slog-extlog/slog-extlog-derive/target/cov/slog_extlog_derive-33e2f62c06364b1a/ /data/mlt/rustproj/slog-extlog/target/debug/deps/slog_extlog_derive-33e2f62c06364b1a 
kcov: 94 invalid breakpoints skipped in /opt/rust/.multirust/toolchains/stable-x86_64-unknown-linux-gnu/lib/libstd-f4594d3e53dcb114.so
kcov: Process exited with signal 11 (SIGSEGV) at 0x7ffff7105131

Obviously due to the way kcov works I can't run it in a debugger, but if I dump a core and load that in gdb I can see the segfault is somewhere in jemalloc in thread initialization.

Unfortunately I can't seem to generate a simple example for repro, and I can't share the underlying code, but running under kcov v33 works as expected, indicating this is a regression - and indeed git bisect found that the commit at fault is e620da497dfcf93eefef6e582c68b68b9303f84c, which fixed #180. Annoyingly, the reason I want to run off master and not from the v33 release is because I want to be able to exclude lines within my project!

I appreciate there's not a lot to go on here - let me know if there's any useful diagnsotics I can send!

SimonKagstrom commented 7 years ago

Don't be so humble! You've even found the faulty commit! That said, it's a bit surprising that this particular commit causes a crash: it shouldn't change breakpoint setting very much.

Do you use the #180 feature? I.e., via LCOV_EXCL_START?

Is this consistent across every run, or does it sometimes work?

mthebridge commented 7 years ago

Thanks for the speedy reply! It doesn't matter whether I'm using LCOV_EXCL_START or not, and yes it's 100% consistent - always fails on master, always works with v33...

SimonKagstrom commented 7 years ago

And the commit immediately before e620da4 (i.e., 860a324) always work? Sorry for doubting git bisect a bit ;-)

mthebridge commented 7 years ago

Yes. I just manually checked out those two revs and compared.

On e620da4:

$ cargo kcov
error: failed to get coverage
caused by: signal: 11

On 860a324:

$ cargo kcov
kcov: 0 invalid breakpoints skipped...
<snipped>

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured

SimonKagstrom commented 7 years ago

OK, great!

If you run kcov with --debug=15, you should be able to see how/if the breakpoint setting changes between the two versions. I think there must be something wrong related to breakpoint setting, so that a breakpoint gets misplaced (i.e. what --verify works around) and therefore causes the crash.

The output will be quite long, but should be possible to pass to diff.

mthebridge commented 7 years ago

Hmm - if I set debug=15 then both cases fail! The previously working case gives me:

$ cargo kcov -- --debug=15 > kcov_debug.out
error: failed to get coverage
caused by: exit code: 255

And the end of the debug output for this case is:

PT stopped PID 8671 0x0000057f
PT BP at 0x7ffff71050cd:-1 for 8671
REPORT hit at 0x7ffff71050cd
PT continuing 8671 with signal 0
PT stopped PID 8671 0x0000057f
PT BP at 0x7ffff71050db:-1 for 8671
REPORT hit at 0x7ffff71050db
PT continuing 8671 with signal 0
PT stopped PID 8671 0x00000b7f
PT signal 11 at 0x7ffff71050da for 8671
PT continuing 8671 with signal 11
PT stopped PID 8671 0x0000000b
PT terminating signal 11 at 0x7ffff71050da for 8671
PT continuing 8671 with signal 0
PT error for 8671: -1
Returning error

The previously broken case has 7000 lines of debug output, the previously working case 220000 - I assume you don't want to see 200k lines of diff! The 7000 lines in the "broken" case are the same as the first 7000 in the "working" case, but for the "broken" one the output just stops mid-line. No obvious errors in the "broken" version.

SimonKagstrom commented 7 years ago

OK, strange. That almost sounds like kcov itself is crashing (as in #207) if it stops mid-line. Do you get a core file for kcov in that case?

I'm busy with real-life things right now (repainting a room), but I'll take a deeper look at the commit after that to see if I can figure out why it breaks things.

mthebridge commented 7 years ago

No problem about the delay - enjoy your redecorating!

You're right - running it with debug causes kcov itself to crash, and I do get a core.

Output from loading in gdb (running kcov at e620da4, Linux x86_64):

$ gdb `which kcov` core.7908 
<snip>
Core was generated by `kcov --debug=15 /data/mlt/rustproj/slog-extlog/slog-extlog-derive/target/cov/sl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7e4b234295 in std::string::find(char const*, unsigned long, unsigned long) const () from /lib64/libstdc++.so.6
(gdb) bt
#0  0x00007f7e4b234295 in std::string::find(char const*, unsigned long, unsigned long) const () from /lib64/libstdc++.so.6
#1  0x0000000000430fd1 in find (__pos=0, __str="LCOV_EXCL_LINE", this=0x0) at /usr/include/c++/4.8.2/bits/basic_string.h:1848
#2  match (line=<error reading variable: Cannot access memory at address 0x0>, lineNr=<optimized out>, filePath=..., this=0x24eb0d0) at /data/mlt/kcov/src/filter.cc:144
#3  Filter::runLineFilters (this=<optimized out>, filePath=..., lineNr=<optimized out>, line=<error reading variable: Cannot access memory at address 0x0>) at /data/mlt/kcov/src/filter.cc:107
#4  0x000000000043dcca in Reporter::onLine (this=0x24eae90, file="/checkout/src/libcore/lib.rs", lineNr=1, addr=93824992326304) at /data/mlt/kcov/src/reporter.cc:328
#5  0x0000000000450892 in ElfInstance::onLine (this=0x85c020 <g_instance>, file=..., lineNr=1, addr=94880) at /data/mlt/kcov/src/parsers/elf-parser.cc:692
#6  0x0000000000453d7d in kcov::DwarfParser::forEachLine (this=this@entry=0x7fff4bebd430, listener=...) at /data/mlt/kcov/src/parsers/dwarf.cc:112
#7  0x00000000004509b0 in ElfInstance::parseOneDwarf (this=this@entry=0x85c020 <g_instance>, relocation=relocation@entry=93824992231424) at /data/mlt/kcov/src/parsers/elf-parser.cc:420
#8  0x0000000000452b2d in ElfInstance::doParse (this=0x85c020 <g_instance>, relocation=93824992231424) at /data/mlt/kcov/src/parsers/elf-parser.cc:305
#9  0x000000000045535b in checkSolibData (this=0x24eb9d0) at /data/mlt/kcov/src/solib-handler.cc:228
#10 SolibHandler::onTick (this=0x24eb9d0) at /data/mlt/kcov/src/solib-handler.cc:78
#11 0x000000000041ab0c in tick (this=0x24eb220) at /data/mlt/kcov/src/collector.cc:72
#12 Collector::run (this=0x24eb220, filename=...) at /data/mlt/kcov/src/collector.cc:53
#13 0x0000000000418d6a in main (argc=<optimized out>, argv=<optimized out>) at /data/mlt/kcov/src/main.cc:298

Looks like a null pointer dereference to me - seems that FileLineHandler::match doesn't cope with being passed a null string for line. Do shout if you want me to send you the core (suspect it;s too large to atatch here).

SimonKagstrom commented 7 years ago

Yes, that sounds like it!

Is the /checkout/src/libcore/lib.rs present in your system?

Anyway, with that finding, it should be pretty easy to fix the problem. Good catching!

SimonKagstrom commented 7 years ago

Hmm... match takes a reference for line, so it really shouldn't be null. Could you mail me the source code of lib.rs? simon.kagstrom@gmail.com is the address.

I'll see if I can reproduce this in a unit test, although I'll have to warn you that it might be a few days until I look at this again.

mthebridge commented 7 years ago

Well, libcore/lib.rs is from the core Rust library, so not something I have my own copy of. I imagine it's this file though!

SimonKagstrom commented 7 years ago

Well, something is strange if the source code isn't present on your system: Then it should not be able to read the file to start with. I've read the code of source-file-cache.cc again, but I can't see that it should be possible to end up in this situation.

Could you mail me the core file?

mthebridge commented 7 years ago

Thanks Simon - I've sent you a mail.

SimonKagstrom commented 7 years ago

Unfortunately, I guess I would need your kcov binary as well (matching the core file), I don't get any symbols in the backtrace. One other thing though: Could you build kcov in debug-mode, i.e., configure it with

cmake -DCMAKE_BUILD_TYPE=Debug <...>

? Perhaps the backtrace would be easier to follow in this case. At least the "optimized out" stuff should be gone then.

mthebridge commented 7 years ago

The debug build is a good call - I may have got somewhere with that:

(gdb) up 3
#3  0x00000000004f2bfb in Reporter::onLine (this=0x1e14f20, file="/checkout/src/libcore/lib.rs", lineNr=1, addr=93824992326304) at /data/mlt/kcov/src/reporter.cc:328
328                 !m_filter.runLineFilters(file, lineNr, lines[lineNr - 1]))
(gdb) p file
$10 = "/checkout/src/libcore/lib.rs"
(gdb) p *fp
$11 = {m_fileHash = 6018595602966914235, m_lines = std::vector of length 2, capacity 2 = {0x0, 0x2a451c0}, m_nrLines = 1}
(gdb) p *line
$12 = {m_addrs = std::vector of length 0, capacity 0, m_lineId = 2419036335428861953, m_order = 0, m_unreachable = false}
(gdb) p lines
$13 = std::vector of length 0, capacity 0

I wonder if the fact that file doesn't actually exist at the path specified is the problem? The /checkout/src/libcore is part of the Rust compiler code itself, not my code - but my Rust library does have a compiler plugin.

Looks like if that is the case, then the call to ISourceFileCache::getInstance().getLines(file) in reporter.cc is going to return an empty vector...

SimonKagstrom commented 7 years ago

Well, it should return an empty vector, but then the number of lines should be zero, and not 1 as in this case. I think that's probably the root of the problem.

Not at the computer now, but I think the m_empty member in source-file-cache.cc maybe isn't set up correctly.

mthebridge commented 7 years ago

OK, so this issue with the null reference is a bit of a red herring. :(

The test I see at that commit is:

if (lines.size() >= lineNr - 1 &&
                !m_filter.runLineFilters(file, lineNr, lines[lineNr - 1]))

And lines.size() will be 0, but lineNr is 1. If I change the code to explicitly also cover the case when `lines.size() is 0, then everything passes...

BUT you've fixed this in master, by switching to a for loop:

for (unsigned int nr = 1; nr <= lines.size(); nr++) {
                if (!m_filter.runLineFilters(file, lineNr, lines[nr - 1]))

And running from the master, kcov itself doesn't segfault, the instrumented binary does.

However I switched back to running with --debug=15 and that fails even back obn v33!

So I suspect that this second bug isn't a regression, it's always been a problem, but somehow before the introduction of the line filters it only manifests with debug tracing on.

The end of the debug trace shows:

REPORT hit at 0x7ffff71050db
PT continuing 13400 with signal 0
PT stopped PID 13400 0x00000b7f
PT signal 11 at 0x7ffff71050da for 13400
PT continuing 13400 with signal 11
PT stopped PID 13400 0x0000008b
PT terminating signal 11 at 0x7ffff71050da for 13400

And 0x7ffff71050db is inside malloc()...

SimonKagstrom commented 7 years ago

OK, too bad - I had forgot that change myself. Sounds like some kind of race then, but difficult to say where it might be.

Things to test:

Run kcov with --skip-solibs
Does the crashing code have source code? (I guess the one calling malloc from Rust libs) If so, maybe try filtering it out with --exclude-path

Is the crash the same with --debug=15 for both v33 and master?

mthebridge commented 7 years ago

skip-solibs doesn't crash, but that's because of how Rust test binaries are built - by default everything is a library, and the test harness builds a minimal wrapper around it, so while it doesn't crash the coverage report is empty.

Yes, I get the same problem on master and v33.

Using --exclude-path makes no difference, but using --include-path $(pwd) to only cover my code DOES work. That's a workaround I can use easily enough - so if you want to close this issue, then go ahead!

One other thought - it's possible that this is actually a Rust bug of course, and rustc is generating bad debugging information in the binary...

SimonKagstrom commented 7 years ago

Well, normally you'd want to only cover your own code anyway, but it should still work I think.

Does --verify report any skipped breakpoints?

mthebridge commented 7 years ago

Sorry for delayed replay - --verify does report skipped breakpoints, but that happens for all Rust crates. I wonder if this is just another symptom of https://github.com/rust-lang/rust/issues/27324 ?

SimonKagstrom commented 7 years ago

OK, good. At least that's working as it should then.

It almost sounds like a timing issue I'd say. kcov (as you've no doubt seen!) tends to slow down execution quite a bit, and maybe that triggers some race somewhere?

zetok commented 7 years ago

Hi, just dropping by to say that I'm too affected by this. Looks like rustc bump (1.19.0 → 1.20.0) broke things completely:

Last working build: https://travis-ci.org/zetok/tox/jobs/270343484#L2215 Build with segfault: https://travis-ci.org/zetok/tox/jobs/271103907#L2367

SimonKagstrom commented 7 years ago

Does the workaround @mthebridge uses (--include-pattern=<path-to-your-own-source-code>) help for you as well?

zetok commented 7 years ago

@SimonKagstrom That works, thanks!

luser commented 6 years ago

One other thought - it's possible that this is actually a Rust bug of course, and rustc is generating bad debugging information in the binary...

FWIW, I dealt with a similar situation with Rust code in Firefox, and even contributed a patch to rustc to make things better. Generics in Rust do not get actually compiled until you instantiate them with a concrete type, and the Rust standard library has a lot of generics. This means that when you use like Option<Foo>, where Foo is defined in your code, the Rust compile will compile that object code into your binary when your crate is compiled. Additionally, since Rust doesn't support dynamic linking, the rest of the standard library will be statically linked into your final binary. This means that there are a lot of things that wind up in your binary for which the source paths will map back to the Rust standard library in whatever path it was built for the Rust toolchain you're using, and you are not going to have source for those.

It's possible to encounter this in C code, but less likely due to the fact that many C libraries are dynamically linked. If you were to link an external static library that contained debug info I'm pretty sure you'd hit the same issue, though.

SimonKagstrom commented 6 years ago

Thanks for the insight into rust!

Still, I don't think the missing source code as such is really the problem - that's quite common for C/C++ as well, and present no problem by itself. But the volume of instrumented code might certainly increase the chances of triggering a case of bad debug info.

This problem would have been so much simpler if x86 - with variable instruction lengths - weren't so prevalent.

Dushistov commented 6 years ago

I hitted by this issue today.

The steps to reproduce are simple:

$ cargo new --lib foo && cd foo && cargo test
...
$ export TEST_BIN=`find target/debug/ -executable -type f | grep 'target/debug/foo'`
$ /tmp/kcov/build/src/kcov target/cov $TEST_BIN
kcov: Process exited with signal 11 (SIGSEGV) at 0x5555555b9497

version of software that I use:

$ cargo --version
cargo 0.26.0-nightly (1d6dfea44 2018-01-26)
$ /tmp/kcov/build/src/kcov --version
kcov v34-25-g894e
$ uname -a
Linux retina 4.14.22-1-lts #1 SMP Sun Feb 25 15:13:49 CET 2018 x86_64 GNU/Linux

SimonKagstrom commented 6 years ago

@Dushistov try running kcov with --verify, it seems to be needed pretty much always with rust binaries.

Dushistov commented 6 years ago

@SimonKagstrom

Sorryk forgot to mention, I've tried both variants, but I hope that and without any flag it should not crash.

$ /tmp/kcov/build/src/kcov --verify  target/cov target/debug/foo-7cf2e66c10f1fab7
kcov: 213 invalid breakpoints skipped in /tmp/foo/target/debug/foo-7cf2e66c10f1fab7

running 1 test
test tests::it_works ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

$ /tmp/kcov/build/src/kcov --include-path .  target/cov target/debug/foo-7cf2e66c10f1fab7

running 1 test
test tests::it_works ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

SimonKagstrom commented 6 years ago

So it works with --verify and limiting the source path? Issue #82 explains a bit more in detail why verify is needed, it's basically a workaround for broken DWARF info. Limiting the source path probably means that the locations where the DWARF info points to the middle of instructions are simply ignored.

I agree it should work without any extra options, but since this appears to be quite common (not only with Rust), the only alternative would be to make --verify default. That would slow down execution fairly significantly though for the cases where it's not needed.

Arzte commented 6 years ago

Perhaps a warning noting that if it's not needed --noverify can be passed?

fschutt commented 6 years ago

The segfault occurs when you don't have libiberity-dev in your packages. To fix it:

addons:
  apt:
     update: true
     - libiberity-dev

For some reason, kcov can build without libiberity, but the segfaults at runtime.

SimonKagstrom commented 6 years ago

@fschutt kcov detects if binutils-dev is installed, and only then builds with disassembly, so it's optional.

It gives a warning though if you run with --verify without the disassembler. Rust binaries appear to be especially sensitive to this, so they typically don't work at all without discarding the invalid breakpoints.

@Arzte running without --verify is the default, so if you don't give any options it will run without verification. Sorry about not seeing your reply until now!

SimonKagstrom / kcov

SIGSEGV running Rust tests under kcov #212