SimonKagstrom / kcov

Code coverage tool for compiled programs, Python and Bash which uses debugging information to collect and report data without special compilation options
http://simonkagstrom.github.io/kcov/
GNU General Public License v2.0
713 stars 109 forks source link

kcov hanging on arm64 #267

Closed kannaiah closed 4 years ago

kannaiah commented 6 years ago

Tried kcov built for source and it is hanging.

Below is debug log using max level 31

REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../csu/errno-loc.c:25 at 0x7fb7997610
REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../csu/errno-loc.c:26 at 0x7fb7997610
BP registered at 0x7fb799761c
REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../csu/errno-loc.c:27 at 0x7fb799761c
REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../sysdeps/aarch64/crtn.S:41 at 0x7fb798904c
BP registered at 0x7fb7989050
REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../sysdeps/aarch64/crtn.S:42 at 0x7fb7989050
REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../sysdeps/aarch64/crtn.S:45 at 0x7fb799768c
BP registered at 0x7fb7997690
REPORT /build/glibc-Ceg4s8/glibc-2.24/nptl/../sysdeps/aarch64/crtn.S:46 at 0x7fb7997690
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
Looking for instruction 0.
Can't find breakpoint at 0x0
PT continuing 18871 with signal 0
PT stopped PID 18871 0x0000057f
PT BP at 0x0:-1 for 18871
SimonKagstrom commented 6 years ago

Thanks for the report! I believe you're the first one who's actually tried it on ARM64. It was implemented blindly since I don't have access to the actual hardware.

Anyway, it's almost certainly an error with how breakpoints are set, or how the PC is read out. I believe it's the latter since it looks for address 0x0. It's really just a constant, which you can find in src/engines/ptrace_linux.cc. The aarch64_pc value points into a vector, and is something I tried to read out from the kernel sources, but is apparently wrong.

If you feel like it, feel free to experiment a bit with it and see if it works better.

kannaiah commented 6 years ago

I think aarch64_PC value should be 32. I was able to verify that using a simple test. With that change, kcov progressed further but crashed at the end.

excerpt from strace output, kcov run on ls

ptrace(PTRACE_GETREGSET, 22110, NT_PRSTATUS, [{iov_base=0x7fe529f4a0, iov_len=272}]) = 0
ptrace(PTRACE_GETREGSET, 22110, NT_PRSTATUS, [{iov_base=0x7fe529f4a0, iov_len=272}]) = 0
ptrace(PTRACE_SETREGSET, 22110, NT_PRSTATUS, [{iov_base=0x7fe529f4a0, iov_len=8192}]) = 0
ptrace(PTRACE_PEEKTEXT, 22110, 0x7fb7fe0718, [0xd4200000]) = 0
ptrace(PTRACE_POKETEXT, 22110, 0x7fb7fe0718, 0xf8336b4017ffff29) = 0
ptrace(PTRACE_CONT, 22110, NULL, SIG_0) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=22110, si_uid=0, si_status=SIGSEGV, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSEGV}], __WALL, NULL) = 22110
ptrace(PTRACE_GETREGSET, 22110, NT_PRSTATUS, [{iov_base=0x7fe529f4a0, iov_len=272}]) = 0
ptrace(PTRACE_CONT, 22110, NULL, SIGSEGV) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=22110, si_uid=0, si_status=SIGSEGV, si_utime=0, si_stime=1} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], __WALL, NULL) = 22110
ptrace(PTRACE_GETREGSET, 22110, NT_PRSTATUS, [{iov_base=0x7fe529f4a0, iov_len=8192}]) = -1 ESRCH (No such process)
ptrace(PTRACE_CONT, 22110, NULL, SIG_0) = -1 ESRCH (No such process)
kill(22110, SIGTERM)                    = -1 ESRCH (No such process)
ptrace(PTRACE_DETACH, 22110, NULL, SIG_0) = -1 ESRCH (No such process)

I also tested with a simple test case which crashed initially but when I specified --include-path= to include only the test source it worked fine

SimonKagstrom commented 6 years ago

Unfortunately, ptrace isn't recursive, so kcov won't work at all when run under strace since it uses ptrace.

Anyway, very good progress! Looking forward to a pull request in the future :-)

kannaiah commented 6 years ago

Could you have a look at the debug log, if something is obvious from the crash? tracecrash.txt

SimonKagstrom commented 6 years ago

Hmm... It does look pretty good:

PT continuing 22409 with signal 0
PT stopped PID 22409 0x0000057f
PT BP at 0x7fb7fe0718:-1 for 22409
Looking for instruction 7fb7fe0718.
Can't find breakpoint at 0x7fb7fe0718
PT continuing 22409 with signal 0
PT stopped PID 22409 0x0000047f
PT BP at 0x7fb7fe071c:-1 for 22409
Looking for instruction 7fb7fe071c.
REPORT hit at 0x7fb7fe071c
PT continuing 22409 with signal 0
PT stopped PID 22409 0x00000b7f
PT signal 11 at 0x7fb7fe071c for 22409
PT continuing 22409 with signal 11
PT stopped PID 22409 0x0000000b
PT terminating signal 11 at 0x7fb7fe071c for 22409
kcov: Process exited with signal 11 (SIGSEGV) at 0x7fb7fe071c
PT continuing 22409 with signal 0
PT error for 22409: -1
Returning error

Can you disassemble and look what's at 0x7fb7fe071c?

You can also try to limit collection a bit by running with --skip-solibs --include-pattern=build/test.c, to only try to set breakpoints in the test code.

mmlb commented 5 years ago

@vielmetti may be able to get you an aarch64 box if you want to test yourself @SimonKagstrom via worksonarm.

vielmetti commented 5 years ago

Thanks @mmlb .

Are these tests running in a VM or on bare metal?

SimonKagstrom commented 5 years ago

I'm not sure how @kannaiah ran it, but I don't think it really matters if it runs in a (full system) VM or on hardware. A qemu installation should be enough, so I can probably set that up myself sometime, unless someone else does it first :-)

vielmetti commented 5 years ago

I built kcov on a 96-core Cavium ThunderX system under Ubuntu 16.04. I'm not getting any crashes, and I got sensible output when looking at a python3 program.

My minimum reproduction of a problem is this: when I try to trace the system ls, I get no output:

ed@ed-2a-bcc-llvm:~/src/github.com/SimonKagstrom/kcov/build$ kcov ~/tmp/ls ls
kcov: warning: kcov requires binaries built with -g/-ggdb, a build-id file
or GNU debug link information.

^Ckcov: Process exited with signal 2 (SIGINT) at 0x0
SimonKagstrom commented 5 years ago

Well, the system ls doesn't contain debug information, so it can't be traced by kcov. What you can do is to run kcov against itself, i.e.,

kcov /tmp/kcov kcov

although I believe it will behave in the same way as @kannaiah reported above, so it needs debugging.

MrXinWang commented 4 years ago

Hi @SimonKagstrom ! I am wondering if this arm64 support stuff is still on going. Referencing cargo-kcov#40, when I tried to run the coverage test on arm64, a warning popped out like this and the test hanged: kcov: warning: kcov: WARNING: kcov has been built without libbfd-dev (or kcov: binutils-dev), so the --verify option will not do anything.

However I did installed the required library. Using GDB for debugging this issue, it turns out the code hanged in the infinite loop of collector.cc: #4 0x0000aaaaaaac6e64 in Collector::run (this=0xaaaaab04a5a0, filename=...) at /home/michael/go/src/kcov/kcov/src/collector.cc:50

Full backtrace log:

# gdb /usr/local/bin/kcov
Reading symbols from /usr/local/bin/kcov...done.

(gdb) set args ./report_test/ ./build/src/kcov

(gdb) r
Starting program: /usr/local/bin/kcov ./report_test/ ./build/src/kcov
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0xfffff6b621a0 (LWP 9259)]
^C
Thread 1 "kcov" received signal SIGINT, Interrupt.
0x0000fffff7e0b584 in __GI___nanosleep (requested_time=0xffffffffee68, requested_time@entry=0xffffffffee78, remaining=remaining@entry=0x0)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28      ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.

(gdb) bt
#0  0x0000fffff7e0b584 in __GI___nanosleep (requested_time=0xffffffffee68, requested_time@entry=0xffffffffee78,
    remaining=remaining@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x0000aaaaaaaf50a8 in msleep (ms=ms@entry=1) at /root/xin/kcov/src/utils.cc:38
#2  0x0000aaaaaab07788 in Ptrace::clearBreakpoint (addr=0, this=0xaaaaab04aa60) at /root/xin/kcov/src/engines/ptrace.cc:138
#3  Ptrace::continueExecution (this=0xaaaaab04aa60) at /root/xin/kcov/src/engines/ptrace.cc:302
#4  0x0000aaaaaaac6e64 in Collector::run (this=0xaaaaab04ad50, filename=...) at /root/xin/kcov/src/collector.cc:50
#5  0x0000aaaaaaae4778 in runKcov (runningMode=0) at /root/xin/kcov/src/main.cc:320
#6  0x0000aaaaaaac5664 in main (argc=<optimized out>, argv=<optimized out>) at /root/xin/kcov/src/main.cc:578

Also I checked out the line 92 of CMakeList.txt, it seems that aarch64 is not supported? Got really confused...

Therefore I am wondering if you can give some helps or suggestions. Thanks so much!

SimonKagstrom commented 4 years ago

Well, I've never tried it personally on Aarch64, but there is some code for it. Could you run it with some high value passed to --debug=X?

Clearing a breakpoint still sounds like it's started to do something, but the address is perhaps not quite right here.

michael2012z commented 4 years ago

Hi, @SimonKagstrom, @MrXinWang

I tried /usr/local/bin/kcov --debug=31 ./tmp /usr/bin/ls on an arm64 machine. It kept printing following log before I stopped it:

PT continuing 60 with signal 0 PT stopped PID 60 0x0000057f PT BP at 0x0:-1 for 60 Looking for instruction 0. Can't find breakpoint at 0x0 PT continuing 60 with signal 0 PT stopped PID 60 0x0000057f PT BP at 0x0:-1 for 60 Looking for instruction 0. Can't find breakpoint at 0x0 PT continuing 60 with signal 0 PT stopped PID 60 0x0000057f PT BP at 0x0:-1 for 60 Looking for instruction 0. Can't find breakpoint at 0x0

michael2012z commented 4 years ago

After changing aarch_PC to 32, the program progressed, but finally got blocked here: https://github.com/SimonKagstrom/kcov/blob/master/src/engines/ptrace.cc#L163 . The last messages before falling into the endless waiting was:

PT stopped PID 561 0x0000057f PT BP at 0xffffbf06fd78:-1 for 561 Looking for instruction ffffbf06fd78. REPORT hit at 0xffffbf06fd78 PT continuing 561 with signal 0 PT stopped PID 561 0x0000057f PT BP at 0xffffbf06fd80:-1 for 561 Looking for instruction ffffbf06fd80. REPORT hit at 0xffffbf06fd80 PT continuing 561 with signal 0 PT stopped PID 561 0x0000057f PT BP at 0xffffbf06fd88:-1 for 561 Looking for instruction ffffbf06fd88. REPORT hit at 0xffffbf06fd88 PT continuing 561 with signal 0 PT stopped PID 561 0x0000057f PT BP at 0xffffbf06fd90:-1 for 561 Looking for instruction ffffbf06fd90. Can't find breakpoint at 0xffffbf06fd90 PT continuing 561 with signal 0 PT stopped PID 561 0x0000047f PT BP at 0xffffbf06fd94:-1 for 561 Looking for instruction ffffbf06fd94. REPORT hit at 0xffffbf06fd94 PT continuing 561 with signal 0

MrXinWang commented 4 years ago

Agree with @kannaiah and my colleague @michael2012z about the value of aarch64_PC, should be 32 in kcov/src/engines/ptrace_linux.cc @SimonKagstrom

A comparison of debug logs from arm64 and x86_64 after changing aarch64_PC is attached:

Capture

Seems kcov on arm64 is stuck in endless looking for instructions however never progresses at some point, even if the instruction is hit; and the child process never exits (status: 0x0000057f at first ->0x0000047f in the end on arm64, and 0x0000057f on x86_64).

Looking for instruction fffff7a4f298.
REPORT hit at 0xfffff7a4f298
PT continuing 29350 with signal 0
PT stopped PID 29350 0x0000057f
PT BP at 0xfffff7a4f2a0:-1 for 29350
Looking for instruction fffff7a4f2a0.
Can't find breakpoint at 0xfffff7a4f2a0
PT continuing 29350 with signal 0
PT stopped PID 29350 0x0000047f
PT BP at 0xfffff7a4f2a4:-1 for 29350
Looking for instruction fffff7a4f2a4.
REPORT hit at 0xfffff7a4f2a4
PT continuing 29350 with signal 0
PT stopped PID 29350 0x0000047f
PT BP at 0xfffff7a4f2a4:-1 for 29350
Looking for instruction fffff7a4f2a4.
REPORT hit at 0xfffff7a4f2a4
PT continuing 29350 with signal 0
PT stopped PID 29350 0x0000047f
PT BP at 0xfffff7a4f2a4:-1 for 29350
Looking for instruction fffff7a4f2a4.
REPORT hit at 0xfffff7a4f2a4
PT continuing 29350 with signal 0
SimonKagstrom commented 4 years ago

Thanks a lot for the debugging!

Hmmm... Do you have a toy example to run this on (i.e., a main() which just exits or something similar). Does that work?

Also, you could perhaps try to kill the child process, like

ulimit -c unlimited
kill -SEGV <pid-of-child>

and then debug the core dump in gdb to see if the child get's stuck on some destroyed instruction (since kcov overwrites the actual instructions).

MrXinWang commented 4 years ago

Hi @SimonKagstrom ! Thanks so much for the suggestion!

Following what you have suggested, I built a binary as:

#include "stdio.h"

int main(void)
{

}

then using ./build/src/kcov ./tmp ./test_binary/main to generate the core dump file, the kcov log (I added some log-printing for debugging) in the end after executing kill -SEGV 970 (the pid of the child) is:

-------------------------------single step---------------------------
single step-reg before adjust:0xfffff05eab88
----------------pc before breakpoint, pc=0xfffff7a822a4
----------------adjust pc after breakpoint, pc=0xfffff7a822a4
reg before set:0xfffff05eab88
reg after set:0xfffff05eab88
single step-reg after adjust:0xfffff05eab88
------------------------single step
REPORT hit at 0xfffff7a822a4
PT continuing 970 with signal 0
-----------------------after wait
-------------------------------get pc from regs, pc=0xfffff7a822a4
PT stopped PID 970 0x0000047f
PT BP at 0xfffff7a822a4:-1 for 970
Looking for instruction fffff7a822a4.
-------------------------------single step---------------------------
single step-reg before adjust:0xfffff05eab88
----------------pc before breakpoint, pc=0xfffff7a822a4
----------------adjust pc after breakpoint, pc=0xfffff7a822a4
reg before set:0xfffff05eab88
reg after set:0xfffff05eab88
single step-reg after adjust:0xfffff05eab88
------------------------single step
REPORT hit at 0xfffff7a822a4
PT continuing 970 with signal 0
-----------------------after wait
-------------------------------get pc from regs, pc=0xfffff7a822a4
PT stopped PID 970 0x00000b7f
PT signal 11 at 0xfffff7a822a4 for 970
PT continuing 970 with signal 11
-----------------------after wait
-------------------------------get pc from regs, pc=0x0
PT stopped PID 970 0x0000008b
PT terminating signal 11 at 0xfffff7a822a4 for 970
kcov: Process exited with signal 11 (SIGSEGV) at 0xfffff7a822a4
PT continuing 970 with signal 0
PT error for 970: -1

Using gdb ./test_binary/main ./core, it shows:

Reading symbols from ./test_binary/main...done.
[New LWP 6805]
Core was generated by `./test_binary/main'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000fffff7a822a4 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
(gdb) bt
#0  0x0000fffff7a822a4 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

while using static build (gcc -c -g main.c -o main.o, gcc -static main.o -o main) to generate the main binary, kcov was not stuck and no core dump file was generated.

Wondering any methods to see/print the instructions modified by kcov?

SimonKagstrom commented 4 years ago

You can run "disassemble" in GDB to see the instructions. However, it seems to crash in glibc, which is somewhat surprising.

I've been putting off setting up a Aarch64 qemu system, but I guess I should get around to it sometime soon :-)

As for the PC adjustment, it depends on how the architecture handles the breakpoints. x86 steps the instruction pointer, so needs adjustment to continue on the instruction where the BP was set (after clearing). At least PPC and ARM does not do this, I'm not sure bout Aarch64.

Also worth checking is if the breakpoint setup replacement (arch_setupBreakpoint, which replaces the instruction with 0xd4200000) is correct. On ARM, breakpoints are really illegal instructions, so kcov uses SIGILL to detect hits. Maybe that's not correct on Aarch64?

MrXinWang commented 4 years ago

Hi @SimonKagstrom ! I think somehow me and my colleague found a method to fix this (for now we don't think the setup breakpoints function will setup all correct breakpoints on arm64, which lead to this strange behaviour that some of the instructions in the beginning will be hit correctly while later some instructions will not be hit). We did a very brief test on our arm64 machine and it works.

However we probably need some time to propose a PR as contributing will need some legal approval.

Thanks so much about the kindly instruction and help with the debugging!

SimonKagstrom commented 4 years ago

Woho! Great work, both of you! Looking forward to a pull request!

Don't thank me, I didn't do any of this work :-)

michael2012z commented 4 years ago

Hi, @SimonKagstrom , do you mind adding 2 files in this repo: LICENSE, CONTRIBUTING.md? Usually open source projects provide them for license affair. With the files, more people will be able to contribute, we also need them to unblock legal restriction. You can find an example here: https://github.com/bazelbuild/bazel

SimonKagstrom commented 4 years ago

@michael2012z the license is present in the COPYING file in the root directory, which was sort of standard when the kcov project started. I don't know enough legal matters to really know how to write a CONTRIBUTING.md file though, perhaps someone else here has a better knowledge about it?

michael2012z commented 4 years ago

Yes, COPYING contains the text for LICENSE. 👍 But still we'd better have some text talking about contributing. Ideally it's a CONTRIBUTING.md file.

MrXinWang commented 4 years ago

Hi @SimonKagstrom ! We are really sorry to make this PR-proposing process complex. It is just because as employees, we are not allowed to (freely) post code changes unless we have the legal approval from the opensource project (probably the company as well, I really dont know....), which probably would involve some other department/colleagues from the company and we need to satisfy their requests......

As far as I am concerned, if we need to apply a legal approval, we need both inbound and outbound license of the project, which are namely the LICENSE/COPYING file and the CONTRIBUTING.md file.

For this project I think any kind of CONTRIBUTING.md file is fine? I think you can easily find a proper template at: https://gist.github.com/PurpleBooth/b24679402957c63ec426 or https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62

Again really sorry for the inconvenience.....we dont want to make things complicated too....

SimonKagstrom commented 4 years ago

I pushed a CONTRIBUTING.md file as well now.