SimonKagstrom / kcov

Code coverage tool for compiled programs, Python and Bash which uses debugging information to collect and report data without special compilation options
http://simonkagstrom.github.io/kcov/
GNU General Public License v2.0
710 stars 109 forks source link

Clear BFD disassembler state between ELF files #392

Closed bossmc closed 1 year ago

bossmc commented 1 year ago

At a high-level, this makes the BFD disassembler forget all state from any previous ELF file that it's parsed when it's setup with a new one. This prevents:

  1. valid addresses from a previous ELF file being spuriously treated as valid in the new one (hopefully doesn't need any more detail)
  2. (in)valid addresses from the new ELF file being "hidden" by section layouts from previous ELFs preventing the addresses from being checked at all (see below for an example)

This "forget all previous state" doesn't feel very user-friendly (it means the caller has to do everything it wants to with one ELF before moving on to the next one) but it's consistent with how the elf-parser code works.

Background

I was hitting https://github.com/SimonKagstrom/kcov/issues/102 when covering some (very simple) rust code on Ubuntu Bionic. The bad instruction was occurring in exactly the place mentioned in that issue, in __pthread_enable_asynccancel from libpthread.so, where the ASM source code looks like:

  54         lock
  55         cmpxchgl %r11d, %fs:CANCELHANDLING
  56         jnz     2b

Note that the lock prefix is on a separate line to the cmpxchgl (even though they're two parts of the same operation) and gcc and clang both helpfully insert separate debug info for those two lines, one pointing at the lock prefix opcode and the other pointing at the rest of the instruction.

The BFD disassembler can correctly see that the cmpxchgl is not a valid location to place an int3 since it's not the start of an opcode, so normally this debug address is rejected as an invalid address (at least with --verify). Despite this I was seeing the cmpxchg being replaced with int3 leading to SIG_ILLs.

What was happening was that a previously inspected SO (possibly the original binary?) had included a section that started after libpthread's .text section, but ended before the offset of the code related to __pthread_enable_asynccancel. When the BFD parser was asked if the cmpxchg address was valid, it attempted to find the section that contained it, but the section searching code assumes (sensibly) that sections can't overlap, so it determined that the only section that could contain the address was the one left over from the earlier ELF. This section didn't extend far enough to cover the address so the BFD parser determined that the searched for address wasn't in a known section, and hence reported it as "valid" (for safety I guess?). In another universe the address might have been found in a section but might have been marked as valid since a previous ELF had a valid instruction starting at that offset which would have led to the same end behaviour.

This seems to be incredibly hard to reproduce minimally (maybe by writing a custom linker script to lay out the sections manually?) and I imagine that the next release of libc/libpthread/rustc/ld/anything would have re-hidden the problem. But better to fix it once and for all now that I've dug into it.

Additional Changes

I also noticed and fixed up two other issues in the BFD parser code:

  1. The list of branch instructions was missing some commas, which is annoyingly valid C/C++ code but meant the basic block detection code would ignore jczx, jezx and loop as branching instructions (instead looking for a jczxjezxloop opcode)
  2. The section lookup code would fail if the address searched for was located in the section with the highest baseAddress (since upper_bound would return end() and the old code would bail out there)
bossmc commented 1 year ago

I also considered (and in fact, mostly implemented, before realising this simpler change) making the BFD parser track sections at their loaded addresses rather than at their paddr offsets. Since the loader will never [citation needed] overlap sections this would allow the parser to load each ELF in any order and to validate addresses/look up basic-blocks from any ELF in any order too without risk of bleed between ELF images.

This change is more complex (not much more but some), and is very dependent on the solib shim being available and in use (otherwise we can't work out the load addresses). It also requires the solib shim reporting more than the relocation base of the main binary (it has to report each segment's load address) which looks like it might be awkward for e.g. static binaries.

If you're interested I can tidy up the prototype and share it with you?

SimonKagstrom commented 1 year ago

Wow, awsome work!

As for the alternative implementation, I think I prefer this one. I regularly use --skip-solibs to avoid the LD_PRELOAD:ed solib parser, as it doesn't work together with ASAN.

I saw that there is a missing include in bfd-disassembler.cc:

/home/runner/work/kcov/kcov/src/parsers/bfd-disassembler.cc: In member function ‘BfdDisassembler::Section* BfdDisassembler::lookupSection(uint64_t)’: /home/runner/work/kcov/kcov/src/parsers/bfd-disassembler.cc:328:36: error: ‘logic_error’ is not a member of ‘std’ 328 | throw std::logic_error("Section search invariant broken");

bossmc commented 1 year ago

Correct, I'd been working on the fix in a docker container (where I had a binary that was reproducing this) and, when I copied the changes out to my git workspace to create this PR, I didn't notice that cmake had failed to find libbfd and thus didn't compile in the bfd-disassembler... I've added the missing header now.

I think the alternative implementation would work with --skip-solibs since it would load the main binary as if it were loaded at address 0x0, but since there are no additional SOs to load, there would be no risk of overlapping sections. OTOH if you're happy with this fix I'm fine not to bring in the additional complexity.

codecov[bot] commented 1 year ago

Codecov Report

Base: 63.97% // Head: 63.97% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (0b8d733) compared to base (59bd329). Patch coverage: 57.14% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #392 +/- ## ======================================= Coverage 63.97% 63.97% ======================================= Files 58 58 Lines 4627 4633 +6 Branches 4283 4289 +6 ======================================= + Hits 2960 2964 +4 - Misses 1667 1669 +2 ``` | [Impacted Files](https://codecov.io/gh/SimonKagstrom/kcov/pull/392?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None) | Coverage Δ | | |---|---|---| | [src/parsers/bfd-disassembler.cc](https://codecov.io/gh/SimonKagstrom/kcov/pull/392?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c3JjL3BhcnNlcnMvYmZkLWRpc2Fzc2VtYmxlci5jYw==) | `20.56% <57.14%> (+2.04%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

SimonKagstrom commented 1 year ago

Thanks a lot for this fix! Very good work and I appreciate the detailed analysis of the problem and your solution!

bossmc commented 1 year ago

No worries, thanks for kcov! Are you planning to cut a new release soon? It would be nice to be able to use an official release if we can.

SimonKagstrom commented 1 year ago

I'll make a release with this feature shortly then!