elfmaster / ecfs

extended core file snapshot format

Frame and other information clobbered in ecfs files on specific crashes #13

Open mothran opened 9 years ago

mothran commented 9 years ago

I was testing a few different possible crashes and I found a very interesting edge case:

#include <stdio.h>

int main(void) {
    asm ("call 0x41414141");
    return 0;
}

compile:

clang -O0 calladdr.c -o calladdr

Running the binary with the ECFS x64 collector enabled produces an ecfs file. After flipping its ELF type to CORE and opening it in gdb, I get:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000041414141 in ?? ()
(gdb) bt
#0  0x0000000041414141 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) i frame
Stack level 0, frame at 0x7ffcabbc6d40:
 rip = 0x41414141; saved rip = 0x0
 called by frame at 0x7ffcabbc6d48
 Arglist at 0x7ffcabbc6d30, args: 
 Locals at 0x7ffcabbc6d30, Previous frame's sp is 0x7ffcabbc6d40
 Saved registers:
  rip at 0x7ffcabbc6d38
(gdb) x/16x $rsp
0x7ffcabbc6d38: 0x00000000  0x00000000  0x00000000  0x00000000
0x7ffcabbc6d48: 0x00000000  0x00000000  0x00000000  0x00000000
0x7ffcabbc6d58: 0x00000000  0x00000000  0x00000000  0x00000000
0x7ffcabbc6d68: 0x00000000  0x00000000  0x00000000  0x00000000
(gdb) x/16x $rbp
0x7ffcabbc6d40: 0x00000000  0x00000000  0x00000000  0x00000000
0x7ffcabbc6d50: 0x00000000  0x00000000  0x00000000  0x00000000
0x7ffcabbc6d60: 0x00000000  0x00000000  0x00000000  0x00000000
0x7ffcabbc6d70: 0x00000000  0x00000000  0x00000000  0x00000000
(gdb) i threads 
  Id   Target Id         Frame 
* 1    LWP 12927         0x0000000041414141 in ?? ()
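
The "flipping the core type" step above appears to amount to rewriting the `e_type` field of the ELF header so that gdb treats the file as a core. A minimal sketch of that idea follows; it works on an in-memory copy of the file, assumes a 64-bit ELF in host byte order, and `set_elf64_type` is a hypothetical name used for illustration, not the ECFS tool's actual code.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the ECFS sources): overwrite the
 * e_type field of a 64-bit ELF image held in memory.
 * Assumes the image uses the host's byte order.
 * Returns 0 on success, -1 if the buffer is not a 64-bit ELF image.
 */
int set_elf64_type(unsigned char *image, size_t len, Elf64_Half new_type)
{
    Elf64_Ehdr ehdr;

    assert(image != NULL);
    if (len < sizeof(ehdr))
        return -1;

    /* Copy the header out instead of casting, to avoid unaligned access. */
    memcpy(&ehdr, image, sizeof(ehdr));
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
        ehdr.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    ehdr.e_type = new_type;   /* e.g. ET_CORE so a debugger accepts it */
    memcpy(image, &ehdr, sizeof(ehdr));
    return 0;
}
```

A caller would read the file into a buffer (or `mmap` it writable), call `set_elf64_type(buf, size, ET_CORE)`, and write the header back.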

At first I thought this was just a very messed up crash. But then I enabled the regular core pattern:

echo "/tmp/core.%p" > /proc/sys/kernel/core_pattern

then opened the core file (weirdly, this standard core file was marked NONE, so I had to use et_flip):

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000041414141 in ?? ()
(gdb) bt
#0  0x0000000041414141 in ?? ()
#1  0x00000000004004b0 in ?? ()
#2  0x00000000004004c0 in ?? ()
#3  0x00007f6f383f0610 in ?? ()
#4  0x00007ffffea22ad8 in ?? ()
#5  0x00007ffffea22ad8 in ?? ()
#6  0x0000000100000000 in ?? ()
#7  0x00000000004004a0 in ?? ()
#8  0x0000000000000000 in ?? ()
(gdb) i frame
Stack level 0, frame at 0x7ffffea229f0:
 rip = 0x41414141; saved rip = 0x4004b0
 called by frame at 0x7ffffea229f8
 Arglist at 0x7ffffea229e0, args: 
 Locals at 0x7ffffea229e0, Previous frame's sp is 0x7ffffea229f0
 Saved registers:
  rip at 0x7ffffea229e8
(gdb) x/16x $rsp
0x7ffffea229e8: 0x004004b0  0x00000000  0x004004c0  0x00000000
0x7ffffea229f8: 0x383f0610  0x00007f6f  0xfea22ad8  0x00007fff
0x7ffffea22a08: 0xfea22ad8  0x00007fff  0x00000000  0x00000001
0x7ffffea22a18: 0x004004a0  0x00000000  0x00000000  0x00000000
(gdb) x/16x $rbp
0x7ffffea229f0: 0x004004c0  0x00000000  0x383f0610  0x00007f6f
0x7ffffea22a00: 0xfea22ad8  0x00007fff  0xfea22ad8  0x00007fff
0x7ffffea22a10: 0x00000000  0x00000001  0x004004a0  0x00000000
0x7ffffea22a20: 0x00000000  0x00000000  0x366da9f0  0xf9a9fb79
(gdb) i threads 
  Id   Target Id         Frame 
* 1    LWP 17854         0x0000000041414141 in ?? ()

So it appears that in the ecfs file the registers contain addresses that point to null (zeroed) memory, which means I am unable to unwind the stack in this case. Thanks.
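
One way to confirm that the zeros come from the file itself rather than from gdb's interpretation is to look up the stack address in the `PT_LOAD` program headers and inspect the bytes stored at that file offset. The sketch below does this on an in-memory copy of the file. `core_range_is_zero` is a hypothetical helper, and it assumes a 64-bit, host-endian ELF with fewer than `PN_XNUM` program headers.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: report whether the n bytes at virtual address
 * vaddr are all zero in the file-backed part of a PT_LOAD segment.
 * Returns 1 if all bytes are zero, 0 if any byte is nonzero, and -1 if
 * the image is malformed or the range is not fully present in the file.
 */
int core_range_is_zero(const unsigned char *image, size_t len,
                       Elf64_Addr vaddr, size_t n)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    uint64_t off, delta, start;
    size_t i, j;

    assert(image != NULL);
    if (len < sizeof(eh))
        return -1;
    memcpy(&eh, image, sizeof(eh));
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(ph) || eh.e_phoff > len)
        return -1;

    for (i = 0; i < eh.e_phnum; i++) {
        off = eh.e_phoff + (uint64_t)i * sizeof(ph);
        if (off > len || len - off < sizeof(ph))
            return -1;                      /* truncated header table */
        memcpy(&ph, image + off, sizeof(ph));

        if (ph.p_type != PT_LOAD || vaddr < ph.p_vaddr)
            continue;
        delta = vaddr - ph.p_vaddr;
        if (delta >= ph.p_filesz)
            continue;                       /* address not in this segment */
        if (n > ph.p_filesz - delta)
            return -1;                      /* range runs past file-backed data */
        if (ph.p_offset > len || delta > len - ph.p_offset)
            return -1;
        start = ph.p_offset + delta;
        if (n > len - start)
            return -1;                      /* segment data truncated */

        for (j = 0; j < n; j++)
            if (image[start + j] != 0)
                return 0;
        return 1;
    }
    return -1;                              /* no segment covers vaddr */
}
```

For the two sessions above, one would call it with the `$rsp` value gdb reports (for example `0x7ffcabbc6d38` for the ecfs file) and a length such as 64, then compare the result against the regular core file.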

elfmaster commented 9 years ago

Very nice find. This is exactly the type of bug I am expecting to find in ECFS. I have suspicions about the root cause actually, and I will test as soon as I can.

As for the non-ecfs core file being marked as ET_NONE: fucking bizarre. Did you run 'readelf -S' on it to see if it had section headers? If so, then it WAS still an ECFS file.

Talk to ya soon man

mothran commented 9 years ago

% readelf -S ./core.17854 

There are no sections in this file.

I am damn sure it's just a core file, but yeah, weird. Even weirder, I just reran it and the type was back to CORE. Must have been something I missed.
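
The `readelf -S` check discussed above can also be done programmatically by looking at the section header fields of the ELF header. The sketch below is a simplified version of that test: `elf64_has_section_headers` is a hypothetical name, it assumes a 64-bit host-endian ELF, and it ignores the extended section numbering used by files with very large section counts.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a 64-bit ELF image advertises a
 * section header table. Per the discussion above, a regular core file
 * is expected to have none, while an ECFS file is expected to have one.
 * Returns 1 if section headers are advertised, 0 if not, -1 on a
 * buffer that is not a 64-bit ELF image.
 */
int elf64_has_section_headers(const unsigned char *image, size_t len)
{
    Elf64_Ehdr ehdr;

    assert(image != NULL);
    if (len < sizeof(ehdr))
        return -1;

    /* Copy the header out instead of casting, to avoid unaligned access. */
    memcpy(&ehdr, image, sizeof(ehdr));
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
        ehdr.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    /* Both a table offset and a nonzero entry count are required. */
    return (ehdr.e_shoff != 0 && ehdr.e_shnum != 0) ? 1 : 0;
}
```

Combined with a look at `e_type`, this gives a quick way to tell which kind of file a given dump is without relying on its name.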

elfmaster commented 9 years ago

I just verified this in some of my own tests: there are inconsistencies in the stack, and possibly in other segments, as they are being written to disk. This happens in a particularly complicated part of the code base, but based on some previous issues I was having, I think I have some idea of what's happening. This may be a major pain in the ass, but worst case I should be able to get it fixed within a week. I'm doing some debugging and testing.