[CRASH] Dynamorio: received signal SIGSEGV while using wrappers.

AAANovikov commented 3 years ago

Describe the bug Dynamorio crashes in function wrapper.

To Reproduce Launch dynamorio client with wrappers on example which uses library to process pictures. Dynamorio successfully run without any client.

Expected behavior Successful run dynamorio with wrapper.

Screenshots or Pasted Text Launch Dynamorio with dgb and stop on SIGSEGV:

(gdb) bt
#0  0x00007f527840cae7 in drwrap_after_callee_func (drcontext=0x7f50789c4080, pt=0x7f52787f74c8, mc=0x7f50791dbd90, level=0, retaddr=0x0, unwind=1 '\001', only_requested_unwind=0 '\000')
    at /home/novikov/Student/git/symbolic_executor/dynamorio/ext/drwrap/drwrap.c:2089
#1  0x00007f527840deee in drwrap_in_callee_check_unwind (drcontext=0x7f50789c4080, pt=0x7f52787f74c8, mc=0x7f50791dbd90) at /home/novikov/Student/git/symbolic_executor/dynamorio/ext/drwrap/drwrap.c:2644
#2  0x00007f527840bfbe in drwrap_in_callee (arg1=0x7f52ba1e90e0, xsp=140721126304792) at /home/novikov/Student/git/symbolic_executor/dynamorio/ext/drwrap/drwrap.c:1902
#3  0x00007f52789b8b05 in ?? ()
#4  0x0000000000000008 in ?? ()
#5  0x0000000000000008 in ?? ()
#6  0x00007f5077c0060e in ?? ()
#7  0x00007ffc30beec18 in ?? ()
#8  0x0000000000000008 in ?? ()
#9  0x00007f5077c00600 in ?? ()
#10 0x00000000000001b9 in ?? ()
#11 0x0000000000000000 in ?? ()

when print pt->user_data_pre_cb[level]:

(gdb) p pt->user_data_pre_cb[level]
$2 = (void **) 0x0

Versions Dynamorio version: master, Dec 3 2020. Operating system version: Ubuntu 18.04.5 LTS

AssadHashmi commented 3 years ago

To Reproduce Launch dynamorio client with wrappers on example which uses library to process pictures.

Are you using the sample client wrap.c or your own client which uses drwrap extension? If your own client, can you run your application with the sample wrap client to check that works?

If the sample wrap client fails with your application, please paste the full command line used.

If the sample wrap client passes and your drwrap client fails, please provide more details about your client.

AAANovikov commented 3 years ago

Launching ./dynamorio/bin64/drrun -c ./dynamorio/samples/bin64/libwrap.so:

>
Client wrap is running
<wrapped malloc @0x00007f52f998e4e0
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<(1+x) Handling our fault in a TRY at 0x00007f52f9e40093>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
    os says: 0x00007f50b5955000-0x00007f50b59d8000 prot=0x00000003
    cache says: 0x00007f50b5955000-0x00007f50b5998000 prot=0x00000003
>
program: error while loading shared libraries: libgdcmDSED.so.2.8: cannot create shared object descriptor
<Stopping application program (24091)>
<Largest malloc request: 1425>
<OOM simulations: 1>

AssadHashmi commented 3 years ago

That looks like a failure to load the GDCM imaging library which I presume is something which your imaging application links to?

./dynamorio/bin64/drrun -c ./dynamorio/samples/bin64/libwrap.so

That line doesn't look complete. Is the application command after .../libwrap.so a binary or a shell script which calls the binary?

Try setting your LD_LIBRARY_PATH to the directory which contains libgdcmDSED.so.2.8, e.g. in bash shell: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/dir

AAANovikov commented 3 years ago

After setting LD_LIBRARY_PATH, there other missing libraries appear. Program compilation:

g++ -pthread opencv_blur.c -o opencv_blur2 `pkg-config opencv --cflags --libs`

AssadHashmi commented 3 years ago

After setting LD_LIBRARY_PATH, there other missing libraries appear.

Can you set LD_LIBRARY_PATH with the directories of all the missing libraries? Only when the runtime link phase is successful (i.e. all libraries are loaded) can we get any further with debugging. It's possible that the original failure is caused by these library loading errors.

SweetVishnya commented 3 years ago

Hi!

I am trying to create a MWE. I have a multi-threaded hello world application (attached). And run it under drrun, it works:

bin64/drrun -- ./multithread_sum64 1 2 3 4 5 6 7 8 9 8 8 8 8 8 8 8 8 8 8 8

START TID in program tid: 8110
START TID in program tid: 8109
START TID in program tid: 8108
START TID in program tid: 8111
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
FINISH TID in program tid: 8109
Something
Something
Something
Something
FINISH TID in program tid: 8108
Something
FINISH TID in program tid: 8111
FINISH TID in program tid: 8110

Under wrap.c it leads to error:

bin64/drrun -c samples/bin64/libwrap.so -- ./multithread_sum64 1 2 3 4 5 6 7 8 9 8 8 8 8 8 8 8 8 8 8 8

Client wrap is running
<wrapped malloc @0x00007f8fead816b0
Inconsistency detected by ld.so: rtld.c: 1110: dl_main: Assertion `main_map != NULL' failed!
<Largest malloc request: 1425>
<OOM simulations: 1>

Under debug this works:

bin64/drrun -c samples/bin64/libwrap.so -- ./multithread_sum64 1 2 3 4 5 6 7 8 9 8 8 8 8 8 8 8 8 8 8 8

<(1+x) Handling our fault in a TRY at 0x00007f6aba65d093>
Client wrap is running
<wrapped malloc @0x00007f6aba1a96b0
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<wrapped malloc @0x00007f68b95c7140
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
    os says: 0x00007f68af529000-0x00007f68b7529000 prot=0x00000000
    cache says: 0x00007f68af529000-0x00007f68b752a000 prot=0x00000000
>
START TID in program tid: 14040
START TID in program tid: 14043
START TID in program tid: 14041
START TID in program tid: 14042
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
FINISH TID in program tid: 14041
FINISH TID in program tid: 14040
Something
FINISH TID in program tid: 14043
FINISH TID in program tid: 14042
<Stopping application /old/fwork/sydr/build-debug/tests/synthetic/bin64/multithread_sum64 (14039)>
<Largest malloc request: 4232>
<OOM simulations: 0>

multithread_sum.tar.gz

derekbruening commented 3 years ago

Inconsistency detected by ld.so: rtld.c: 1110: dl_main: Assertion `main_map != NULL' failed!

Did you look at what the wrap sample does? It deliberately fails a heap allocation. So this result is exactly what you would expect! I don't see any surprises here.

SweetVishnya commented 3 years ago

But the debug output differs from release output.

SweetVishnya commented 3 years ago

The thing is that I am trying to reproduce this error on wrap sample. We get this error in our tool when we wrap free function and run multithread_sum64 under our client.

(gdb) p pt->user_data_pre_cb[level]
$2 = (void **) 0x0

derekbruening commented 3 years ago

But the debug output differs from release output.

Again, look at the wrap.c code: it randomly fails an allocation, and as you can see, it says "OOM simulations: 0" in that particular run: so clearly the random condition never triggered. Also, this is an issue about a particular crash with drwrap, which this, even if a bug, does not match: so if it were a bug please file a separate issue.

derekbruening commented 3 years ago

The thing is that I am trying to reproduce this error on wrap sample. We get this error in our tool when we wrap free function and run multithread_sum64 under our client.
(gdb) p pt->user_data_pre_cb[level]
$2 = (void **) 0x0

I see, did not realize you were the OP. I would say, probably not good to use the wrap sample out of the box as a comparison point, due to its OOM injection (maybe there should be a sample that only observes; or it could be tweaked locally).

That code in drwrap is handling a frame skip from longjmp or SEH unwind or somesuch. I'm wondering:

What caused the frame skip?
What is the value of level?

SweetVishnya commented 3 years ago

The value of level is zero.

AAANovikov commented 3 years ago

Setting LD_LIBRARY_PATH for all dirs from ldd doesn't help. It also load shared libraries with errors. On some launches, when libraries are loaded successfully, it still get SIGSEGV:

Client wrap is running
<wrapped malloc @0x00007f21eb1bf4e0
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<(1+x) Handling our fault in a TRY at 0x00007f21eb671093>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
    os says: 0x00007f1fa7186000-0x00007f1fa7209000 prot=0x00000003
    cache says: 0x00007f1fa7186000-0x00007f1fa71c9000 prot=0x00000003
>
<wrapped malloc @0x00007f21e94a40e0
<Application opencv_blur (25131).  DynamoRIO Sample Client 'wrap' internal crash at PC 0x00007f21a741a4e0.  Please report this at http://dynamorio.org/issues.  Program aborted.
Received SIGSEGV at pc 0x00007f21a741a4e0 in thread 25131
Base: 0x00007f21eb3cf000
Registers:eax=0x0000000000000000 ebx=0x0000000000000008 ecx=0x0000000000000000 edx=0x00007f21a71d2fa0
    esi=0x00007f21e94a40e0 edi=0x00007f21a76204e0 esp=0x00007f1fa727bc00 ebp=0x00007f1fa727bcf0
    r8 =0x0000000000000002 r9 =0x0000000000000000 r10=0x00007f21e73cf010 r11=0x0000000000000000
    r12=0x00007ffcc250d960 r13=0x00000000000001b9 r14=0x00007f1f92b4f4e0 r15=0x0000000000000003
    eflags=0x0000000000010246
version 8.0.18600, custom build

It looks like the problem is not with loading shared libraries with errors.

Compiling file:

g++ -pthread opencv_blur.c -o opencv_blur -lopencv_imgcodecs -lopencv_imgproc -lopencv_core

SweetVishnya commented 3 years ago

I reproduced an error with the attached client:

bin64/drrun -c samples/bin64/libwrap.so -- ./multithread_sum64 1 2 3 4 5 6 7 8 9 8 8 8 8 8 8 8 8 8 8 8

Client wrap is running
<wrapped malloc @0x00007f62667996b0
<wrapped malloc @0x00007f6065bb7140
START TID in program tid: 2833
START TID in program tid: 2834
START TID in program tid: 2836
START TID in program tid: 2835
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
Something
FINISH TID in program tid: 2834
Something
FINISH TID in program tid: 2836
Something
FINISH TID in program tid: 2835
FINISH TID in program tid: 2833
<Application /old/fwork/sydr/build/tests/synthetic/bin64/multithread_sum64 (2832).  DynamoRIO Sample Client 'wrap' internal crash at PC 0x00000000740035d3.  Please report this at http://dynamorio.org/issues.  Program aborted.
Received SIGSEGV at pc 0x00000000740035d3 in thread 2836
Base: 0x00007f62669ab000
Registers:eax=0x0000000000000000 ebx=0x00000000471964a0 ecx=0x0000000000000001 edx=0x0000000000000000
    esi=0x000000000000003f edi=0x0000000000000000 esp=0x00007f606693c370 ebp=0x0000000000000001
    r8 =0x0000000000000000 r9 =0x0000000072000d20 r10=0x00007f6050000000 r11=0x0000000000000000
    r12=0x0000000000000000 r13=0x0000000000000000 r14=0x00000000470e8cb8 r15=0x0000000000000000
    eflags=0x0000000000010202
version 8.0.18600, custom build
-no_dynamic_options -client_lib '/old/fwork/sydr/build/dynamorio/samples/bin64/libwrap.so;0;' -client_lib64 '/old/fwork/sydr/build/dynamorio/samples/bin64/libwrap.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_>

wrap.tar.gz

SweetVishnya commented 3 years ago

Backtrace:

#0  0x00000000740074e0 in drwrap_after_callee_func (drcontext=0x7fc517961e40, pt=0x44c7e4b0, 
    mc=0x7fc51794dd90, level=0, retaddr=0x0, unwind=1 '\001', only_requested_unwind=0 '\000')
    at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2089
#1  0x00000000740088e7 in drwrap_in_callee_check_unwind (drcontext=0x7fc517961e40, 
    pt=0x44c7e4b0, mc=0x7fc51794dd90) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2644
#2  0x00000000740069b7 in drwrap_in_callee (arg1=0x7fc516be8a30, xsp=140484449447640)
    at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:1902
#3  0x0000000044c4e33d in ?? ()
#4  0x0000000000000000 in ?? ()

derekbruening commented 3 years ago

An application backtrace would be good, or it might require a little app history: the same question as above: what caused the frame skip (to trip the stack pointer check leading to calling drwrap_after_callee_func from drwrap_in_callee_check_unwind)? Or maybe this would be enough: what are pt->wrap_level, pt->app_esp[pt->wrap_level], and mc->xsp inside drwrap_in_callee_check_unwind? Although if drwrap_after_callee_func was already called in that loop we'd want to figure out what pt->wrap_level started as and how many iters it went through the loop.

SweetVishnya commented 3 years ago

(gdb) up
#1  0x00000000740088e7 in drwrap_in_callee_check_unwind (drcontext=0x7f642a069e00, pt=0x4d6414b0, mc=0x7f642a08fd90) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2644
(gdb) p/x pt->wrap_level
$4 = 0x0
(gdb) p/x pt->app_esp[pt->wrap_level]
$5 = 0x7f6428262ea8
(gdb) p/x mc->xsp
$6 = 0x7f6428262ed8
(gdb) bt
#0  0x00000000740074e0 in drwrap_after_callee_func (drcontext=0x7f642a069e00, pt=0x4d6414b0, mc=0x7f642a08fd90, level=0, retaddr=0x0, unwind=1 '\001', only_requested_unwind=0 '\000')
    at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2089
#1  0x00000000740088e7 in drwrap_in_callee_check_unwind (drcontext=0x7f642a069e00, pt=0x4d6414b0, mc=0x7f642a08fd90) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2644
#2  0x00000000740069b7 in drwrap_in_callee (arg1=0x7f6429300a30, xsp=140068147048152) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:1902
#3  0x000000004d60133d in ?? ()
#4  0x0000000000000000 in ?? ()

SweetVishnya commented 3 years ago

Thread 1 "multithread_sum" received signal SIGUSR2, User defined signal 2.
0x000000004e9bfc42 in ?? ()
(gdb) c
Continuing.

Thread 2 "multithread_sum" received signal SIGUSR2, User defined signal 2.
[Switching to LWP 25069]
syscall_ready () at /old/fwork/sydr/dynamorio/core/drlibc/drlibc_x86.asm:184
184         pop      REG_XBX
(gdb) c
Continuing.

Thread 4 "multithread_sum" received signal SIGUSR2, User defined signal 2.
[Switching to LWP 25071]
0x00007f22de166747 in encode_operand (di=0x7f20dde88480, optype=12, opsize=16 '\020', opnd=...) at /old/fwork/sydr/dynamorio/core/ir/x86/encode.c:2307
2307                CLIENT_ASSERT(di->seg_override >= REG_START_SEGMENT &&
(gdb) c
Continuing.

Thread 5 "multithread_sum" received signal SIGUSR2, User defined signal 2.
[Switching to LWP 25072]
syscall_ready () at /old/fwork/sydr/dynamorio/core/drlibc/drlibc_x86.asm:184
184         pop      REG_XBX
(gdb) c
Continuing.

Thread 2 "multithread_sum" received signal SIGUSR2, User defined signal 2.
[Switching to LWP 25069]
syscall_ready () at /old/fwork/sydr/dynamorio/core/drlibc/drlibc_x86.asm:184
184         pop      REG_XBX
(gdb) 
Continuing.

Thread 4 "multithread_sum" received signal SIGUSR2, User defined signal 2.
[Switching to LWP 25071]
syscall_ready () at /old/fwork/sydr/dynamorio/core/drlibc/drlibc_x86.asm:184
184         pop      REG_XBX
(gdb) 
Continuing.

Thread 5 "multithread_sum" received signal SIGUSR2, User defined signal 2.
[Switching to LWP 25072]
syscall_ready () at /old/fwork/sydr/dynamorio/core/drlibc/drlibc_x86.asm:184
184         pop      REG_XBX
(gdb) 
Continuing.

Thread 3 "multithread_sum" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 25070]
0x00000000740074e0 in drwrap_after_callee_func (drcontext=0x7f20ddd2e000, pt=0x4ea584b0, mc=0x7f20dde63d90, level=0, retaddr=0x0, unwind=1 '\001', only_requested_unwind=0 '\000')
    at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2089
2089                    if (wrap->pre_cb == pt->user_data_pre_cb[level][idx] &&

SweetVishnya commented 3 years ago

Complete logs

multithread_sum64.25218.00000000.tar.gz

derekbruening commented 3 years ago

(gdb) p/x pt->wrap_level
$4 = 0x0
(gdb) p/x pt->app_esp[pt->wrap_level]
$5 = 0x7f6428262ea8
(gdb) p/x mc->xsp
$6 = 0x7f6428262ed8

pt->wrap_level is modified in the loop though: but likely this is the first time through?

Repeating the prior questions about the skip: so it enters a wrapped function, and before it hits the post-cb for that function, it enters another wrapped function, yet the stack pointer has already unrolled: so it missed the post-cb. Was it longjmp?

What is pt->user_data_count[level]? How is it possible for the count to be set, but the pt->user_data_pre_cb[level] to not be allocated? The drwrap_free_user_data() doesn't set count to 0: I would add a set to 0 there, but that may just mask a wrap pairing/rollback bug. I would try to enumerate the precise sequence of pre and post events and app events.

SweetVishnya commented 3 years ago

I start my vacation tomorrow. I may be able to return to it in January. Can we make this issue "help wanted" for now?

derekbruening commented 3 years ago

Sure, thanks for the help so far.

bx commented 3 years ago

I have also been able to trigger this bug (segfault at drwrap.c:2089 from trying to dereference a NULL pointer) while/after wrapping munmap. I haven't isolated the relevant client code, I have a few observations that are hopefully helpful:

the instrumented/wrapped application is single threaded
the bug is triggered when using cronbuild-8.0.18649 but is NOT triggered when using release_7.1.0-1

derekbruening commented 3 years ago

the bug is triggered when using cronbuild-8.0.18649 but is NOT triggered when using release_7.1.0-1

So it's a regression? Would it be possible to do a binary search or git bisect to zero in on exactly what change triggers it?

SweetVishnya commented 3 years ago

I reproduced the same bug with a single-threaded openssl binary:

untar the attached archive (logs included).
Copy wrap.c to api/samples/wrap.c. The provided client wraps memcpy function.
Build DynamoRIO.

Run wrap client inside untared directory:

cd openssl-wrap-memcpy-crash
/dynamorio/bin64/drrun -loglevel 3 -c /dynamorio/samples/bin64/libwrap.so -- ./openssl openssl_input

The backtrace is the same as above.

(gdb) p pt->user_data_pre_cb[level]
$2 = (void **) 0x0

https://nextcloud.ispras.ru/s/AgPjRPwEWGi5rkB

SweetVishnya commented 3 years ago

We performed a bisection. this commit breaks it.

derekbruening commented 3 years ago

Looking at the code, the most likely problem is the move of pt->wrap_level++ above drwrap_ensure_postcall? And the only problem there is if the latter does a flush and does not return and the redirected control comes back in and does the pt->wrap_level++ again and so it gets too high?

SweetVishnya commented 3 years ago

(gdb) up
#1  0x00000000740088e7 in drwrap_in_callee_check_unwind (drcontext=0x7f642a069e00, pt=0x4d6414b0, mc=0x7f642a08fd90) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2644
(gdb) p/x pt->wrap_level
$4 = 0x0
(gdb) p/x pt->app_esp[pt->wrap_level]
$5 = 0x7f6428262ea8
(gdb) p/x mc->xsp
$6 = 0x7f6428262ed8
(gdb) bt
#0  0x00000000740074e0 in drwrap_after_callee_func (drcontext=0x7f642a069e00, pt=0x4d6414b0, mc=0x7f642a08fd90, level=0, retaddr=0x0, unwind=1 '\001', only_requested_unwind=0 '\000')
    at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2089
#1  0x00000000740088e7 in drwrap_in_callee_check_unwind (drcontext=0x7f642a069e00, pt=0x4d6414b0, mc=0x7f642a08fd90) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:2644
#2  0x00000000740069b7 in drwrap_in_callee (arg1=0x7f6429300a30, xsp=140068147048152) at /old/fwork/sydr/dynamorio/ext/drwrap/drwrap.c:1902
#3  0x000000004d60133d in ?? ()
#4  0x0000000000000000 in ?? ()

pt->wrap_level is zero

derekbruening commented 3 years ago

pt->wrap_level is zero

Because of the extra increment and so now the loop walking backward hits an empty level at 0 where nothing was filled in due to the redirect?

DynamoRIO / dynamorio

[CRASH] Dynamorio: received signal SIGSEGV while using wrappers. #4607