DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.64k stars 560 forks source link

CRASH (f86115ad python) core/unix/module.c:101 assert #4904

Open varun729 opened 3 years ago

varun729 commented 3 years ago

Describe the bug Running DynamoRIO with python application that uses Tensorflow results in a crash. DynamoRIO can successfully run with a trivial helloworld python program, and even with a python application that uses PyTorch.

The goal is to use dynamorio with drcachesim to collect memory trace, but it crashes. This is similar to the issue #2929 and the suggestion there was to try debugging without any DynamoRIO client. So, providing details here from the debug run.

To Reproduce Steps to reproduce the behavior:

  1. Run the python 3.7 with tensorflow 1.15 installed in Anaconda environment.
  2. Precise command line for running the application.
    drrun -debug -- python -c "import tensorflow as tf"
  3. Exact output or incorrect behavior. debug_drio_python_tensorflow.txt

Please also answer these questions:

Expected behavior Program should run to completion when run without any client.

Versions

abhinav92003 commented 3 years ago

Extracting the assert failure from attached logs for easy access:

Application /home/user/.conda/envs/tf1.15/bin/python3.7 (4911).  Internal Error: DynamoRIO debug check failure: /home/user/DynamoRioTop/dynamorio/core/unix/module.c:101 ma->os_data.num_segments > 0 && ma->os_data.segments != NULL

I also see a bunch of curiosities:

<CURIOSITY : elf_hdr->e_phoff != 0 && elf_hdr->e_phoff + elf_hdr->e_phnum * elf_hdr->e_phentsize <= view_size in file /home/user/DynamoRioTop/dynamorio/core/unix/module_elf.c line 363

<CURIOSITY : found_load && mod_base != (app_pc)(0x7fffffffffffffffLL * 2ULL + 1ULL) && max_end != (app_pc)0 in file /home/user/DynamoRioTop/dynamorio/core/unix/module_elf.c line 434

<CURIOSITY : max_end > mod_base in file /home/user/DynamoRioTop/dynamorio/core/unix/module_elf.c line 436
derekbruening commented 3 years ago

What are the procfs maps entries for these mappings that DR is complaining about? Is it an unusual ELF file and we just need to relax DR's assumptions? Are these fatal: does DR run fine in release build?

varun729 commented 3 years ago

I'm not sure if you are referring to this:

7f0247f34000-7f0247f42000 rw-p 00000000 00:00 0

The above entry is corresponding to these mappings:

0x00007f0247f413e0 0x00000000710e482c
0x00007f0247f41630 0x00000000712f8cbd
0x00007f0247f416c0 0x00000000711c3557
0x00007f0247f41720 0x00000000711c3b8b
0x00007f0247f41770 0x00000000712d8cbc
0x00007f0247f418a0 0x00000000712d8d53
0x00007f0247f41900 0x00000000711a65d5
0x00007f0247f41a90 0x000000007126aa06
0x00007f0247f41ad0 0x0000000071272c38
0x00007f0247f41d30 0x000000007127b750
0x00007f0247f41f00 0x00000000710ccc26
0x00007f0247f41ff0 0x00007f0447ea2187

Attached is the proc/maps for the process right before it crashed (in debug mode): maps_minimal_example.txt

Running DR release also crashes. Below is the output:

<Application /home/user/.conda/envs/tf1.15/bin/python3.7 (31319).  DynamoRIO internal crash at PC 0x00007fd48275e5fa.  Please report this at http://dynamorio.org/issues/.  Program aborted.
Received SIGSEGV at pc 0x00007fd48275e5fa in thread 31319
Base: 0x00007fd48263b000
Registers:eax=0x0000000000000000 ebx=0x00007fd23e37f978 ecx=0x00007fd23e37f988 edx=0x0000000000000000
        esi=0x00007fd23e37f980 edi=0x0000000007c18000 esp=0x00007fd23e37f950 ebp=0x00007fd23e7e5f90
        r8 =0x00007fd23e37f978 r9 =0x0000000000000000 r10=0x0000000000000000 r11=0x0000000000000001
        r12=0x00007fd21b72f000 r13=0x00007fd23e7e5ee0 r14=0x00007fd23e37f988 r15=0x00007fd23e37f980
        eflags=0x0000000000010246
version 8.0.18758, custom build
-no_dynamic_options -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
0x00007fd23e7e5f90 0x00007fd222eb3000
0x00007fd21b72f000 0x0000000000000000>
derekbruening commented 3 years ago

I'm not sure if you are referring to this:

No, it looks like you are looking at a DR callstack's frame pointers at the assert point or something? That is not a point of interest here. All these asserts and curiosities are complaining about one of the application's mappings that contains an ELF header. The question is, what is the deal with these application mappings that DR is tripping over with asserts and curiosities?

varun729 commented 3 years ago

@derekbruening After looking at the log files, I find that the problematic shared library is libmklml_intel.so.

I added some log prints to the module_elf.c and found the ELF header values:

module_walk_program_headers: elf_hdr->e_phoff: 0x7c18000
module_walk_program_headers: elf_hdr->e_phnum: 0x9
module_walk_program_headers: elf_hdr->e_phentsize: 0x38
module_walk_program_headers: view_size: 0x0000000007784000
...
SYSLOG_WARNING: CURIOSITY : elf_hdr->e_phoff != 0 && elf_hdr->e_phoff + elf_hdr->e_phnum * elf_hdr->e_phentsize <= view_size in file /home/user/DynamoRioTop/dynamorio/core/unix/module_elf.c line 365
....
....
module_walk_program_headers: found_load: 0
module_walk_program_headers: mod_base: 0x0000000000000000
module_walk_program_headers: max_end: 0x0000000000000000
SYSLOG_WARNING: CURIOSITY : found_load && mod_base != (app_pc)(0x7fffffffffffffffLL * 2ULL + 1ULL) && max_end != (app_pc)0 in file /home/user/DynamoRioTop/dynamorio/core/unix/module_elf.c line 437

Do you need any other information that can help in debug?

petrochenkov commented 3 years ago

I observed the same curiosity (followed by a hard assert) on binaries produced by https://github.com/facebookincubator/BOLT.

Program header (Phdr) in that case is in a different memory region (different segment?) from the ELF header (Ehdr), but DynamoRIO incorrectly assumes that they are always in the same region in memquery_library_bounds_by_iterator_internal -> module_walk_program_headers.

varun729 commented 3 years ago

@petrochenkov thanks! I now understand the curiosity warning and the assert that follows is because the program header was not found. @derekbruening Is it possible to relax the constraint that program header has to be within the same memory region?

Further, I now understand why I didn't encounter this problem a few months ago when I run the same application on a different machine. The library libmklml_intel.so was an older version where the phoff=0x40, so the assumption made in DynamoRIO was satisfied.

derekbruening commented 3 years ago

@derekbruening Is it possible to relax the constraint that program header has to be within the same memory region?

Yes, ideally DR would not make any assumptions. (I believe there are some other known bad assumptions that could be cleaned up too, like all ELF segments having the same alignment.) I don't know what code has to be changed to relax this particular assumption -- if you are able to, please investigate and submit a PR.

varun729 commented 3 years ago

@derekbruening I should be able to spend some time early next week.

varun729 commented 3 years ago

@derekbruening want to know if I'm on the correct path.

Below are the mappings for the library. Typically, the program headers are in the executable mapping of a library. For this ELF file, the program headers are at an offset of 0x7c18000, so they are part of the 5th mapping. In order to read the program headers we need to find out where the 5th mapping is loaded in memory and I'm trying to find out how to get that information in the DynamoRIO source code.

2ad788638000-2ad78fdbc000 r-xp 00000000 00:ff 3018275816                 /home/user/.conda/envs/tf_1.15/lib/python3.7/site-packages/_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib/libmklml_intel.so
2ad78fdbc000-2ad78ffbc000 ---p 07784000 00:ff 3018275816                 /home/user/.conda/envs/tf_1.15/lib/python3.7/site-packages/_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib/libmklml_intel.so
2ad78ffbc000-2ad78ffdf000 r--p 07784000 00:ff 3018275816                 /home/user/.conda/envs/tf_1.15/lib/python3.7/site-packages/_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib/libmklml_intel.so
2ad78ffdf000-2ad7900a4000 rw-p 077a7000 00:ff 3018275816                 /home/user/.conda/envs/tf_1.15/lib/python3.7/site-packages/_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib/libmklml_intel.so
2ad7900ca000-2ad790136000 rw-p 07c18000 00:ff 3018275816                 /home/user/.conda/envs/tf_1.15/lib/python3.7/site-packages/_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib/libmklml_intel.so
derekbruening commented 3 years ago

Looking at this more closely: this is actually non-trivial to solve. It seems possible for the program headers to not be in a loaded segment at all; or for the first loaded segment's file offset to not be 0, in which case DR's code blindly using the file offset as a virtual offset will fail; or as here to be in a different segment, but we can't identify the segments without walking the program headers. The solution is heavyweight: go read the file on disk. We have to do this today for the section headers (for handling RSEQ) which for most ELF files are not loaded; see https://github.com/DynamoRIO/dynamorio/blob/master/core/unix/rseq_linux.c#L530.

Then there's what to do if the file is not available: it could have been unlinked after mmapping.

I suppose we have to hope all of these corner cases remain rare and just live with the overhead of the file access for these cases -- at least for DR attaching mid-run. For DR launching the app: DR could record the virt offset at mmap time (and conceivably could look for unlink and make a copy of the program headers at that point: if the unlink is done in-process; or make a copy up front just in case). (Wondering whether ld.so handles these corner cases for dl_iterate_phdr.)

shizhanglei1 commented 2 years ago

I also encountered the same problem. Do you have a solution?

derekbruening commented 2 years ago

I also encountered the same problem. Do you have a solution?

If you would like to help by contributing code, my prior comment https://github.com/DynamoRIO/dynamorio/issues/4904#issuecomment-848437479 tries to lay out the issues and solutions. For non-attach, as I proposed it may be best for DR to record the program header virtual offset at the file's first mmap time.

chaitanyaupp18 commented 8 months ago

I observed the same curiosity (followed by a hard assert) on binaries produced by https://github.com/facebookincubator/BOLT.

Program header (Phdr) in that case is in a different memory region (different segment?) from the ELF header (Ehdr), but DynamoRIO incorrectly assumes that they are always in the same region in memquery_library_bounds_by_iterator_internal -> module_walk_program_headers.

Just to help users who want to use DynamoRIO on binaries produced by https://github.com/facebookincubator/BOLT, run BOLT with -use-gnu-stack option like this, $llvm-bolt -use-gnu-stack "other arguments you want to pass to llvm-bolt" "binary" -o "optimized-binary" while producing the binary that you want to use with DynamoRIO.