NVlabs / NVBit

Segmentation Faults due to accessing deallocated structures in Nvbit #89

Open hanhwi opened 2 years ago

hanhwi commented 2 years ago

Nvbit 1.5.5 (with Nvidia driver 510.47.03 and CUDA 11.3) occasionally segfaults in Nvbit::module_unloading. The cause appears to be the Nvbit::~Nvbit() destructor, which runs before module_unloading is called.

These are the steps I used to verify the problem.

I ran AccelSim's tracer_tool and several other tools on the bfs and b+-tree benchmarks from rodinia-3.1 (https://github.com/accel-sim/gpu-app-collection) and got the following segmentation fault.

Thread 1 "bfs-rodinia-3.1" hit Breakpoint 1, 0x00007f93503e8ec0 in Nvbit::module_unloading(CUctx_st*, CUmod_st*) () from /work/tracer_nvbit/tracer_tool/tracer_tool.so
$17 = 13
$18 = 0
$19 = 0x557606f26700
"v=>"$20 = 0x230
$21 = 0x0
$22 = 0x0
$23 = 0x0
$24 = 0x0
$25 = 0x0
$26 = 0x0
$27 = 0x0
$28 = 0x0
$29 = 0x0
$30 = 0x0
$31 = 0x0
$32 = 0x0

Thread 1 "bfs-rodinia-3.1" received signal SIGSEGV, Segmentation fault.
0x00007f93503e8f60 in Nvbit::module_unloading(CUctx_st*, CUmod_st*) () from /work/tracer_nvbit/tracer_tool/tracer_tool.so
A debugging session is active.

According to the disassembled instructions, the segmentation fault happens while accessing an internal hash-like structure that uses CUmod_st* as its key. I printed the contents of that structure with GDB; they are the values $19 through $32 above. The line directly below $19 ($20) is the value stored in the $19 entry, 0x230, and accessing that address triggers the fault. The structure has clearly been corrupted.

To dig into the cause, I ran the program under valgrind memcheck. For simplicity, I used the instr_count example tool. The output below is trimmed for clarity.

==55509== Invalid read of size 8
==55509==    at 0x6AE4721: Nvbit::module_unloading(CUctx_st*, CUmod_st*) 
==55509==    by 0x6AECA23: nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*) 
...
==55509==  Address 0x9021da0 is 48 bytes inside a block of size 104 free'd
==55509==    at 0x483CFBF: operator delete(void*) 
==55509==    by 0x6ADFE4D: Nvbit::~Nvbit() 
==55509==    by 0x4B498D6: __run_exit_handlers (exit.c:108)
==55509==    by 0x4B49A8F: exit (exit.c:139)
==55509==    by 0x4B270B9: (below main) (libc-start.c:342)
==55509==  Block was alloc'd at
==55509==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==55509==    by 0x6AE32CE: Nvbit::create_ctx(CUctx_st*) (in /work/tracer_nvbit/nvbit_release/tools/instr_count/instr_count.so)
==55509==    by 0x6AECA57: nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*) (in /work/tracer_nvbit/nvbit_release/tools/instr_count/instr_count.so)
....

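The trace shows that the Nvbit destructor ran from the exit handlers before module_unloading was called, even though module_unloading depends on state owned by the Nvbit object. The block it reads was allocated in Nvbit::create_ctx and freed in Nvbit::~Nvbit(). I don't know why the Nvbit object is destroyed before the modules are unloaded, or whether this problem is specific to Nvbit 1.5.5.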
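
To make the ordering concrete, here is a minimal sketch of the hazard and one possible guard. It is not NVBit's code: ToolSketch, CtxState, ModuleKey, shutdown and late_unload_is_ignored are hypothetical names invented for illustration, and I am assuming the per-context state is a map keyed by the module pointer. The idea is only that an unload callback arriving after teardown should check whether the state still exists instead of dereferencing it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins: none of these names are NVBit's real internals.
using ModuleKey = const void*;  // plays the role of CUmod_st*

struct CtxState {
    // Assumed shape of the hash-like structure keyed by module pointer.
    std::unordered_map<ModuleKey, std::string> modules;
};

class ToolSketch {
public:
    // Allocates the per-context state (the block valgrind shows as alloc'd).
    void create_ctx() { state_ = std::make_unique<CtxState>(); }

    void module_loaded(ModuleKey key, std::string name) {
        if (state_) state_->modules[key] = std::move(name);
    }

    // Stands in for the teardown that the destructor performs at exit.
    void shutdown() { state_.reset(); }

    // Returns true if the module was found and removed.
    // Returns false if the key is unknown or the state is already gone.
    bool module_unloading(ModuleKey key) {
        if (!state_) return false;  // guard: never touch freed state
        return state_->modules.erase(key) == 1;
    }

private:
    std::unique_ptr<CtxState> state_;
};

// Replays the order seen in the valgrind trace: teardown first, unload second.
inline bool late_unload_is_ignored() {
    ToolSketch tool;
    int module_a = 0, module_b = 0;  // their addresses serve as distinct keys
    tool.create_ctx();
    tool.module_loaded(&module_a, "a");
    tool.module_loaded(&module_b, "b");

    bool normal = tool.module_unloading(&module_a);  // state alive: handled
    tool.shutdown();                                 // teardown at exit
    bool late = tool.module_unloading(&module_b);    // state gone: skipped

    assert(normal);
    assert(!late);
    return normal && !late;
}
```

Calling late_unload_is_ignored() from a small main() exercises both paths. Whether a guard like this is the right fix for NVBit, or whether the destructor should instead be prevented from running before the driver finishes unloading modules, is something only the maintainers can judge.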

ovilla commented 2 years ago

Sorry for missing this earlier, thanks for reporting... we will try to reproduce.