NVBit 1.5.5 (with NVIDIA driver 510.47.03 and CUDA 11.3) occasionally segfaults in the Nvbit::module_unloading function. The Nvbit::~Nvbit() destructor appears to be the cause of the fault.
These are the steps I took to verify the problem.
Thread 1 "bfs-rodinia-3.1" hit Breakpoint 1, 0x00007f93503e8ec0 in Nvbit::module_unloading(CUctx_st*, CUmod_st*) () from /work/tracer_nvbit/tracer_tool/tracer_tool.so
$17 = 13
$18 = 0
$19 = 0x557606f26700
"v=>"$20 = 0x230
$21 = 0x0
$22 = 0x0
$23 = 0x0
$24 = 0x0
$25 = 0x0
$26 = 0x0
$27 = 0x0
$28 = 0x0
$29 = 0x0
$30 = 0x0
$31 = 0x0
$32 = 0x0
Thread 1 "bfs-rodinia-3.1" received signal SIGSEGV, Segmentation fault.
0x00007f93503e8f60 in Nvbit::module_unloading(CUctx_st*, CUmod_st*) () from /work/tracer_nvbit/tracer_tool/tracer_tool.so
A debugging session is active.
According to the disassembly, the segmentation fault occurs while accessing an internal hash-like structure that uses CUmod_st* as its key. I printed the contents of that structure with GDB; they are the values labeled $19 through $32 above. The line right after $19 (marked "v=>") shows the value stored in the $19 entry, 0x230, and accessing that address is what triggers the segmentation fault. The structure has clearly been corrupted.
To find the cause, I ran the program under Valgrind Memcheck. For simplicity, I used the instr_count example tool.
This is the output, trimmed for clarity.
==55509== Invalid read of size 8
==55509== at 0x6AE4721: Nvbit::module_unloading(CUctx_st*, CUmod_st*)
==55509== by 0x6AECA23: nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*)
...
==55509== Address 0x9021da0 is 48 bytes inside a block of size 104 free'd
==55509== at 0x483CFBF: operator delete(void*)
==55509== by 0x6ADFE4D: Nvbit::~Nvbit()
==55509== by 0x4B498D6: __run_exit_handlers (exit.c:108)
==55509== by 0x4B49A8F: exit (exit.c:139)
==55509== by 0x4B270B9: (below main) (libc-start.c:342)
==55509== Block was alloc'd at
==55509== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==55509== by 0x6AE32CE: Nvbit::create_ctx(CUctx_st*) (in /work/tracer_nvbit/nvbit_release/tools/instr_count/instr_count.so)
==55509== by 0x6AECA57: nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*) (in /work/tracer_nvbit/nvbit_release/tools/instr_count/instr_count.so)
....
The trace shows that the Nvbit destructor ran from the exit handlers before module_unloading was called. module_unloading still depends on the Nvbit object, so it reads memory that the destructor has already freed.
I do not know why the Nvbit object is destroyed before the modules are unloaded, or whether this problem is specific to NVBit 1.5.5.
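The ordering problem visible in the Valgrind trace can be modeled in a few lines. The following is only a sketch under stated assumptions: ModuleTable, g_table, register_module, teardown and on_module_unloading are hypothetical names for illustration and are not NVBit's real internals. teardown() stands in for the destructor that the exit handlers run, and the null check shows one way a late callback could avoid touching freed state.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for per-module bookkeeping keyed by a module handle.
// This is NOT NVBit's actual data layout.
struct ModuleTable {
    std::unordered_map<std::uintptr_t, int> functions_per_module;
};

// Owning pointer; resetting it plays the role of the destructor that runs
// from the exit handlers in the trace.
static std::unique_ptr<ModuleTable> g_table = std::make_unique<ModuleTable>();

// Records a module while the table is still alive.
void register_module(std::uintptr_t module_key, int function_count) {
    if (g_table) {
        g_table->functions_per_module[module_key] = function_count;
    }
}

// Simulates the exit-time teardown: the table is destroyed here.
void teardown() {
    g_table.reset();
}

// Simulates a module-unloading callback. Returns true if the module was
// found and removed, false if the key is unknown or the table has already
// been destroyed (the late-callback case).
bool on_module_unloading(std::uintptr_t module_key) {
    if (!g_table) {
        return false;  // table already destroyed: do not touch it
    }
    return g_table->functions_per_module.erase(module_key) == 1;
}
```

In this sketch, a callback that arrives after teardown() returns false instead of reading freed memory. Whether a comparable check is possible inside NVBit cannot be determined from the traces above.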
For reference, I tested AccelSim's tracer_tool and several other tools on the bfs and b+tree benchmarks of rodinia-3.1 (https://github.com/accel-sim/gpu-app-collection); the segmentation faults shown above come from those runs.