NVlabs / NVBit


Segmentation Fault in Nvbit::compute_max_stack_size #28

Closed by crozhon 3 years ago

crozhon commented 3 years ago

I'm trying to use NVBit to profile an application. I get a segmentation fault after the first call to cudaMemcpyToSymbol. Both nvbit_at_init() and nvbit_at_cuda_event() appear to be called before the crash. I also tried CUDA_INJECTION64_PATH instead of LD_PRELOAD.

LD_PRELOAD=./mem_trace.so ./pbrt --gpu --pixel 1,1 ~/Downloads/pbrt-v4-scenes/smoke-plume/plume.pbrt 
pbrt version 4 (built Oct  3 2020 at 16:49:56)
Copyright (c)1998-2020 Matt Pharr, Wenzel Jakob, and Greg Humphreys.
The source code to pbrt (but *not* the book contents) is covered by the BSD License.
See the file LICENSE.txt for the conditions of the license.
------------- NVBit (NVidia Binary Instrumentation Tool v1.4) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
Segmentation fault (core dumped)

Here's a stack trace from cuda-gdb. It looks like compute_max_stack_size keeps calling itself without terminating. Any ideas why this might be?

#0  0x00007ffff7ed1059 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#1  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#2  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#3  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#4  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#5  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#6  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#7  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#8  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#9  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
[snip]
#174500 0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*) () from ./mem_printf.so
#174501 0x00007ffff7edd81d in Nvbit::module_loaded(CUctx_st*, void const*, unsigned long, CUmod_st*) () from ./mem_printf.so
#174502 0x00007ffff7eddf1a in nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*) () from ./mem_printf.so
#174503 0x00007ffff6d1aef3 in ?? () from /usr/lib/libcuda.so.1
#174504 0x00007ffff6b61bcd in ?? () from /usr/lib/libcuda.so.1
#174505 0x00007ffff6a96698 in ?? () from /usr/lib/libcuda.so.1
#174506 0x00007ffff6a96dfc in ?? () from /usr/lib/libcuda.so.1
#174507 0x0000555555b46c04 in cudart::contextState::loadCubin(bool*, cudart::globalModule*) ()
#174508 0x0000555555b3c34e in cudart::globalModule::loadIntoContext(cudart::contextState*) ()
#174509 0x0000555555b4d324 in cudart::contextState::applyChanges() ()
#174510 0x0000555555b51aea in cudart::contextStateManager::initRuntimeContextState_nonreentrant(cudart::contextState**) ()
#174511 0x0000555555b51d84 in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#174512 0x0000555555b32140 in cudart::cudaApiMemcpyToSymbol(void const*, void const*, unsigned long, unsigned long, cudaMemcpyKind) ()
#174513 0x0000555555b6dfa0 in cudaMemcpyToSymbol ()
#174514 0x00005555558babe0 in pbrt::InitLogging(pbrt::LogConfig, bool) ()
#174515 0x00005555557efb57 in pbrt::InitPBRT(pbrt::PBRTOptions const&) ()
#174516 0x000055555566ffe3 in main ()

System configuration: nvcc release 11.0, V11.0.194; driver version 450.57; CUDA version 11.0.

I confirmed everything works properly with the vectoradd example, so I don't think it's an issue with my system configuration. Does anyone have any insight into what's going on here?
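The roughly 174,500 identical compute_max_stack_size frames in the trace above are consistent with a depth-first walk over a call graph that never terminates, for example because the graph contains a cycle. NVBit's internals are not shown in this thread, so the following is only a minimal sketch of that general failure mode and one common guard against it. The `Function` struct, its fields, and both helper functions are hypothetical and are not NVBit's actual types or code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a device function in a call graph.
// These fields are assumptions made for illustration only.
struct Function {
    std::size_t frame_bytes = 0;     // stack used by this function alone
    std::vector<Function*> callees;  // functions it may call
};

// Naive walk: if the call graph has a cycle (a -> b -> a), this recurses
// until the host stack overflows, which shows up as a segmentation fault.
std::size_t max_stack_naive(const Function* f) {
    assert(f != nullptr);
    std::size_t deepest = 0;
    for (const Function* c : f->callees)
        deepest = std::max(deepest, max_stack_naive(c));
    return f->frame_bytes + deepest;
}

// Guarded walk: remembers which functions are on the current path.
// A call back into the path contributes 0 bytes and sets saw_cycle,
// so the caller knows the result is only a lower bound.
std::size_t max_stack_guarded(const Function* f,
                              std::unordered_set<const Function*>& on_path,
                              bool& saw_cycle) {
    assert(f != nullptr);
    if (!on_path.insert(f).second) {  // already on the current path
        saw_cycle = true;
        return 0;
    }
    std::size_t deepest = 0;
    for (const Function* c : f->callees)
        deepest = std::max(deepest, max_stack_guarded(c, on_path, saw_cycle));
    on_path.erase(f);  // leaving f: it is no longer on the current path
    return f->frame_bytes + deepest;
}
```

With a true cycle there is no finite maximum stack size, so the guarded version reports the cycle instead of guessing. What a tool should do in that case (fall back to a fixed limit, for instance) is a separate design decision.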

ovilla commented 3 years ago

We need a little more information to see what is going on. Which GPU is it? Can you point us to the exact version of the application? Thanks,

crozhon commented 3 years ago

Thanks for responding so quickly. This is with an RTX 2080 Ti (SM 7.5).

The application is pbrt-v4 from the latest master (ea9e5fdef6), which is available on GitHub and is pretty easy to build. All you need to build it is OptiX, and it uses CMake. It's a ray tracer that's been adapted from CPU code, so some of the kernels look a bit nasty.

I was able to isolate a specific set of kernels as the issue. When you comment out the contents of EvaluateMaterialAndBSDF in src/pbrt/gpu/surfscatter.cpp (https://github.com/mmp/pbrt-v4/blob/master/src/pbrt/gpu/surfscatter.cpp), the problem goes away and I'm able to instrument the application as expected. So it seems related to the lambda specified by that function. I can try to come up with a smaller self-contained example if this isn't enough to go on.

ovilla commented 3 years ago

I was able to reproduce on my side and I will try to work on it this week. Thanks for pointing this out!

ovilla commented 3 years ago

The issue should be resolved in NVBit version 1.5 (just released). Please let us know if it works for you. Thanks again for reporting.

crozhon commented 3 years ago

Forgot to comment, but this worked perfectly. Thanks so much for your effort on it.