FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

`cleanup()` with RTC builds segfault in libcuda.so #1061

Open ptheywood opened 1 year ago

ptheywood commented 1 year ago

A Python user encountered segfaults in code which calls `cleanup()` many times, when FLAME GPU is used within a larger iterative process.

This appears to be due to RTC destruction after the device has been reset, with a segfault occurring during cuModuleUnload within Jitify's destroy_streams.

Now that this is narrowed down, we should be able to reproduce it, first as a C++ RTC test, and then ideally as a jitify MWE that we could report upstream?

For now, the workaround is to just not call `cleanup()` for RTC simulations?
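For reference, the failing pattern is roughly the following. This is only a minimal C++ sketch assuming the standard 2.0 value-semantic API names; the actual report was via pyflamegpu, and the agent function body here is just a placeholder.

```cpp
#include "flamegpu/flamegpu.h"

// Any RTC agent function forces Jitify compilation; the body is a placeholder.
const char *RTC_FUNC_SRC = R"###(
FLAMEGPU_AGENT_FUNCTION(do_nothing, flamegpu::MessageNone, flamegpu::MessageNone) {
    return flamegpu::ALIVE;
}
)###";

int main() {
    // "Larger iterative process": build, run and clean up a simulation many times.
    for (int i = 0; i < 100; ++i) {
        flamegpu::ModelDescription model("rtc_cleanup_repro");
        flamegpu::AgentDescription agent = model.newAgent("agent");
        flamegpu::AgentFunctionDescription fn = agent.newRTCFunction("do_nothing", RTC_FUNC_SRC);
        model.newLayer().addAgentFunction(fn);

        flamegpu::AgentVector pop(agent, 64);
        flamegpu::CUDASimulation sim(model);
        sim.setPopulationData(pop);
        sim.step();

        // cleanup() resets the device(s), destroying the CUDA context ...
        flamegpu::util::cleanup();
        // ... but sim's dtor still runs at the end of this iteration, unloading the
        // RTC CUmodules against the now-reset context - the crash site in the trace below.
    }
    return 0;
}
```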

Robadob commented 1 year ago

Full RelWithDebug stack trace, from a build of the v2.0.0-rc tag.

#0  0x00002aaac2bd6cb5 in ?? () from /lib64/libcuda.so.1
#1  0x00002aaac2ca4cb5 in ?? () from /lib64/libcuda.so.1
#2  0x00002aaabe6046f3 in destroy_module (this=0x2e03fe0, this=0x2e03fe0)
    at /home/ac1rch/FLAMEGPU2/build_w_debug/_deps/jitify-src/jitify/jitify.hpp:1215
#3  ~CUDAKernel (this=0x2e03fe0, __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/build_w_debug/_deps/jitify-src/jitify/jitify.hpp:1300
#4  operator() (this=0x30350c0, __ptr=0x2e03fe0)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/unique_ptr.h:81
#5  ~unique_ptr (this=0x30350c0, __in_chrg=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/unique_ptr.h:274
#6  ~KernelInstantiation (this=0x30350c0, __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/build_w_debug/_deps/jitify-src/jitify/jitify.hpp:4157
#7  operator() (this=0x6705250, __ptr=0x30350c0)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/unique_ptr.h:81
#8  ~unique_ptr (this=0x6705250, __in_chrg=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/unique_ptr.h:274
#9  ~pair (this=0x6705230, __in_chrg=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/stl_pair.h:198
#10 destroy<std::pair<std::__cxx11::basic_string<char> const, std::unique_ptr<jitify::experimental::KernelInstantiation> > > (this=0x1c9d4b8, __p=0x6705230)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/ext/new_allocator.h:140
#11 destroy<std::pair<std::__cxx11::basic_string<char> const, std::unique_ptr<jitify::experimental::KernelInstantiation> > > (__a=..., __p=0x6705230)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/alloc_traits.h:487
#12 _M_destroy_node (this=0x1c9d4b8, __p=0x6705210)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/stl_tree.h:661
#13 _M_drop_node (this=0x1c9d4b8, __p=0x6705210)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/stl_tree.h:669
#14 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<jitify::experimental::KernelInstantiation, std::default_delete<jitify::experimental::KernelInstantiation> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<jitify::experimental::KernelInstantiation, std::default_delete<jitify::experimental::KernelInstantiation> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<jitify::experimental::KernelInstantiation, std::default_delete<jitify::experimental::KernelInstantiation> > > > >::_M_erase (this=this@entry=0x1c9d4b8,
    __x=0x6705210)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/stl_tree.h:1874
#15 0x00002aaabe604abe in ~_Rb_tree (this=0x1c9d4b8, __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/include/flamegpu/simulation/detail/CUDAAgent.h:33
#16 ~map (this=0x1c9d4b8, __in_chrg=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/stl_map.h:300
#17 ~CUDAAgent (this=0x1c9d450, __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/include/flamegpu/simulation/detail/CUDAAgent.h:33
#18 flamegpu::detail::CUDAAgent::~CUDAAgent (this=0x1c9d450,
    __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/include/flamegpu/simulation/detail/CUDAAgent.h:33
#19 0x00002aaabe638d72 in operator() (this=0x122b438, __ptr=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/hashtable.h:2045
#20 ~unique_ptr (this=0x122b438, __in_chrg=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/unique_ptr.h:274
#21 ~pair (this=0x122b418, __in_chrg=<optimized out>)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/stl_pair.h:198
#22 destroy<std::pair<std::__cxx11::basic_string<char> const, std::unique_ptr<flamegpu::detail::CUDAAgent> > > (this=<optimized out>, __p=0x122b418)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/ext/new_allocator.h:140
#23 destroy<std::pair<std::__cxx11::basic_string<char> const, std::unique_ptr<flamegpu::detail::CUDAAgent> > > (__a=..., __p=0x122b418)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/alloc_traits.h:487
#24 _M_deallocate_node (this=<optimized out>, __n=0x122b410)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/hashtable_policy.h:2100
#25 _M_deallocate_nodes (this=0x2866b70, __n=0x0)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/hashtable_policy.h:2113
#26 std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<flamegpu::detail::CUDAAgent, std::default_delete<flamegpu::detail::CUDAAgent> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<flamegpu::detail::CUDAAgent, std::default_delete<flamegpu::detail::CUDAAgent> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear (this=this@entry=0x2866b70)
    at /usr/local/packages/dev/gcc/8.2.0/include/c++/8.2.0/bits/hashtable.h:2047
#27 0x00002aaabe62df77 in clear (this=0x2866b70)
    at /home/ac1rch/FLAMEGPU2/src/flamegpu/simulation/CUDASimulation.cu:197
#28 flamegpu::CUDASimulation::~CUDASimulation (this=0x2866a00,
    __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/src/flamegpu/simulation/CUDASimulation.cu:197
#29 0x00002aaabe62e269 in flamegpu::CUDASimulation::~CUDASimulation (
    this=0x2866a00, __in_chrg=<optimized out>)
    at /home/ac1rch/FLAMEGPU2/src/flamegpu/simulation/CUDASimulation.cu:180
#30 0x00002aaabe2b6557 in _wrap_delete_CUDASimulation ()
    at /home/ac1rch/FLAMEGPU2/build_w_debug/swig/python/pyflamegpu/flamegpuPYTHON_wrap.cxx:90012
#31 0x00002aaabe2584de in SwigPyObject_dealloc (v=0x2aaabdf6ffc0)
    at /home/ac1rch/FLAMEGPU2/build_w_debug/swig/python/pyflamegpu/flamegpuPYTHON_wrap.cxx:1581
#32 0x000000000055f468 in subtype_dealloc ()
#33 0x0000000000522fd8 in _PyFrame_Clear ()
#34 0x0000000000512210 in _PyEval_EvalFrameDefault ()
#35 0x00000000005cc57e in _PyEval_Vector ()
#36 0x00000000005cbb9f in PyEval_EvalCode ()
#37 0x00000000005ed7b7 in run_eval_code_obj ()
#38 0x00000000005e9dd0 in run_mod ()
ptheywood commented 1 year ago

Turns out I did write some pyflamegpu tests for cleanup, which call cleanup prior to the simulation and ensemble dtors, with an agent function (which means RTC). Presumably this doesn't trigger the issue due to GC non-determinism or similar (or even non-determinism within libcuda.so's UB). The tests explicitly call cleanup prior to destruction of a CUDASimulation object within the same scope, so they should be triggering the dtors after the cleanup method.

From the above stack trace, it's the dtor of the CUDAAgent, and I think destruction of the member CUDARTCFuncMap, a map containing unique_ptr<jitify::experimental::KernelInstantiation>, which leads to the segfault.
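For context, CUDARTCFuncMap (declared in CUDAAgent.h) is roughly the following shape, matching the template arguments visible in frames #9-#14 of the trace; this is an approximation, the exact declaration lives in CUDAAgent.h.

```cpp
#include <map>
#include <memory>
#include <string>
// jitify.hpp (fetched as a dependency) provides jitify::experimental::KernelInstantiation.

// Approximate shape of the CUDAAgent member whose destruction walks into cuModuleUnload.
typedef std::map<const std::string,
                 std::unique_ptr<jitify::experimental::KernelInstantiation>> CUDARTCFuncMap;
```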

A disgusting "fix" would be to explicitly release the unique pointers in the map in a custom CUDAAgent dtor if the context is no longer valid (we would need to test for this somehow; see #1056 as an example of the gross way this might need doing, though it might be cleaner to just allocate 1 byte in the current context and do the valid-pointer check on it).

I.e. we would intentionally leak the entire jitify object to avoid the offending CUDA call leading to a segfault.
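A rough sketch of that workaround, purely illustrative (the probe helper is made up, and the `rtc_func_map` member name is assumed; the real dtor would differ):

```cpp
#include <cuda.h>

// Hypothetical helper: returns false if the current CUDA context is gone/unusable,
// using the 1-byte allocation probe suggested above.
static bool currentContextIsValid() {
    CUcontext ctx = nullptr;
    if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || ctx == nullptr)
        return false;
    CUdeviceptr probe = 0;
    if (cuMemAlloc(&probe, 1) != CUDA_SUCCESS)
        return false;
    cuMemFree(probe);
    return true;
}

flamegpu::detail::CUDAAgent::~CUDAAgent() {
    if (!currentContextIsValid()) {
        // Intentionally leak the jitify KernelInstantiations so their dtors never
        // call cuModuleUnload against a dead/reset context.
        for (auto &entry : rtc_func_map)   // member name assumed
            entry.second.release();
    }
    // Normal destruction continues; the now-empty unique_ptrs destruct safely.
}
```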

A cleaner fix would be to explicitly check the context / CUmodule is still valid within jitify. They are not checking the error code of the cuModuleUnload call, implying they might be expecting it to error in some cases, but the segfault instead is the underlying problem. I can't find a method to explicitly check whether a module is loaded in the current context. Potentially cuModuleGetFunction could be used, expecting it to error if the module is bad (CUDA_ERROR_NOT_FOUND returned), but there's no guarantee this wouldn't trigger the same segfault/UB.

The cuModuleUnload documentation states it unloads the module from the current context, and that it's undefined behaviour to use the handle after the function call, but it says nothing about UB if the module is not loaded in the current context (though presumably that is the case triggering the issue).
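If Jitify guarded the unload itself, it might look roughly like this. Sketch only: the member names are placeholders rather than Jitify's real fields, and whether the probe sidesteps the same UB is exactly the open question above.

```cpp
// Illustrative guard inside something like CUDAKernel::destroy_module();
// _module / _func_name are placeholder member names, not Jitify's actual fields.
void destroy_module() {
    if (_module) {
        // Probe whether the module handle is still usable in the current context
        // before unloading; any error (e.g. CUDA_ERROR_NOT_FOUND) means "don't touch it".
        // There is no guarantee this avoids the same segfault/UB.
        CUfunction probe = nullptr;
        CUresult status = cuModuleGetFunction(&probe, _module, _func_name.c_str());
        if (status == CUDA_SUCCESS) {
            cuModuleUnload(_module);
        }
        _module = nullptr;
    }
}
```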

Alternatively, we could have a more stateful/state-aware cleanup method, which knows what objects need deleting prior to device reset, though this won't prevent issues from a user adding explicit device resets themselves (which should not be encouraged, but cannot be prevented), and it will require a bunch of additional changes to track everything (though maybe we could just track CUDASimulations and CUDAEnsembles, as everything else should be a child of those due to device/context selection).
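A sketch of what that stateful cleanup could look like, using a hypothetical registry and a hypothetical releaseRTCState() method (neither exists in the codebase; the device-reset loop approximates what cleanup() does today):

```cpp
#include <mutex>
#include <set>
#include <cuda_runtime.h>

namespace flamegpu {

// Minimal stand-in for the real class, just enough for the sketch to compile.
class CUDASimulation {
 public:
    void releaseRTCState() { /* hypothetical: drop CUDARTCFuncMap entries while the context is live */ }
};

namespace util {
namespace {
std::mutex registry_mutex;
std::set<CUDASimulation *> live_simulations;  // hypothetical registry, maintained by ctor/dtor
}  // namespace

void cleanup() {
    std::lock_guard<std::mutex> lock(registry_mutex);
    // 1. Tear down RTC/jitify state while the contexts are still alive.
    for (CUDASimulation *sim : live_simulations)
        sim->releaseRTCState();
    // 2. Only then reset every device, roughly as cleanup() does today.
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int i = 0; i < device_count; ++i) {
        cudaSetDevice(i);
        cudaDeviceReset();
    }
}
}  // namespace util
}  // namespace flamegpu
```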

ptheywood commented 1 year ago

Adding a version of TestCleanup.CUDASimulation with a single RTC member function does not reproduce the segfault either, with CUDA 11.2 on my local machine.

The Simulation dtor is definitely being called after the cleanup/device reset/context destruction, and implicitly this should be after the CUDAAgent destruction (via the destruction of the CUDAAgentMap member variable, which should hold a unique_ptr to each CUDAAgent for the lifetime of the CUDASimulation object (post init)).

Given the original segfault was non-deterministic / required many runs, the segfault might only occur when some other condition is met (i.e. a large number of contexts being created and destroyed, or a large number of agent functions?).

Robadob commented 1 year ago

I could potentially modify the erroring case to make it seeded, and see whether that makes it deterministic. All that is changing in each case is the body of an agent function, so potentially it's the (compiled?) size of the kernel that matters. It would make sense that a larger allocation may have a higher chance of upsetting some memory.

ptheywood commented 1 year ago

Using the same kernel 1000 times, destroying the context and then destroying the simulation(s) each time, doesn't trigger the issue either (it just takes a very long time to run, 271s), so it could be related to the kernel size as expected, or to some other condition being met.

As it's in libcuda.so, it could also be driver version dependent, hence not reproducing on 530.30.02 even when using CUDA 11.2.

@Robadob had reproduced this on the K80 node which is running a 470.X.Y driver (as the last supported Kepler driver(s)), though the original user who encountered this issue was likely on a much more modern driver (and Windows iirc, so potentially we are chasing a different issue as well).

ptheywood commented 1 year ago

The TestCleanup.CUDASimulationRTC test in the cleanup-rtc branch does not produce the segfault when executed on the K80 node in ShARC, using CUDA 11.2 (and driver 470.182.3), so we need to keep investigating how to reproduce this.

Mutating the test to perform 300 calls to cleanup/dtor did not reproduce the fault either.
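For reference, the mutated test amounts to something like the following. This is a sketch rather than the exact cleanup-rtc branch code; the suite/test names and googletest usage follow the existing TestCleanup tests, and the RTC source is a placeholder.

```cpp
#include "gtest/gtest.h"
#include "flamegpu/flamegpu.h"

namespace {
const char *RTC_EMPTY_FUNC = R"###(
FLAMEGPU_AGENT_FUNCTION(rtc_empty, flamegpu::MessageNone, flamegpu::MessageNone) {
    return flamegpu::ALIVE;
}
)###";
}  // namespace

// Sketch of TestCleanup.CUDASimulationRTC mutated to repeat the cleanup()-then-dtor
// cycle many times, as described above.
TEST(TestCleanup, CUDASimulationRTC_Repeated) {
    for (int i = 0; i < 300; ++i) {
        flamegpu::ModelDescription model("cleanup_rtc");
        flamegpu::AgentDescription agent = model.newAgent("agent");
        flamegpu::AgentFunctionDescription fn = agent.newRTCFunction("rtc_empty", RTC_EMPTY_FUNC);
        model.newLayer().addAgentFunction(fn);

        flamegpu::AgentVector pop(agent, 32);
        flamegpu::CUDASimulation sim(model);
        sim.setPopulationData(pop);
        sim.step();

        flamegpu::util::cleanup();
        // sim's dtor runs here, after the device reset - the ordering under test.
    }
}
```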