I've not tried to reproduce it with visualisation enabled. That was a different issue preventing compilation.
Mixed up discussions from earlier. Repro'd without vis under linux and updated the above.
Narrowed it down to these three tests causing the CUDAError to be thrown. Removing any of the 3 stops the error being raised.
python3 -m pytest ../tests/swig/python/gpu/test_cuda_simulation.py ../tests/swig/python/io/test_logging.py ../tests/swig/python/io/test_logging_exceptions.py -vv -k "test_set_get_population_data or test_CUDAEnsembleSimulate"
===================================== test session starts =====================================
platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/ptheywood/code/flamegpu/FLAMEGPU2/build/lib/Debug/python/venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/ptheywood/code/flamegpu/FLAMEGPU2
collected 28 items / 25 deselected / 3 selected
../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data PASSED [ 33%]
../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data_invalid_cuda_agent PASSED [ 66%]
../tests/swig/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate PASSED [100%]
============================== 3 passed, 25 deselected in 0.55s ===============================
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
what(): /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(149): invalid argument
Aborted (core dumped)
cuda-gdb locks up pretty hard when debugging locally (might be better on a device not running my screen), so resorted to some printf in EnvironmentManager::~EnvironmentManager:
fprintf(stdout, "EnvironmentManager::~EnvironmentManager %p h_buffer %p d_buffer %p\n", this, h_buffer, d_buffer); fflush(stdout);
Which then outputs
../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data PASSED [ 33%]EnvironmentManager::~EnvironmentManager 0x55f741d77b90 h_buffer 0x55f7420b3b70 d_buffer 0x7f3075000000
../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data_invalid_cuda_agent PASSED [ 66%]
../tests/swig/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate PASSED [100%]
============================== 3 passed, 25 deselected in 0.54s ===============================
EnvironmentManager::~EnvironmentManager 0x55f74277be20 h_buffer 0x55f7420d4930 d_buffer 0x7f3075000000
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
what(): /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(153): invalid argument
EnvironmentManager::~EnvironmentManager 0x55edd7e48d00 h_buffer 0x55edd774eb60 d_buffer 0x7f50a3000000
I.e. the dtor is called on different instances of EnvironmentManager, both of which had the same d_buffer but different h_buffer, so it's a double free.
Adding printfs to EnvironmentManager::init, which sets the value of h_buffer, didn't do anything, which doesn't make much sense.
After a bisect, 631b2b036970ca74bd3bab616ee17ab0e32113f0 is the first bad commit in terms of pytest emitting CUDA/Python errors.
26ed242 is therefore the last good commit.
The offending commit changed device reset behaviour, which lines up with cuda errors being emitted when freeing device pointers.
I'm leaning towards the root cause being the handling of python callbacks though, rather than the newer device reset logic being wrong. We think this may be related to #498.
I've implemented some wrappers around cudaFree and cudaFreeHost, and manual alternate error handling for cudaStreamDestroy, as a potential workaround for this. E.g. if the pointer is not valid for the current context, don't attempt to free it. The stream logic is not quite as nice, but as close as I can get.
This has been pushed to the wrapped-cuda-free branch, but there's no PR for now as I'm not convinced this is the final solution. It does temperamentally result in a segfault within libcuda.so, which I can't get any useful debug info for, so I wouldn't merge this any time soon.
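For reference, a minimal sketch of the kind of guard described above (the function name and exact checks here are assumptions, not the actual wrapped-cuda-free code):

// Hypothetical guard around cudaFree: skip the free if the runtime no longer
// recognises the pointer in the current context (e.g. after a device reset),
// rather than letting the CUDA error checking macro throw.
#include <cuda_runtime.h>

inline cudaError_t guardedCudaFree(void *devPtr) {
    if (devPtr == nullptr)
        return cudaSuccess;
    cudaPointerAttributes attrs = {};
    const cudaError_t query = cudaPointerGetAttributes(&attrs, devPtr);
    if (query != cudaSuccess || attrs.type != cudaMemoryTypeDevice) {
        // Not a live device allocation in this context; clear the sticky
        // error state and treat the free as a no-op.
        cudaGetLastError();
        return cudaSuccess;
    }
    return cudaFree(devPtr);
}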
In general though I am leaning towards this being an issue with the device reset logic.
Individual simulations use a class-scoped atomic to track the number of CUDASimulation objects, and only reset the device when this reaches 0.
In cases where there are CUDASimulation instances on different devices, this will only ever reset the device used by the last destroyed simulation.
The CUDASimulation resetting is also disabled for the C++ test suite. This looks to have been done for performance reasons (context creation is relatively time consuming, and for 1000 tests that adds up). It is not disabled for the python test suite, so may still be a factor.
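A rough sketch of the pattern described above (an illustration only, not the actual CUDASimulation source):

#include <atomic>
#include <cuda_runtime.h>

// A class-scoped atomic counts live instances; only the destructor of the
// last remaining instance triggers the reset, and only for the *current*
// device, so devices used by earlier instances are never reset.
class SimulationLike {
 public:
    SimulationLike() { ++active_instances; }
    ~SimulationLike() {
        if (--active_instances == 0) {
            cudaDeviceReset();
        }
    }
 private:
    static std::atomic<int> active_instances;
};
std::atomic<int> SimulationLike::active_instances{0};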
CUDAEnsemble::simulate resets all devices involved in the ensemble, but it doesn't check whether there are any active CUDASimulation instances, so it will break any simulations which exist at the same time.
// Create and initialise a simulation, to ensure there is device stuff involved
CUDASimulation sim = ...
sim.step();
// Create an ensemble and simulate it
CUDAEnsemble ens = ...
ens.simulate(); // This resets all devices involved
sim.step(); // This will probably die miserably, as the cuda context has been reset
The above is not an intended use case, but it is possible, and is effectively what is happening with the delayed CUDASimulation destruction due to python GC.
I've written a C++ test which uses the above pattern and triggers the CUDAError seen in the python test suite, so the GC was a bit of a red herring that distracted us from the root cause (I found the bad commit via the bisect, but didn't spot the actual issue at the time).
I've pushed this to the bad-device-reset branch, but will discuss this prior to implementing a fix.
./bin/Debug/tests --gtest_filter="TestCUDAEnsemble.ResetWithCUDASimulationInstance"
Running main() from /home/ptheywood/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
Note: Google Test filter = TestCUDAEnsemble.ResetWithCUDASimulationInstance
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestCUDAEnsemble
[ RUN ] TestCUDAEnsemble.ResetWithCUDASimulationInstance
/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(168): invalid device context
/home/ptheywood/code/flamegpu/FLAMEGPU2/tests/test_cases/gpu/test_cuda_ensemble.cu:305: Failure
Expected: simulation.step() doesn't throw an exception.
Actual: it throws flamegpu::exception::CUDAError with description "/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(168): invalid device context".
/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/exception/FLAMEGPUDeviceException.cu(17): invalid argument
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
what(): /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/exception/FLAMEGPUDeviceException.cu(17): invalid argument
Aborted (core dumped)
Without the EXPECT_NO_THROW, letting the dtor run on scope change also triggers the exception and abort.
When combined with the wrapping from the other branch, it replicates the observed segfault.
We need to re-think the device reset logic, and implement tests for it including these sorts of edge cases. We won't hit them all, but at least a few will help prevent this.
After more thought I'm leaning towards us not doing automatic cudaDeviceReset calls within the library, as it mutates global state which is out of FLAMEGPU's scope (e.g. if a user has some CUDA memory allocated already, it would be invalidated by an automatic, always-on part of a library, which feels like poor form).
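As a concrete illustration of that concern (the user code here is assumed for the example, not taken from FLAMEGPU):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // A user-owned allocation made before / outside any library calls.
    float *user_data = nullptr;
    cudaMalloc(&user_data, 1024 * sizeof(float));

    // If a library performs an automatic cudaDeviceReset() internally, the
    // primary context (and every allocation in it) is destroyed behind the
    // user's back.
    cudaDeviceReset();

    // Subsequent use of the now-stale pointer fails.
    const cudaError_t status = cudaMemset(user_data, 0, 1024 * sizeof(float));
    printf("cudaMemset after reset: %s\n", cudaGetErrorString(status));
    return 0;
}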
We reset for 3 reasons really:
We can ensure profiling data is flushed using other CUDA API calls (which might have been driver API only originally, so we learned a device reset was the way to do it).
This would be a cudaDeviceSynchronize() and a cudaProfilerStop() for each active device / cuda context (so we may need to use the driver equivalent for models with RTC, if they use more than one / a separate context).
If this is not done immediately prior to an exit, we would probably want to re-enable profiling via cudaProfilerStart() (as stop is more than just a flush).
This might not want to be automatic, as the start/stop methods might be required if someone wishes to only profile a small part of a long-running application (starting is implicit when profiling, unless profiler CLI args are used). We could either control this via a CMake option, or make it a user-callable method (which we would want to add to every binary we produce and to the templates). We would lose device information once the ensemble / sim is dtor'd though, so it would probably need to be a method on the appropriate object?
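A minimal sketch of what that flush could look like, assuming we track the set of devices that have been used (the function and parameter names are placeholders):

#include <set>
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Flush profiling data for every device that has been touched, without a
// full cudaDeviceReset. cudaProfilerStart() could be called afterwards if
// the application continues and further profiling is wanted.
void flushProfilingData(const std::set<int> &devices_used) {
    int original_device = 0;
    cudaGetDevice(&original_device);
    for (const int device : devices_used) {
        cudaSetDevice(device);
        cudaDeviceSynchronize();  // ensure outstanding work has completed
        cudaProfilerStop();       // stop and flush profiling for this context
    }
    cudaSetDevice(original_device);
}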
Accurate leak checking via cuda-memcheck etc. requires context destruction (so a device reset in the runtime API).
3.8. Leak Checking
The memcheck tool can detect leaks of allocated memory.
Memory leaks are device side allocations that have not been freed by the time the context is destroyed. The memcheck tool tracks device memory allocations created using the CUDA driver or runtime APIs. Starting in CUDA 5, allocations that are created dynamically on the device heap by calling malloc() inside a kernel are also tracked.
For an accurate leak checking summary to be generated, the application's CUDA context must be destroyed at the end. This can be done explicitly by calling cuCtxDestroy() in applications using the CUDA driver API, or by calling cudaDeviceReset() in applications programmed against the CUDA run time API.
The --leak-check full option must be specified to enable leak checking.
Again though, this doesn't need to happen as early, but it would need to reset all devices which had been touched. With this in place it will also find any bad free/malloc pairs.
In terms of making sure we haven't left any global state in a bad way, we'd just need to make sure we don't do that. IIRC there is only one global device symbol still defined, so otherwise it should just be a case of making sure all allocations are deallocated at the correct time.
On balance of these factors, the final proposed solution is a safe exit function to perform the device resets.
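Something along these lines, where the name and signature are placeholders for whatever we settle on, and the user is responsible for calling it once no CUDASimulation / CUDAEnsemble instances remain:

#include <set>
#include <cuda_runtime.h>

// Explicit, user-callable cleanup: reset every device the library has
// touched, destroying their primary contexts so profiling data is flushed
// and cuda-memcheck can produce an accurate leak summary.
void cleanup(const std::set<int> &devices_used) {
    for (const int device : devices_used) {
        cudaSetDevice(device);
        cudaDeviceReset();
    }
}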
@Robadob found that the python test suite would exit with a Fatal Python error: Aborted using the current master branch a76da8ca75a543513f5ea6c108a52da6f0c70afc. I've reproduced this under linux, for release builds with and without visualisation enabled.
As Rob found, the reported error in io/test_logging_exceptions.py is fine in isolation. The io suite on its own is OK. Running the gpu and io suites together triggers the error too.
I've managed to trigger a CUDAError by narrowing the test down further, to test_cuda_simulation.py & test_logging.py. Running test_cuda_simulation and io reproduces the original error message.
Looks like EnvironmentManager::~EnvironmentManager() is related somehow; could it be CUDA-in-dtor related?
Excerpt from #873:
Have now built and run the test suite for each of the 3 curand engines consecutively. All worked as expected.
When running the pyflamegpu test suite, it crashes with a fatal error (see end of this comment). However, that error is also currently present in the master pyflamegpu (windows/release) suite too. Hence, I consider this PR ready.
This error only occurs when running the full suite; the individual test_logging_exceptions.py set of tests runs fine in isolation.
io\test_logging_exceptions.py ..Fatal Python error: Aborted
Current thread 0x00007aa8 (most recent call first):
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\capture.py", line 780 in pytest_runtest_setup
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 34 in _multicall
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 259 in
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 338 in from_call
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 258 in call_runtest_hook
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 219 in call_and_report
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 124 in runtestprotocol
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 111 in pytest_runtest_protocol
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 347 in pytest_runtestloop
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 322 in _main
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 268 in wrap_session
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 315 in pytest_cmdline_main
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\config\__init__.py", line 164 in main
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\config\__init__.py", line 187 in console_main
File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\Scripts\py.test.exe\__main__.py", line 7 in
File "C:\ProgramData\Miniconda3\lib\runpy.py", line 87 in _run_code
File "C:\ProgramData\Miniconda3\lib\runpy.py", line 197 in _run_module_as_main
Warning: Input file 'test.json' refers to second input file 'invalid', this will not be loaded.
RTC Initialisation Processing time: 0.000000 s
Warning: Input file 'test.json' refers to second input file 'invalid', this will not be loaded.
Step 1 Processing time: 0.000000 s
Warning: Input file 'test.xml' refers to second input file 'invalid', this will not be loaded.
RTC Initialisation Processing time: 0.000000 s
Warning: Input file 'test.xml' refers to second input file 'invalid', this will not be loaded.
Step 1 Processing time: 0.000000 s
(venv) (base) C:\Users\Robadob\fgpu2\tests\swig\python>