FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

Python Fatal Error during pytest #939

Closed (ptheywood closed this issue 1 year ago)

ptheywood commented 2 years ago

@Robadob found that the python test suite would exit with a Fatal Python error: Aborted when using the current master branch a76da8ca75a543513f5ea6c108a52da6f0c70afc.

I've reproduced this under Linux, for release builds both with and without visualisation enabled.

As Rob found, io/test_logging_exceptions.py (where the error is reported) is fine in isolation. The io suite on its own is also OK.

Running the gpu and io suites together triggers the error too:

python3 -m pytest ../tests/swig/python/gpu/ ../tests/swig/python/io/

... 

../tests/swig/python/io/test_logging_exceptions.py ..Fatal Python error: Aborted

I've managed to trigger a CUDAError by narrowing the test down further, to test_cuda_simulation.py & test_logging.py.

python3 -m pytest ../tests/swig/python/gpu/test_cuda_simulation.py ../tests/swig/python/io/test_logging.py 
===================================== test session starts =====================================
platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0
rootdir: /home/ptheywood/code/flamegpu/FLAMEGPU2
collected 23 items                                                                            

../tests/swig/python/gpu/test_cuda_simulation.py ......s.............                   [ 86%]
../tests/swig/python/io/test_logging.py ...                                             [100%]

================================ 22 passed, 1 skipped in 0.66s ================================
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(149): invalid argument
Aborted (core dumped)

Running test_cuda_simulation and the io suite together reproduces the original error message.

python3 -m pytest ../tests/swig/python/gpu/test_cuda_simulation.py ../tests/swig/python/io/
===================================== test session starts =====================================
platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0
rootdir: /home/ptheywood/code/flamegpu/FLAMEGPU2
collected 30 items                                                                            

../tests/swig/python/gpu/test_cuda_simulation.py ......s.............                   [ 66%]
../tests/swig/python/io/test_io.py ..                                                   [ 73%]
../tests/swig/python/io/test_logging.py Fatal Python error: Aborted

Looks like EnvironmentManager::~EnvironmentManager() is involved somehow; possibly CUDA calls in the destructor are the issue?


Excerpt from #873:

Have now built and run the test suite for each of the 3 curand engines consecutively. All worked as expected.

When running the pyflamegpu test suite, it crashes with a fatal error (see the end of this comment). However, that error is also currently present in the master pyflamegpu (Windows/Release) suite, so I consider this PR ready.

This error only occurs when running the full suite; the test_logging_exceptions.py set of tests runs fine in isolation.

io\test_logging_exceptions.py ..Fatal Python error: Aborted

Current thread 0x00007aa8 (most recent call first):
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\capture.py", line 780 in pytest_runtest_setup
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 34 in _multicall
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 259 in
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 338 in from_call
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 258 in call_runtest_hook
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 219 in call_and_report
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 124 in runtestprotocol
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\runner.py", line 111 in pytest_runtest_protocol
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 347 in pytest_runtestloop
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 322 in _main
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 268 in wrap_session
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\main.py", line 315 in pytest_cmdline_main
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\config\__init__.py", line 164 in main
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\lib\site-packages\_pytest\config\__init__.py", line 187 in console_main
  File "C:\Users\Robadob\fgpu2\build2\lib\Release\python\venv\Scripts\py.test.exe\__main__.py", line 7 in
  File "C:\ProgramData\Miniconda3\lib\runpy.py", line 87 in _run_code
  File "C:\ProgramData\Miniconda3\lib\runpy.py", line 197 in _run_module_as_main

Warning: Input file 'test.json' refers to second input file 'invalid', this will not be loaded.
RTC Initialisation Processing time: 0.000000 s
Warning: Input file 'test.json' refers to second input file 'invalid', this will not be loaded.
Step 1 Processing time: 0.000000 s
Warning: Input file 'test.xml' refers to second input file 'invalid', this will not be loaded.
RTC Initialisation Processing time: 0.000000 s
Warning: Input file 'test.xml' refers to second input file 'invalid', this will not be loaded.
Step 1 Processing time: 0.000000 s

(venv) (base) C:\Users\Robadob\fgpu2\tests\swig\python>

Robadob commented 2 years ago

I've not tried to reproduce it with visualisation enabled. That was a different issue preventing compilation.

ptheywood commented 2 years ago

I mixed up discussions from earlier. Reproduced without visualisation under Linux and updated the above.

ptheywood commented 2 years ago

Narrowed it down to these three tests causing the CUDAError to be thrown; removing any one of the three stops the error being raised.

python3 -m pytest ../tests/swig/python/gpu/test_cuda_simulation.py ../tests/swig/python/io/test_logging.py ../tests/swig/python/io/test_logging_exceptions.py -vv -k "test_set_get_population_data or test_CUDAEnsembleSimulate"
===================================== test session starts =====================================
platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/ptheywood/code/flamegpu/FLAMEGPU2/build/lib/Debug/python/venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/ptheywood/code/flamegpu/FLAMEGPU2
collected 28 items / 25 deselected / 3 selected                                               

../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data PASSED [ 33%]
../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data_invalid_cuda_agent PASSED [ 66%]
../tests/swig/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate PASSED  [100%]

============================== 3 passed, 25 deselected in 0.55s ===============================
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(149): invalid argument
Aborted (core dumped)
ptheywood commented 2 years ago

cuda-gdb locks up pretty hard when debugging locally (it might be better on a device not running my display), so I resorted to adding a printf in

EnvironmentManager::~EnvironmentManager:

    fprintf(stdout, "EnvironmentManager::~EnvironmentManager %p h_buffer %p d_buffer %p\n", this, h_buffer, d_buffer); fflush(stdout);

Which then outputs

../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data PASSED [ 33%]EnvironmentManager::~EnvironmentManager 0x55f741d77b90 h_buffer 0x55f7420b3b70 d_buffer 0x7f3075000000

../tests/swig/python/gpu/test_cuda_simulation.py::TestSimulation::test_set_get_population_data_invalid_cuda_agent PASSED [ 66%]
../tests/swig/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate PASSED  [100%]

============================== 3 passed, 25 deselected in 0.54s ===============================
EnvironmentManager::~EnvironmentManager 0x55f74277be20 h_buffer 0x55f7420d4930 d_buffer 0x7f3075000000
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(153): invalid argument
EnvironmentManager::~EnvironmentManager 0x55edd7e48d00 h_buffer 0x55edd774eb60 d_buffer 0x7f50a3000000

I.e. the dtor is called on different instances of EnvironmentManager, both of which have the same d_buffer but different h_buffers, so it's a double free.

Adding printfs to EnvironmentManager::init, which sets the value of h_buffer, didn't produce any output, which doesn't make much sense.

ptheywood commented 2 years ago

After a bisect, 631b2b036970ca74bd3bab616ee17ab0e32113f0 is the first bad commit in terms of pytest emitting CUDA/Python errors.

26ed242 is therefore the last good commit.

The offending commit changed device reset behaviour, which lines up with cuda errors being emitted when freeing device pointers.

I'm leaning towards the root cause being the handling of Python callbacks though, rather than the newer device reset logic being wrong. We think this may be related to #498.

ptheywood commented 2 years ago

I've implemented some wrappers around cudaFree and cudaFreeHost, plus manual alternative error handling for cudaStreamDestroy, as a potential workaround for this. E.g. if the pointer is not valid for the current context, don't attempt to free it. The stream logic is not quite as nice, but as close as I can get.

This has been pushed to the wrapped-cuda-free branch, but no PR for now as I'm not convinced this is the final solution.

It does intermittently result in a segfault within libcuda.so, which I can't get any useful debug info for, so I wouldn't merge this any time soon.
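
For reference, a minimal sketch of the kind of wrapper described above (the helper names are hypothetical; this is not the actual wrapped-cuda-free implementation): it attempts the free, and if the owning context has already been reset it clears and ignores the resulting error rather than throwing flamegpu::exception::CUDAError.

#include <cuda_runtime.h>

// Hypothetical wrapper: free a device pointer, tolerating the case where the
// owning context has already been reset/destroyed (cudaFree then returns an
// error such as cudaErrorInvalidValue).
inline cudaError_t safeCudaFree(void *d_ptr) {
    if (d_ptr == nullptr)
        return cudaSuccess;
    cudaError_t status = cudaFree(d_ptr);
    if (status != cudaSuccess) {
        // Clear the last-error state so it doesn't surface in later checks.
        cudaGetLastError();
    }
    return status;
}

// Same idea for page-locked host memory allocated with cudaMallocHost.
inline cudaError_t safeCudaFreeHost(void *h_ptr) {
    if (h_ptr == nullptr)
        return cudaSuccess;
    cudaError_t status = cudaFreeHost(h_ptr);
    if (status != cudaSuccess) {
        cudaGetLastError();
    }
    return status;
}

As noted, this papers over the underlying double free / bad reset ordering rather than fixing it.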


In general though I am leaning towards this being an issue with the device reset logic.

Individual simulations use a class-scoped atomic to track the number of live CUDASimulation objects, and the device is only reset when this count reaches 0.

In cases where there are CUDASimulation instances on different devices, this will only ever reset the device used by the last destroyed simulation.

The CUDASimulation resetting is also disabled for the C++ test suite. This looks to have been done for performance reasons (context creation is relatively time consuming, and for 1000 tests that adds up). It is not disabled for the Python test suite, so may still be a factor.
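
For illustration, a heavily simplified sketch of that instance-counting reset pattern and why it misbehaves with multiple devices (the class and member names are hypothetical, not the actual CUDASimulation internals):

#include <atomic>
#include <cuda_runtime.h>

// Hypothetical model of the reset-on-last-instance behaviour.
class SimLike {
 public:
    explicit SimLike(int device) : device_(device) { ++active_instances; }
    ~SimLike() {
        if (--active_instances == 0) {
            // Only the device used by the *last* destroyed instance is reset;
            // devices used by earlier instances are never reset, and any
            // unrelated CUDA state on this device is destroyed as well.
            cudaSetDevice(device_);
            cudaDeviceReset();
        }
    }
 private:
    int device_;
    static std::atomic<int> active_instances;
};
std::atomic<int> SimLike::active_instances{0};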

CUDAEnsemble::simulate resets all devices involved in the ensemble, but it doesn't check whether there are any active CUDASimulation instances, so it will break any simulations which exist at the same time:

// Create and initialise a simulation, to ensure there is device stuff involved
CUDASimulation sim = ...
sim.step();

// Create an ensemble and simulate it
CUDAEnsemble ens = ...
ens.simulate(); // This resets all devices involved

sim.step(); // This will probably die miserably, as the cuda context has been reset

The above is not an intended use case, but it is possible, and is effectively what is happening with the delayed CUDASimulation destruction due to Python GC.


I've written a C++ test which uses the above pattern, and it triggers the CUDAError seen in the Python test suite, so the GC-delayed destruction was a bit of a red herring that distracted us from the root cause (even though I found the offending commit via the bisect, I didn't spot the actual issue at the time).

I've pushed this to the bad-device-reset branch, but will discuss this prior to implementing a fix.

./bin/Debug/tests --gtest_filter="TestCUDAEnsemble.ResetWithCUDASimulationInstance"
Running main() from /home/ptheywood/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
Note: Google Test filter = TestCUDAEnsemble.ResetWithCUDASimulationInstance
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestCUDAEnsemble
[ RUN      ] TestCUDAEnsemble.ResetWithCUDASimulationInstance
/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(168): invalid device context
/home/ptheywood/code/flamegpu/FLAMEGPU2/tests/test_cases/gpu/test_cuda_ensemble.cu:305: Failure
Expected: simulation.step() doesn't throw an exception.
  Actual: it throws flamegpu::exception::CUDAError with description "/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/runtime/utility/EnvironmentManager.cu(168): invalid device context".
/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/exception/FLAMEGPUDeviceException.cu(17): invalid argument
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(28): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2/src/flamegpu/exception/FLAMEGPUDeviceException.cu(17): invalid argument
Aborted (core dumped)

Without the EXPECT_NO_THROW, letting the dtor run at scope exit also triggers the exception and abort.

When combined with the wrapping from the other branch, it replicates the observed segfault.


We need to re-think the device reset logic and implement tests for it, including these sorts of edge cases. We won't hit them all, but at least a few will help prevent this.

ptheywood commented 2 years ago

After more thought, I'm leaning towards us not doing automatic cudaDeviceReset calls within the library, as it mutates global state which is outside FLAMEGPU's scope (e.g. if a user already has some CUDA memory allocated, it would be invalidated by an automatic, always-on part of a library, which feels like poor form).

We reset for 3 reasons really:

  1. To ensure profiler data is flushed.
  2. To ensure cuda-memcheck works correctly.
  3. To ensure we've not left any of our global device stuff allocated or at non-default values.

We can ensure profiling data is flushed using other CUDA API calls (these might have been driver API only originally, which would be why we learned a device reset was the way to do it).

This would be a cudaDeviceSynchronize() and a cudaProfilerStop() for each active device / CUDA context (so we may need to use the driver equivalent for models with RTC, if they use more than one context / a separate context). If this is not done immediately prior to an exit, we would probably want to re-enable profiling via cudaProfilerStart() (as stop is more than just a flush).
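
A minimal sketch of that flush, assuming the runtime profiler API (cuda_profiler_api.h) is sufficient; for simplicity it iterates every visible device rather than only the devices/contexts the library actually used:

#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Sketch: flush profiling data without a cudaDeviceReset().
void flushProfilerData(bool resumeProfiling) {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess)
        return;
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();  // ensure outstanding work has completed
        cudaProfilerStop();       // flush collected profile data
        if (resumeProfiling) {
            cudaProfilerStart();  // re-enable if the application keeps running
        }
    }
}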


This might not want to be automatic, as the start/stop methods might be required if someone wishes to profile only a small part of a long-running application (starting is implicit when profiling, unless profiler CLI args are used). We could either control this via a CMake option, or make it a user-callable method (which we would want to add to every binary we produce, and to the templates)? We would lose the device information once the ensemble / sim is dtor'd though, so it would probably need to be a method on the appropriate object?


Accurate leak checking via cuda-memcheck etc. requires context destruction (i.e. a device reset when using the runtime API).

3.8. Leak Checking

The memcheck tool can detect leaks of allocated memory.

Memory leaks are device side allocations that have not been freed by the time the context is destroyed. The memcheck tool tracks device memory allocations created using the CUDA driver or runtime APIs. Starting in CUDA 5, allocations that are created dynamically on the device heap by calling malloc() inside a kernel are also tracked.

For an accurate leak checking summary to be generated, the application's CUDA context must be destroyed at the end. This can be done explicitly by calling cuCtxDestroy() in applications using the CUDA driver API, or by calling cudaDeviceReset() in applications programmed against the CUDA run time API.

The --leak-check full option must be specified to enable leak checking.

Again though, this doesn't need to happen as early; it would just need to reset all devices which had been touched. With this in place, it will also find any bad malloc/free pairs.
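
A sketch of what such an explicit, end-of-application reset might look like (i.e. a user-callable safe-exit/cleanup helper rather than anything automatic); resetting every visible device is an over-approximation of "all devices which had been touched":

#include <cuda_runtime.h>

// Sketch: destroy each device's context via cudaDeviceReset() so that
// cuda-memcheck / compute-sanitizer can produce an accurate leak report.
// This must only run once nothing else in the process still needs CUDA.
void resetAllDevicesAtExit() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess)
        return;
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);
        cudaDeviceReset();
    }
}

This would typically be called at the very end of main(), immediately before return.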


In terms of making sure we haven't left any global state in a bad way, we'd just need to make sure we don't do that. IIRC there is only one global device symbol still defined, so otherwise it should just be a matter of making sure all allocations are deallocated at the correct time.

mondus commented 2 years ago

On balance of factors, the final proposed solution is a safe exit function to perform the device resets.