celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Debug parallel crashes running with multiple streams on Frontier #1313

Open sethrj opened 3 months ago

sethrj commented 3 months ago

We discovered that multithreaded Geant4 runs hang with ROCm 5.7.1 and higher. The problem appears to be either a regression in asynchronous memory allocation that results in a race condition, or possibly a bug in Thrust: we've seen cases where a kernel launch on one thread and an async malloc/free on another cause the app to lock up.

TODO: fill this in from OLCF help tickets

sethrj commented 2 months ago

Worked around using #1318