Open koparasy opened 5 years ago
@koparasy do you always see the same cell as the problem, or does that change from time to time? Some of the CUDA versions have an unidentified race condition. Since no one was able to find it, I believe the fix was to synchronize after each kernel.
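A minimal sketch of that synchronize-after-each-kernel workaround, assuming a CUDA/C++ code base (the kernel names below are hypothetical, not from LULESH itself):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Helper: synchronize after a kernel launch and surface any error
// immediately, so a failure gets attributed to the right kernel
// instead of showing up later as a downstream volume error.
static void syncAndCheck(const char *where)
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error after %s: %s\n",
                where, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical call site: insert a sync after every kernel launch.
//
//   CalcSomethingKernel<<<grid, block>>>(args...);
//   syncAndCheck("CalcSomethingKernel");
//
//   CalcNextKernel<<<grid, block>>>(args...);
//   syncAndCheck("CalcNextKernel");
```

This serializes kernel launches, so it costs performance; it is a debugging measure, not a tuned fix.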
Note that this code was developed by Nvidia and is not officially maintained. I will reach out to them to see what the fix was and whether they can provide anything.
@ikarlin, no, the cell id as well as the iteration number change between executions.
@koparasy thanks. I have confirmed with Nvidia this is the known race condition. We are discussing the best way to get the fix into the code. Do you have a timeline you need this done on? That might influence our choice.
I'm having the same issue. Is the race condition solved now?
I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2). I am using openmpi-3.0.0 with CUDA 9.1. I execute the following command: mpirun -n 27 ./lulesh -s 60 and I get the following error: Rank 22: Volume Error in cell 211619 at iteration 14. The error appears at a different iteration on each execution. Any idea what is causing this error?
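One quick experiment that needs no code change: CUDA's launch-blocking mode makes every kernel launch synchronous, which approximates the synchronize-after-each-kernel workaround mentioned above. If the error is the known inter-kernel race it may disappear under this mode (an intra-kernel race would not be affected). This is a sketch using Open MPI's -x option to export the variable to all ranks:

```shell
# Debugging only: CUDA_LAUNCH_BLOCKING=1 forces each kernel launch to
# block until the kernel completes, at a significant performance cost.
mpirun -x CUDA_LAUNCH_BLOCKING=1 -n 27 ./lulesh -s 60
```

If the volume error vanishes under launch blocking but returns without it, that is consistent with the race condition ikarlin described.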