Open koparasy opened 5 years ago
@koparasy do you always see the same cell as the problem, or does that change from time to time? Some of the CUDA versions have an unidentified race condition. Since no one was able to find it, I believe the fix was to synchronize after each kernel.
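A minimal sketch of that synchronize-after-each-kernel workaround, assuming a CUDA/C++ code base (the kernel names below are hypothetical, not from LULESH itself):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Helper: synchronize after a kernel launch and surface any error
// immediately, so a failure gets attributed to the right kernel
// instead of showing up later as a downstream volume error.
static void syncAndCheck(const char *where)
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error after %s: %s\n",
                where, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical call site: insert a sync after every kernel launch.
//
//   CalcSomethingKernel<<<grid, block>>>(args...);
//   syncAndCheck("CalcSomethingKernel");
//
//   CalcNextKernel<<<grid, block>>>(args...);
//   syncAndCheck("CalcNextKernel");
```

This serializes kernel launches, so it costs performance; it is a debugging measure, not a tuned fix.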
Note that this code was developed by Nvidia and is not officially maintained. I will reach out to them to see what the fix was and whether they can provide anything.
@ikarlin, no, the cell id as well as the iteration number change between executions.
@koparasy thanks. I have confirmed with Nvidia this is the known race condition. We are discussing the best way to get the fix into the code. Do you have a timeline you need this done on? That might influence our choice.
I'm having the same issue. Is the race condition solved now?
I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2). I am using openmpi-3.0.0 with CUDA 9.1. I execute the following command: mpirun -n 27 ./lulesh -s 60 and I get the following error: Rank 22: Volume Error in cell 211619 at iteration 14. The error appears at a different iteration on each execution. Any idea what is causing this error?
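One quick experiment that needs no code change: CUDA's launch-blocking mode makes every kernel launch synchronous, which approximates the synchronize-after-each-kernel workaround mentioned above. If the error is the known inter-kernel race it may disappear under this mode (an intra-kernel race would not be affected). This is a sketch using Open MPI's -x option to export the variable to all ranks:

```shell
# Debugging only: CUDA_LAUNCH_BLOCKING=1 forces each kernel launch to
# block until the kernel completes, at a significant performance cost.
mpirun -x CUDA_LAUNCH_BLOCKING=1 -n 27 ./lulesh -s 60
```

If the volume error vanishes under launch blocking but returns without it, that is consistent with the race condition ikarlin described.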