On a 4-node run, the memory check with cuda-memcheck is clean when the standard tool (memcheck) is used: no bad memory accesses are reported, so none should be present. I have used an srun command line of the following form:
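(The node/task counts and the executable name app.x below are illustrative placeholders; the exact command is not recorded here.)
srun -N 4 -n 4 cuda-memcheck --tool memcheck ./app.x > memcheck.log 2>&1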
Instead, using the racecheck tool (selecting --tool racecheck instead of the default memcheck in the command above), a large number of errors are reported for a cuBLAS call. All the messages are of the following type:
========= Race reported between Read access at 0x00002648 in dgemm_sm_heavy_ldg_nn
========= and Write access at 0x000021c8 in dgemm_sm_heavy_ldg_nn [897 hazards]
========= and Write access at 0x000020b8 in dgemm_sm_heavy_ldg_nn [2000 hazards]
========= and Write access at 0x000022d0 in dgemm_sm_heavy_ldg_nn [2127 hazards]
========= Race reported between Write access at 0x000022d0 in dgemm_sm_heavy_ldg_nn
========= and Read access at 0x00002630 in dgemm_sm_heavy_ldg_nn [2150 hazards]
========= and Read access at 0x00002648 in dgemm_sm_heavy_ldg_nn [2127 hazards]
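For reference, the racecheck run would differ from the memcheck command sketched above only in the tool selection (same illustrative placeholders):
srun -N 4 -n 4 cuda-memcheck --tool racecheck ./app.x > racecheck.log 2>&1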
Most likely these hazards are false positives, as confirmed by NVIDIA. No error is reported for the code itself, but unfortunately the racecheck jobs are extremely slow and are currently failing before completion, due to what seems to be a Slurm-related problem. So it is possible that something is wrong in the part of the code that has not yet been executed. The error reported by Slurm when racecheck is used is the following:
srun: Terminating job step 489353.6
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid00303: task 3: Killed
srun: error: nid00301: task 1: Killed
srun: error: nid00302: task 2: Killed
and the last messages in the output are:
STATUS=137
Copying over the home directory from root rank
JOB DONE
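The exit status is consistent with the "Killed" messages above: 137 = 128 + 9, i.e. the tasks were terminated by SIGKILL. This can be verified from any shell (a generic illustration, unrelated to this specific job):
$ kill -l $((137 - 128))
KILL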