haitaoshu closed this issue 1 year ago
Note: this is a NERSC configuration generated by RHMC.
The same test on Marconi throws the expected (correct) error: "FATAL: [Rank 0] A GPU error occured: _rawPointer: Failed to allocate (additional) 18.3459 GB of memory on device: out of memory ( cudaErrorMemoryAllocation )". So it's a problem with the machine, not our code.
Since it's a problem of the machine rather than the code, I will close the issue.
When running gradientFlow on a $96^3\times 36$ lattice using 1 GPU on Perlmutter, I got a "Checksum mismatch" error, but with 4 GPUs it runs fine. So the "Checksum mismatch" error should actually be an "out of memory" error.
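As a rough sanity check on the out-of-memory reading, the size of one gauge field on this lattice can be estimated by hand. The sketch below is only an estimate under stated assumptions: 4 link directions per site, each link stored as an uncompressed 3x3 complex matrix (18 real numbers) in double precision, no halo sites, and 1 GB taken as 10^9 bytes. The function name `gauge_field_bytes` is hypothetical and not part of the code base. Under these assumptions the result comes out at the 18.3459 GB reported in the Marconi error message, which suggests the failed allocation was one additional full gauge field.

```python
# Rough memory estimate for one gauge field (illustrative sketch).
# Assumptions, not taken from the code base: 4 links per site,
# uncompressed 3x3 complex links (18 reals), double precision, no halos.

def gauge_field_bytes(ns, nt, dirs=4, reals_per_link=18, bytes_per_real=8):
    """Return the size in bytes of one gauge field on an ns^3 x nt lattice."""
    sites = ns**3 * nt
    return sites * dirs * reals_per_link * bytes_per_real

size = gauge_field_bytes(96, 36)
print(f"{size / 1e9:.4f} GB")  # prints "18.3459 GB"
```

Splitting the lattice over 4 GPUs divides the bulk of each field by 4 (plus halo overhead), which would explain why the 4-GPU run fits.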