LatticeQCD / SIMULATeQCD

SIMULATeQCD is a multi-GPU Lattice QCD framework that makes it easy for physicists to implement lattice QCD formulas while still providing competitive performance.
https://latticeqcd.github.io/SIMULATeQCD/
MIT License

"Checksum mismatch" bug #119

Closed: haitaoshu closed this issue 1 year ago

haitaoshu commented 1 year ago

When running gradientFlow on a $96^3\times 36$ lattice with 1 GPU on Perlmutter, I got a "Checksum mismatch" error, but with 4 GPUs it runs fine. So the "Checksum mismatch" error should actually be an "out of memory" error.

clarkedavida commented 1 year ago

Note: this is a NERSC configuration generated by RHMC.

haitaoshu commented 1 year ago

The same test on Marconi throws the expected (correct) error:

FATAL: [Rank 0] A GPU error occured: _rawPointer: Failed to allocate (additional) 18.3459 GB of memory on device: out of memory ( cudaErrorMemoryAllocation )

So it's a problem with the machine, not our code.
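For reference, the 18.3459 GB in the Marconi message is consistent with one full copy of the gauge field on a $96^3\times 36$ lattice. The sketch below is only a back-of-the-envelope check under assumptions that are not stated in this thread: 4 links per site, each link stored as an uncompressed 3x3 complex matrix in double precision (18 reals of 8 bytes), no halo sites, and GB meaning $10^9$ bytes. The function name `gauge_field_bytes` and its parameters are hypothetical and are not part of SIMULATeQCD.

```python
def gauge_field_bytes(nx, ny, nz, nt, n_gpus=1,
                      reals_per_link=18, bytes_per_real=8):
    """Rough size in bytes of one gauge field copy per GPU.

    Assumptions (not taken from the thread): 4 links per site,
    uncompressed SU(3) links (18 reals), double precision (8 bytes),
    no halo sites, and an even split of the volume across GPUs.
    """
    sites = nx * ny * nz * nt
    links = 4 * sites
    total = links * reals_per_link * bytes_per_real
    return total // n_gpus


if __name__ == "__main__":
    for gpus in (1, 4):
        size_gb = gauge_field_bytes(96, 96, 96, 36, n_gpus=gpus) / 1e9
        print(f"{gpus} GPU(s): {size_gb:.4f} GB per GPU")
    # 1 GPU(s): 18.3459 GB per GPU
    # 4 GPU(s): 4.5865 GB per GPU
```

Under these assumptions a single GPU has to find room for an additional 18.3 GB field, while with 4 GPUs each device needs roughly a quarter of that plus halos, which would fit the observation that the 4-GPU run goes through.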

clarkedavida commented 1 year ago

Since it's a problem with the machine rather than the code, I will close the issue.