Handle GPUs that lack full NVML Support

ComputationalRadiationPhysics / cuda_memtest

Fork of CUDA GPU memtest

http://sourceforge.net/projects/cudagpumemtest


Handle GPUs that lack full NVML Support #16

Open ax3l opened 5 years ago

ax3l commented 5 years ago

NVIDIA's NVML does not support non-Tesla products very well. Problems are known with mobile cards and even with Quadro cards. (Reported to NVIDIA as an RFE, Bug ID 2417658.)

Anyway, this can lead to cuda_memtest throwing an "[NVML] Error: Not supported" exception (in nvmlDeviceGetSerial), which we should catch.
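A minimal sketch of what "catching" could look like, not the actual cuda_memtest code. It assumes the "Not supported" case surfaces as its own exception type; `NotSupportedError` and `serialOrPlaceholder` are hypothetical names made up for illustration, and the query is passed in as a callable so the sketch does not depend on NVML being installed.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical exception type, standing in for whatever is thrown when
// NVML reports "Not supported" for a query such as nvmlDeviceGetSerial.
struct NotSupportedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs a serial-number query and falls back to a placeholder when the
// device does not support it. Any other exception still propagates.
std::string serialOrPlaceholder(
    const std::function<std::string()>& querySerial,
    const std::string& placeholder = "unknown") {
    try {
        return querySerial();
    } catch (const NotSupportedError&) {
        // Expected on cards without full NVML support; not fatal.
        return placeholder;
    }
}
```

Catching only the "not supported" case keeps real failures visible, while a card such as a mobile GPU would simply report a placeholder serial.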

berceanu commented 5 years ago

Testing on a GTX 950M, I get this while running PIConGPU:

</home/berceanu/src/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-7.3.0/picongpu-0.4.0-lqbxwsudtgms2do4ksm57uovvv4ypx4e/thirdParty/cuda_memtest/misc.cpp>:35

It seems to be just a warning, as the simulation completes after that.

Disabling the memtest makes the message go away:

pic-build -b "cuda:50" -c "-DCUDAMEMTEST_ENABLE=OFF"

Should we add a known issue to the docs for non-Tesla cards?

ax3l commented 5 years ago

Thx for the report! Can you please post the warning? Is there a line missing?

berceanu commented 5 years ago

Nope, there is only that single line.

ax3l commented 5 years ago

Ah ok, but it does not abort, yes!

Ok, we have to clean up that macro; it should not randomly start writing to cerr: https://github.com/ComputationalRadiationPhysics/cuda_memtest/blob/7a585d504831431d0e95ff00d0217181201dbb12/cuda_memtest.h#L146-L150

ax3l commented 5 years ago

I proposed a fix in #18 that should remove that noisy line from your output. It can (rightfully) be ignored.