"out of memory" on V100

ComputationalRadiationPhysics / cuda_memtest

Fork of CUDA GPU memtest :eyeglasses:

http://sourceforge.net/projects/cudagpumemtest


"out of memory" on V100 #15

Open ax3l opened 6 years ago

ax3l commented 6 years ago

cuda_memtest seems to abort with "out of memory" (line 148 in cuda_memtests.cu) when run in a container (nvidia-docker1 and 2) on V100 GPUs.

The problem might be a general one or just triggered in PIConGPU. Needs investigation. Maybe a device is just being assigned multiple times from mpiInfo...

Occurred with a 4 & 8 GPU PIConGPU lwfa example on a DGX-1.

RenaKunisaki commented 5 years ago

I have the same problem and am not using Docker:

~> ocl_memtest 
hostname is guilmon
CL_PLATFORM_NAME:   NVIDIA CUDA
CL_PLATFORM_VERSION:    OpenCL 1.2 CUDA 10.2.120
                    Device 0 is CL_DEVICE_TYPE_GPU, "GeForce GTX 950"
allocated 340 Mbytes from device 0
[05/17/2019 15:33:40][guilmon][0]:Test0 [Walking 1 bit]
[05/17/2019 15:33:40][guilmon][0]:Test0: global walk test
ERROR: opencl call failed with rc(-5), line 39, file ocl_tests.cpp
Error: Out of resources

(Does that just mean the test failed?)

psychocoderHPC commented 5 years ago

@RenaKunisaki We never tested the OpenCL version of cuda_memtest. Depending on the driver version, OpenCL is not able to allocate 100% of the main GPU memory. Could you rerun your test with cuda_memtest?

ax3l commented 5 years ago

Also take care if your X server is running on the same device.
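Both points above (a driver that will not hand out 100% of device memory, and an X server already holding part of it) mean that a request sized from the reported free memory can fail. A minimal host-side sketch of one way to cope is to back off in fixed steps until an allocation succeeds. This is an illustration only, not a claim about how cuda_memtest or ocl_memtest size their buffers: `find_allocatable_size` and its parameters are hypothetical names, and `try_alloc` is a stand-in for whatever real allocation call is used (for example a wrapper around `cudaMalloc` that frees on success).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper (not part of cuda_memtest): starting from the amount of
// memory reported as free, shrink the request by step_bytes until try_alloc
// accepts it. Returns the accepted size in bytes, or 0 if no request of at
// least min_bytes succeeded.
std::size_t find_allocatable_size(
    std::size_t free_bytes,
    std::size_t step_bytes,
    std::size_t min_bytes,
    const std::function<bool(std::size_t)>& try_alloc) {
    assert(step_bytes > 0 && "step must be positive");

    std::size_t request = free_bytes;
    while (request >= min_bytes) {
        if (try_alloc(request)) {
            return request;  // largest size that worked on this step grid
        }
        if (request < step_bytes) {
            break;  // cannot shrink further without underflow
        }
        request -= step_bytes;
    }
    return 0;
}
```

With a fake allocator that accepts at most 1900 MiB, a reported 2048 MiB free, and a 16 MiB step, the helper settles on 1888 MiB. The step size trades search time against how much usable memory is left untested.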

RenaKunisaki commented 5 years ago

I installed it from the Arch package (AUR) and I don't seem to have a cuda_memtest binary. I will try without X running though.

ax3l commented 5 years ago

Oh, if you are using the AUR package (here?), it will pull the legacy SourceForge version. We haven't seen much activity on that one for years, so we update and fix our own forked CUDA version here.

If you find updates to the OpenCL version we will gladly review and merge pull requests.