hummingtree closed this issue 1 year ago
I am surprised it runs correctly on a single V100. That is a very large grid for a single GPU - there is probably not enough memory to run it all at once, so it is getting split up, and my best guess is that something is going wrong with the free memory calculation for the A100s. What is the global memory size of the A100s and V100s you are running on?
There is a function in "subgrid_routines_3D.cu" that checks for the amount of free memory available on the device and then determines the size of the sub-blocks to split the volume into for cases where the problem size is too big to fit in the GPU global memory. To my knowledge, this has not been tested on an A100, and it's possible it behaves differently than it did on the V100. If the sub-block size is too big, it could cause the kind of error you are seeing.
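To make the idea concrete, here is a minimal host-only sketch of how a free-memory figure could be turned into a number of sub-blocks. It is not the routine from "subgrid_routines_3D.cu": the function name `slabs_needed`, the choice to split only along z, and the `usable_fraction` parameter are all assumptions made for illustration. In a real run, `free_bytes` would come from `cudaMemGetInfo(&free, &total)`. The sketch does its byte counts in 64-bit integers, since 80 GB does not fit in a 32-bit integer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only (not Cholla's actual code).
// Splits an nx * ny * nz grid into equal slabs along z and returns the
// smallest number of slabs such that one slab fits into the usable part
// of the free device memory. Returns 0 if even one z-plane does not fit.
std::uint64_t slabs_needed(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz,
                           std::uint64_t bytes_per_cell,
                           std::uint64_t free_bytes,
                           double usable_fraction)
{
    assert(usable_fraction > 0.0 && usable_fraction <= 1.0);

    // Memory we allow ourselves to use, in bytes (64-bit arithmetic).
    const std::uint64_t budget =
        static_cast<std::uint64_t>(static_cast<double>(free_bytes) * usable_fraction);

    // Bytes needed to hold a single z-plane of the grid.
    const std::uint64_t plane_bytes = nx * ny * bytes_per_cell;
    if (plane_bytes == 0 || plane_bytes > budget) {
        return 0;  // cannot fit even one plane
    }

    // Largest number of z-planes that fit in one slab.
    const std::uint64_t max_planes = budget / plane_bytes;

    // Ceiling division: number of slabs needed to cover all nz planes.
    return (nz + max_planes - 1) / max_planes;
}
```

With these assumptions, a 2048^3 grid at 8 bytes per cell needs 64 GiB, so the sketch returns 1 slab for 80 GiB of free memory and 4 slabs for 16 GiB. Printing the equivalent quantities from the real code on both GPU types would show whether the split is being computed as expected.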
@evaneschneider In our runs A100 has 80 GB of memory and V100 has 16 GB.
Do you mean the function cudaMemGetInfo? I do not think it behaves differently between V100 and A100.
Yeah, that's the function. My guess is that if the code runs properly on 4 or 8 A100s but not on 1 or 2, that probably means that on 4 the grid no longer needs to be split up and everything works fine, but on 1 or 2 it should be split, and perhaps something is going wrong there. Assuming you are running in the main branch, you could try checking this by printing out the "sub grid dimensions" in the file VL_3D_cuda.cu (that is, uncomment line 51).
I marked this with a label so that future users know this might be a useful thread to look at. Closing as "not planned" for now; feel free to re-open if needed.
Compiling and running Cholla (Makefile and input deck attached) on 1 or 2 A100 GPUs gives the following out-of-bounds memory error (error message obtained with compute-sanitizer), and the program exits normally after very few (3) steps. Note that runs on 4 or 8 A100 GPUs, and runs on 1, 2, 4, or 8 V100 GPUs, are good. I do not know whether we are doing something wrong here.
Cholla std output
compute-sanitizer error
Makefile
Input deck