cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

What should we do about device memory checking? #296

Closed: alwinm closed this issue 11 months ago

alwinm commented 1 year ago

With -DCUDA_ERROR_CHECK, all cudaMallocs which run out of memory return an error code as expected on both Nvidia and AMD (ppc-n0 and Crusher), and execution is halted.

Without -DCUDA_ERROR_CHECK, cudaMallocs which run out of memory silently fail. On Nvidia, the run may continue but do nothing on each timestep; on AMD, it produces a memory access fault.

In either case, the returned pointer is NULL, which is an easy way to check for failure.

I think there are multiple valid options:

  1. Leave as is. Users should be expected to understand their memory requirements and/or turn on CUDA_ERROR_CHECK when debugging.
  2. Add a util which prints gpu memory usage to help with the above.
  3. Replace cudaMalloc with a macro gpuMalloc which calls cudaMalloc and always does a check, whether or not CUDA_ERROR_CHECK is on, printing LINE/FILE and calling chexit(-1) upon failure (see the sketch after this list). The check could be one or more of the following:
    pointer != NULL

    or

    cudaError_t err = cudaGetLastError();
    err == cudaSuccess

    or

    cudaError_t err = cudaMalloc(&pointer, size);
    err == cudaSuccess
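
For concreteness, a minimal sketch of what the option 3 wrapper could look like (the helper name gpuMallocCheck is made up for this illustration, and exit(-1) stands in for Cholla's chexit(-1) so the snippet is self-contained):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Checked allocation: runs in every build, independent of CUDA_ERROR_CHECK.
    inline void gpuMallocCheck(void** ptr, size_t size, const char* file, int line)
    {
      cudaError_t err = cudaMalloc(ptr, size);
      // As noted above, the returned pointer also comes back NULL on failure,
      // so (*ptr == NULL) would be an equivalent test.
      if (err != cudaSuccess) {
        fprintf(stderr, "gpuMalloc of %zu bytes failed at %s:%d: %s\n",
                size, file, line, cudaGetErrorString(err));
        exit(-1);  // Cholla would call chexit(-1) here
      }
    }

    // Drop-in replacement for cudaMalloc that records the call site.
    #define gpuMalloc(ptr, size) gpuMallocCheck((void**)(ptr), (size), __FILE__, __LINE__)

    // Example usage (hypothetical variable names):
    //   double *dev_array;
    //   gpuMalloc(&dev_array, n_cells * sizeof(double));
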
bcaddy commented 1 year ago

I generally like option 2 more. I don't see how option 3 works without setting err to something.

I would argue that this is an extension of issue #286 and we should address both at once. I added a proposal to that PR on what code we should use

bcaddy commented 1 year ago

Also, note that cudaGetLastError() and cudaPeekAtLastError() aren't the same. The Get version resets the error to cudaSuccess after it's done and the Peek version doesn't. My guess is that we want the former since we don't actually do any handling of errors and instead just exit (per the CUDA Runtime API error handling docs).
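
A quick illustration of that difference (a standalone CUDA snippet, assuming a CUDA-capable device; the dummy kernel exists only for this example):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy() {}

    int main()
    {
      // Deliberately request an invalid launch configuration to set the error state.
      dummy<<<1, 4096>>>();  // 4096 threads per block exceeds the hardware limit

      // Peek leaves the error state untouched, so both calls report the error.
      printf("peek 1: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
      printf("peek 2: %s\n", cudaGetErrorString(cudaPeekAtLastError()));

      // Get reports it once, then resets the state to cudaSuccess.
      printf("get 1:  %s\n", cudaGetErrorString(cudaGetLastError()));
      printf("get 2:  %s\n", cudaGetErrorString(cudaGetLastError()));
      return 0;
    }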

alwinm commented 1 year ago

> I generally like option 2 more. I don't see how option 3 works without setting err to something.
>
> I would argue that this is an extension of issue #286 and we should address both at once. I added a proposal to that PR on what code we should use

I do not consider this an extension of issue #286 because cudaMalloc is special:

alwinm commented 1 year ago

I updated my comment because I got the code block syntax wrong and it left out a line

bcaddy commented 1 year ago

I guess I didn't mean that it was a perfect extension, more that we can address both in one fell swoop.

  1. It does have its own sync, but if we have the sync in the error check turned on then two syncs right in a row aren't an issue. If we don't, then it doesn't matter.
  2. I think my proposed code addresses that and should allow both uses. If not, then a second macro calling the same function would do it just fine.
  3. I agree. I think that the only thing that CUDA_ERROR_CHECK should actually do is turn the sync on and off (we'll probably need to rename it then). The rest of the check is so simple and fast, with such a huge potential benefit, that I think we should use it in all builds; see the sketch after this list.
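
Roughly, that could look like the sketch below (an illustration only, not the proposal attached to #286; cudaSafeCallImpl is a made-up name following the existing CudaSafeCall convention, and exit(-1) stands in for chexit(-1)):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CudaSafeCall(call) cudaSafeCallImpl((call), __FILE__, __LINE__)

    inline void cudaSafeCallImpl(cudaError_t err, const char* file, int line)
    {
    #ifdef CUDA_ERROR_CHECK
      // The flag only adds the expensive part: a sync so asynchronous errors
      // surface at this call site.
      if (err == cudaSuccess) {
        cudaDeviceSynchronize();
        err = cudaGetLastError();
      }
    #endif
      // The cheap test runs in every build.
      if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error at %s:%d: %s\n", file, line, cudaGetErrorString(err));
        exit(-1);  // chexit(-1) in Cholla
      }
    }

    // Example usage: CudaSafeCall(cudaMalloc(&dev_ptr, n_bytes));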
alwinm commented 1 year ago

Agreed, I like your approach.

It looks like a good approach will be "leave as is" (option 1), but re-do CudaSafeCall etc.

bcaddy commented 1 year ago

I think re-doing our GPU error checking so that it always actually checks errors will address this without any major changes to how we handle CUDA API calls.

bvillasen commented 1 year ago

My suggestion: before calling cudaMalloc, call cudaMemGetInfo to get the available memory on the device. If the available memory is less than what cudaMalloc is about to allocate, the simulation can exit with a message that there is not enough device memory for that configuration and that the user should either run a smaller grid or use more GPUs.
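
A minimal sketch of that pre-check, assuming the requested size in bytes is known up front (the function name is made up, and exit(-1) again stands in for chexit(-1)):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    void* mallocWithCapacityCheck(size_t bytes)
    {
      size_t free_bytes = 0, total_bytes = 0;
      cudaMemGetInfo(&free_bytes, &total_bytes);

      if (bytes > free_bytes) {
        fprintf(stderr,
                "Not enough device memory for this configuration: requested %zu bytes "
                "but only %zu of %zu bytes are free. Run a smaller grid or use more GPUs.\n",
                bytes, free_bytes, total_bytes);
        exit(-1);  // chexit(-1) in Cholla
      }

      void* ptr = NULL;
      cudaMalloc(&ptr, bytes);
      return ptr;
    }

Since free memory can be fragmented or change between the query and the allocation, the cudaMalloc return value is still worth checking afterwards.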

bcaddy commented 1 year ago

That would be easy to add to the DeviceVector constructor

bcaddy commented 11 months ago

I think that this has been resolved by PRs #322 and #350. Can I close this?

evaneschneider commented 11 months ago

Fine by me!