Multiple versions of cuda error checking with unclear performance impacts

cholla-hydro / cholla

A GPU-based hydro code

https://github.com/cholla-hydro/cholla/wiki

MIT License

60 stars 32 forks source link

Multiple versions of cuda error checking with unclear performance impacts #286

Closed bcaddy closed 7 months ago

bcaddy commented 1 year ago

There's 3 versions of cuda error checking in global_cuda.h, gpuErrchk, CudaSafeCall, and CudaCheckError. They all do pretty much the same thing but with small differences, some unclear performance impacts, different syntax/usage, and are or are not behind the CUDA_ERROR_CHECK ifdef. I think we should merge these into one, clarify how they should be used, and put only the expensive cudaDeviceSynchronize behind an ifdef.

I would like some input on the best way to do this and what potential traps await.

alwinm commented 1 year ago

Some of my preliminary thoughts on the matter:

I like the version where it accurately tells you which file and line invoked the problem using the line and file macros.

Synchronize is necessary to make it actually useful for kernels, but not necessary for things like cudamemcpy which have their own synchronize... but agreed that it should behind an ifdef. In the grand scheme of things it is not prohibitively expensive but I do agree that it's usually an unnecessary cost. Having a single version will help ensure that all such calls uniformly toggle synchronize.

In terms of naming we should probably move towards gpu rather than cuda, and CheckError rather than SafeCall (since it is not really guaranteeing safety...? unless I am misunderstanding something).

bcaddy commented 1 year ago

Currently it looks like gpuErrchk isn't actually used anywhere and without CUDA_ERROR_CHECK the other two do absolutely nothing which is problematic. Here's my psuedo-code proposal that I think works for all use cases and will actually check things

#define gpuCheckError(code) { gpuAssert((code), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code = cudaPeekAtLastError(), const char *file, int line, bool abort=true)
{
  #ifdef CUDA_ERROR_CHECK
  code = cudaDeviceSynchronize();
  #endif

   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: failed at %s:%i with code %s\n", file, line, cudaGetErrorString(code));
      if (abort) {
        exit(code);
      }
   }
}

I'm not sure it the argument defaulting will work with the macro, if not we might need two macros, one for checking kernel launches and one for everything else.

alwinm commented 1 year ago

I know you said its pseudocode, but I recommend replacing exit with chexit, and we should think about what we really want to do with the print statement if multiple MPI ranks fail.

bcaddy commented 1 year ago

I edited the code to add chexit.

My guess is that there will be 2 primary failure modes:

Every rank fails. This is likely a logic bug and should be caught in single GPU or small MPI runs. No reason not to have every rank print
A small handful of ranks fail. This is likely a hardware failure or a sneaky bug that only exists at scale, either way we want it to print so we don't just get a silent exit

While it's totally possible for every rank to fail only at scale, I don't know if it's terribly likely compared to the other options. So I think we should just have it print the errors. Worst case we have to write a python script to deal with the output or figure out some more sophisticated logging in the future.

bcaddy commented 7 months ago

Resolved by #350