Closed bcaddy closed 7 months ago
Some of my preliminary thoughts on the matter:
I like the version where it accurately tells you which file and line invoked the problem using the line and file macros.
Synchronize is necessary to make it actually useful for kernels, but not necessary for things like cudaMemcpy, which synchronize on their own... but agreed that it should be behind an ifdef. In the grand scheme of things it is not prohibitively expensive, but I do agree that it's usually an unnecessary cost. Having a single version will help ensure that all such calls uniformly toggle synchronize.
In terms of naming we should probably move towards gpu rather than cuda, and CheckError rather than SafeCall (since it is not really guaranteeing safety...? unless I am misunderstanding something).
Currently it looks like gpuErrchk isn't actually used anywhere, and without CUDA_ERROR_CHECK the other two do absolutely nothing, which is problematic. Here's my pseudo-code proposal that I think works for all use cases and will actually check things:
#define gpuCheckError(code) { gpuAssert((code), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code = cudaPeekAtLastError(), const char *file, int line, bool abort=true)
{
    #ifdef CUDA_ERROR_CHECK
        code = cudaDeviceSynchronize();
    #endif
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: failed at %s:%i with code %s\n", file, line, cudaGetErrorString(code));
        if (abort) {
            exit(code);
        }
    }
}
I'm not sure if the argument defaulting will work with the macro; if not, we might need two macros, one for checking kernel launches and one for everything else.
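The two-macro variant mentioned above could look roughly like the sketch below. This is only an illustration under some assumptions: the names gpuCheckKernel and gpuLastKernelError are hypothetical, gpuAssert returns a bool here (the proposal returns void) so the result can be inspected, and the #else branch holds host-only stand-ins for the CUDA runtime so the sketch compiles without the toolkit. It avoids the defaulted first argument entirely, since C++ only allows default arguments on trailing parameters, and it keeps a launch error from cudaPeekAtLastError rather than overwriting it with the synchronize result.

```cpp
#include <cstdio>
#include <cstdlib>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#else
// Host-only stand-ins (hypothetical) so this sketch builds without CUDA.
enum cudaError_t { cudaSuccess = 0, cudaErrorUnknown = 999 };
inline cudaError_t cudaPeekAtLastError() { return cudaSuccess; }
inline cudaError_t cudaDeviceSynchronize() { return cudaSuccess; }
inline const char *cudaGetErrorString(cudaError_t code) {
  return code == cudaSuccess ? "no error" : "unknown error";
}
#endif

// Reports a failure with its call site. Returns true when code is cudaSuccess.
// The thread suggests the project's chexit in place of exit; plain std::exit
// is used here because chexit's signature is not shown.
inline bool gpuAssert(cudaError_t code, const char *file, int line,
                      bool abort = true) {
  if (code != cudaSuccess) {
    std::fprintf(stderr, "GPUassert: failed at %s:%i with code %s\n", file,
                 line, cudaGetErrorString(code));
    if (abort) {
      std::exit(static_cast<int>(code));
    }
    return false;
  }
  return true;
}

// Hypothetical helper for kernel launches: always picks up launch errors,
// and only pays for the device-wide synchronize when CUDA_ERROR_CHECK is set.
inline cudaError_t gpuLastKernelError() {
  cudaError_t code = cudaPeekAtLastError();
#ifdef CUDA_ERROR_CHECK
  if (code == cudaSuccess) {
    code = cudaDeviceSynchronize();  // surfaces errors raised during execution
  }
#endif
  return code;
}

// For API calls that return a cudaError_t, e.g. gpuCheckError(cudaMemcpy(...));
#define gpuCheckError(call) gpuAssert((call), __FILE__, __LINE__)
// For kernel launches, placed right after the <<<...>>> statement.
#define gpuCheckKernel() gpuAssert(gpuLastKernelError(), __FILE__, __LINE__)
```

With this split, an API call is wrapped directly, while a kernel launch is followed by a bare gpuCheckKernel(); on the next line. Only the second macro is affected by the CUDA_ERROR_CHECK toggle.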
I know you said it's pseudocode, but I recommend replacing exit with chexit, and we should think about what we really want to do with the print statement if multiple MPI ranks fail.
I edited the code to add chexit.
My guess is that there will be 2 primary failure modes:
While it's totally possible for every rank to fail only at scale, I don't know if it's terribly likely compared to the other options. So I think we should just have it print the errors. Worst case we have to write a Python script to deal with the output or figure out some more sophisticated logging in the future.
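If every failing rank simply prints, one cheap way to keep the output easy to post-process is to tag each line with its rank. The helper below is a hypothetical sketch, not something from the codebase: formatGpuError is an invented name, and the rank is assumed to be obtained elsewhere (for example from MPI_Comm_rank) and passed in, so the sketch itself has no MPI dependency.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: builds one complete, rank-tagged line per failure so
// the combined stderr of many ranks can be grouped or filtered by a script.
inline std::string formatGpuError(int rank, const char *file, int line,
                                  const char *what) {
  std::ostringstream out;
  out << "[rank " << rank << "] GPUassert: failed at " << file << ":" << line
      << " with code " << what << "\n";
  return out.str();
}

// Emits the whole message in a single call, which makes it less likely
// (though not guaranteed) that lines from different ranks get interleaved.
inline void reportGpuError(int rank, const char *file, int line,
                           const char *what) {
  std::fputs(formatGpuError(rank, file, line, what).c_str(), stderr);
}
```

Filtering on the "[rank N]" prefix would then be enough to separate a single bad rank from a failure that hits every rank.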
Resolved by #350
There are three versions of CUDA error checking in global_cuda.h: gpuErrchk, CudaSafeCall, and CudaCheckError. They all do pretty much the same thing but with small differences, some unclear performance impacts, and different syntax/usage, and they are or are not behind the CUDA_ERROR_CHECK ifdef. I think we should merge these into one, clarify how they should be used, and put only the expensive cudaDeviceSynchronize behind an ifdef. I would like some input on the best way to do this and what potential traps await.