Closed by alwinm 11 months ago
I generally like option 2 more. I don't see how option 3 works without setting `err` to something.

I would argue that this is an extension of issue #286 and we should address both at once. I added a proposal to that PR on what code we should use.

Also, note that `cudaGetLastError()` and `cudaPeekAtLastError()` aren't the same. The `Get` version resets the error to `cudaSuccess` after it's done and the `Peek` version doesn't. My guess is that we want the former, since we don't actually do any handling of errors and instead just exit. (source)
> I generally like option 2 more. I don't see how option 3 works without setting `err` to something. I would argue that this is an extension of issue #286 and we should address both at once. I added a proposal to that PR on what code we should use.
I do not consider this an extension of issue #286 because `cudaMalloc` is special:
I updated my comment because I got the code block syntax wrong and it left out a line.
I guess I didn't mean that it was a perfect extension, more that we can address both in one fell swoop.
What `CUDA_ERROR_CHECK` should actually do is turn the sync on and off (we'll probably need to rename it then). The rest of the check is so simple and fast, with such a huge potential benefit, that I think we should use it in all builds.

Agreed, I like your approach.
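A hedged sketch of the split described above: the cheap error query runs in every build, and the `CUDA_ERROR_CHECK` define only gates the expensive device synchronization. The function name is hypothetical:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

inline void gpu_error_check() {
#ifdef CUDA_ERROR_CHECK
  // Expensive: forces completion of outstanding kernels so their
  // asynchronous errors become visible. Only enabled in checked builds.
  cudaDeviceSynchronize();
#endif
  // Cheap: always query the sticky error state, in every build.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}
```

Without the sync, asynchronous kernel failures may only surface at a later call, but synchronous API errors (like a failed `cudaMalloc`) are still caught immediately.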
It looks like a good approach will be "leave as is", but re-do `CudaSafeCall` etc. I think re-doing our GPU error checking so that it always actually checks errors will address this without any major changes to how we handle CUDA API calls.
My suggestion: before calling `cudaMalloc` you can call `cudaMemGetInfo` to get the available memory on the device. If the available memory is less than what is going to be allocated by `cudaMalloc`, then the simulation can exit, printing that there is not enough device memory for that configuration and that the user should either run a smaller grid or use more GPUs.
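A minimal sketch of that pre-allocation check, as a standalone helper (the function name is hypothetical; in practice this logic would live wherever the allocation happens):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Query free device memory before allocating; exit with a clear message
// instead of letting cudaMalloc fail silently in unchecked builds.
void *malloc_or_exit(size_t bytes) {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  if (bytes > free_bytes) {
    fprintf(stderr,
            "Not enough device memory: requested %zu bytes, %zu free of %zu "
            "total. Either run a smaller grid or use more GPUs.\n",
            bytes, free_bytes, total_bytes);
    exit(EXIT_FAILURE);
  }
  void *ptr = nullptr;
  cudaMalloc(&ptr, bytes);
  return ptr;
}
```

Note that `cudaMemGetInfo` reports free memory at the moment of the call, so on a shared device this is a best-effort check rather than a guarantee that the subsequent `cudaMalloc` succeeds.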
That would be easy to add to the `DeviceVector` constructor.
I think that this has been resolved by PRs #322 and #350. Can I close this?
Fine by me!
With `-DCUDA_ERROR_CHECK`, all `cudaMalloc` calls that run out of memory return an error code as expected on both NVIDIA and AMD (ppc-n0 and Crusher), and execution is halted.

Without `-DCUDA_ERROR_CHECK`, `cudaMalloc` calls that run out of memory fail silently. On NVIDIA, the run may continue but do nothing on each timestep; on AMD, it produces a memory access fault.

In either case, the returned pointer is NULL, which is an easy way to check for failure.
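The NULL-pointer fallback check could look like the following sketch. The observation that the pointer is left NULL on out-of-memory comes from the testing above on those two machines; it is not a documented guarantee of the CUDA spec, so this is a belt-and-braces check for builds without `-DCUDA_ERROR_CHECK` rather than a replacement for checking the returned error code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Allocation with a NULL-pointer check as a cheap failure detector.
double *dev_buf = nullptr;
cudaMalloc(&dev_buf, n * sizeof(double));
if (dev_buf == nullptr) {
  fprintf(stderr, "cudaMalloc failed, likely out of device memory\n");
  exit(EXIT_FAILURE);
}
```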
I think there are multiple valid options: call `chexit(-1)` upon failure, where the check could be one or multiple of the following: … or … or …