Check for CUDA errors after running benchmark

alliepiper commented 2 years ago

We recently had an issue where a benchmark kernel caused an illegal memory access, and the error was asynchronously reported in an unrelated NVBench CUDA API call. Any errors emitted during the execution of a benchmark should be correctly reported as originating from the benchmark execution.

We should add NVBENCH_CUDA_CALL(cudaGetLastError()); or similar in the detail/measure_* runners after each kernel execution.

psvvsp commented 3 months ago

Hello! I wanted to fix this issue and inserted NVBENCH_CUDA_CALL(cudaGetLastError()) after benchmark execution. Nothing changed. It seems that cudaGetLastError() doesn't reset the illegal memory access error: https://stackoverflow.com/questions/43659314/how-can-i-reset-the-cuda-error-to-success-with-driver-api-after-a-trap-instructi @alliepiper are there any other ways to reset the error except cudaDeviceReset()?

pauleonix commented 3 months ago

@psvvsp As described in the answer you linked, in the case of a catastrophic error like a illegal access it does not make sense to continue. In such a case the NVBENCH_CUDA_CALL(cudaGetLastError()); should just make sure that the error is reported for the right call to make it easier to fix for the user.

I guess you are trying to just catch the exception thrown by NVBENCH_CUDA_CALL() and continue but continuation is only possible/sensible when the CUDA error is non-sticky. One could argue that for a sticky error nvbench should just exit instead of throwing an exception but that discussion would not belong under this issue.

Benchmarking code that produces these catastrophic errors is a BS in -> BS out situation (i.e. I would not trust the output either way) that can only be fixed by fixing your code.

psvvsp commented 3 months ago

@pauleonix @alliepiper Please review my pull request: https://github.com/NVIDIA/nvbench/pull/173.

pauleonix commented 3 months ago

@psvvsp I'm just an interested bystander, the PR needs to be reviewed by someone who can merge it.

NVIDIA / nvbench

Check for CUDA errors after running benchmark #88