Closed: alliepiper closed this issue 3 months ago
Hello! I wanted to fix this issue and inserted `NVBENCH_CUDA_CALL(cudaGetLastError())` after benchmark execution, but nothing changed. It seems that `cudaGetLastError()` doesn't reset an illegal-memory-access error: https://stackoverflow.com/questions/43659314/how-can-i-reset-the-cuda-error-to-success-with-driver-api-after-a-trap-instructi
@alliepiper Are there any other ways to reset the error besides `cudaDeviceReset()`?
@psvvsp As described in the answer you linked, it does not make sense to continue after a catastrophic error like an illegal memory access. In such a case, `NVBENCH_CUDA_CALL(cudaGetLastError());` should just make sure that the error is reported for the right call, making it easier for the user to fix.
I guess you are trying to catch the exception thrown by `NVBENCH_CUDA_CALL()` and continue, but continuing is only possible/sensible when the CUDA error is non-sticky. One could argue that for a sticky error nvbench should just exit instead of throwing an exception, but that discussion does not belong under this issue.
Benchmarking code that produces these catastrophic errors is a BS-in, BS-out situation (i.e. I would not trust the output either way) that can only be fixed by fixing your code.
@pauleonix @alliepiper Please review my pull request: https://github.com/NVIDIA/nvbench/pull/173.
@psvvsp I'm just an interested bystander, the PR needs to be reviewed by someone who can merge it.
We recently had an issue where a benchmark kernel caused an illegal memory access, and the error was asynchronously reported in an unrelated NVBench CUDA API call. Any errors emitted during the execution of a benchmark should be correctly reported as originating from the benchmark execution.
We should add `NVBENCH_CUDA_CALL(cudaGetLastError());` or similar in the `detail/measure_*` runners after each kernel execution.