Closed texadactyl closed 2 years ago
E.g. in rawspec_gpu.cu, line 1742:
// For each output product
for(i=0; i < ctx->No; i++) {
// Clear power output buffer
cuda_rc = cudaMemset(gpu_ctx->d_pwr_out[i], 0,
abs(ctx->Npolout[i])*ctx->Nb*ctx->Ntpb*ctx->Nc*sizeof(float));
if(cuda_rc != cudaSuccess) {
PRINT_ERRMSG(cuda_rc);
return 0; <------------------------------------- should return 1 ?
}
}
In rawspec_gpu.cu, line 1760:
unsigned int rawspec_check_for_completion(rawspec_context * ctx)
{
int complete = 0;
cudaError_t rc;
rawspec_gpu_context * gpu_ctx = (rawspec_gpu_context *)ctx->gpu_ctx;
rc = cudaStreamQuery(gpu_ctx->compute_stream);
if(rc == cudaSuccess) {
complete++;
} <--------------------------------------- What is rc != cudaSuccess ?
return complete;
}
Not all of the rc != cudaSuccess processing call rawspec_cleanup(ctx)
when they detect an error.
Should cleanup always be invoked when a cuda error is detected?
In rawspec_gpu.cu, line 1760:
unsigned int rawspec_check_for_completion(rawspec_context * ctx) { int complete = 0; cudaError_t rc; rawspec_gpu_context * gpu_ctx = (rawspec_gpu_context *)ctx->gpu_ctx; rc = cudaStreamQuery(gpu_ctx->compute_stream); if(rc == cudaSuccess) { complete++; } <--------------------------------------- What is rc != cudaSuccess ? return complete; }
A cudaSuccess
return value from cudaStreamQuery
means the stream has completed. Any other value means the stream has not completed. This function translates a cudaSuccess
return value into a C "true" value (i.e. non-zero) and any non-cudaSuccess
return value into a C "false" value (i.e. zero). Maybe NVIDIA could have invented yet another constant such as cudaStreamComplete
to indicate that the stream has completed, but they opted not to.
The check around line 1742 could be made better for sure. I think the enclosing function, rawspec_reset_integration
should return the number of output buffers that were not successfully reset/cleared. So if they all are reset OK, rawspec_reset_integration
will return 0 for success. If the clearing of one buffer fails, the return value should be incremented, but the remaining output buffers should still be (attempted to be) cleared. Any non-zero return from rawspec_reset_integration
would indicate problems and the larger the value, the more buffers had problems being cleared.
Thanks for the line 1760 explanation. Understand.
Line 1742
I added an unnecessary call to rawspec_reset_integration
near the end of rawspec.c regarding issue #39.
In rawspec.c, the only other place that rawspec_reset_integration
is called is in the rawspec.c context of "Same as previous stem, just reset for new integration". Note that the return code is ignored in rawspec.c. Shouldn't rawspec exit with an error status if there is any sort of GPU failure? My experience is that once hardware or firmware misbehaves, it is best to go through a hard or soft reset.
FWIW, that is a very very long if-then-else starting at line 555 - challenging to read with only 2 characters of indentation!
Questions about error handling in rawspec_gpu.cu.