cuda API error returns should always be detected and processed correctly

texadactyl commented 2 years ago

Questions about error handling in rawspec_gpu.cu.

texadactyl commented 2 years ago

E.g. in rawspec_gpu.cu, line 1742:

  // For each output product
  for(i=0; i < ctx->No; i++) {
    // Clear power output buffer
    cuda_rc = cudaMemset(gpu_ctx->d_pwr_out[i], 0,
        abs(ctx->Npolout[i])*ctx->Nb*ctx->Ntpb*ctx->Nc*sizeof(float));
    if(cuda_rc != cudaSuccess) {
      PRINT_ERRMSG(cuda_rc);
      return 0;  <------------------------------------- should return 1 ?
    }
  }

texadactyl commented 2 years ago

In rawspec_gpu.cu, line 1760:

unsigned int rawspec_check_for_completion(rawspec_context * ctx)
{
  int complete = 0;
  cudaError_t rc;
  rawspec_gpu_context * gpu_ctx = (rawspec_gpu_context *)ctx->gpu_ctx;

  rc = cudaStreamQuery(gpu_ctx->compute_stream);
  if(rc == cudaSuccess) {
    complete++;
  } <--------------------------------------- What is rc != cudaSuccess ?

  return complete;
}

texadactyl commented 2 years ago

Not all of the rc != cudaSuccess processing call rawspec_cleanup(ctx) when they detect an error. Should cleanup always be invoked when a cuda error is detected?

david-macmahon commented 2 years ago

In rawspec_gpu.cu, line 1760:

unsigned int rawspec_check_for_completion(rawspec_context * ctx)
{
  int complete = 0;
  cudaError_t rc;
  rawspec_gpu_context * gpu_ctx = (rawspec_gpu_context *)ctx->gpu_ctx;

  rc = cudaStreamQuery(gpu_ctx->compute_stream);
  if(rc == cudaSuccess) {
    complete++;
  } <--------------------------------------- What is rc != cudaSuccess ?

  return complete;
}

A cudaSuccess return value from cudaStreamQuery means the stream has completed. Any other value means the stream has not completed. This function translates a cudaSuccess return value into a C "true" value (i.e. non-zero) and any non-cudaSuccess return value into a C "false" value (i.e. zero). Maybe NVIDIA could have invented yet another constant such as cudaStreamComplete to indicate that the stream has completed, but they opted not to.

david-macmahon commented 2 years ago

The check around line 1742 could be made better for sure. I think the enclosing function, rawspec_reset_integration should return the number of output buffers that were not successfully reset/cleared. So if they all are reset OK, rawspec_reset_integration will return 0 for success. If the clearing of one buffer fails, the return value should be incremented, but the remaining output buffers should still be (attempted to be) cleared. Any non-zero return from rawspec_reset_integration would indicate problems and the larger the value, the more buffers had problems being cleared.

texadactyl commented 2 years ago

Thanks for the line 1760 explanation. Understand.

Line 1742

I added an unnecessary call to rawspec_reset_integration near the end of rawspec.c regarding issue #39.

In rawspec.c, the only other place that rawspec_reset_integration is called is in the rawspec.c context of "Same as previous stem, just reset for new integration". Note that the return code is ignored in rawspec.c. Shouldn't rawspec exit with an error status if there is any sort of GPU failure? My experience is that once hardware or firmware misbehaves, it is best to go through a hard or soft reset.

FWIW, that is a very very long if-then-else starting at line 555 - challenging to read with only 2 characters of indentation!

UCBerkeleySETI / rawspec

cuda API error returns should always be detected and processed correctly #40