Not all events cleared for simple saxpy example using level zero backend

karlwessel commented 4 months ago

When I compile a simple CUDA example implementing saxpy, it executes correctly, but at the end I get error messages:

CHIP warning [TID 111264] [1710401408.825743522] : Immediate command lists are not supported on this device. Some tests likely to fail.
Max error: 0.000000
CHIP error [TID 111267] [1710401410.000226461] : CHIPEventMonitorLevel0 stop was called but not all events have been cleared. Timeout of 0 seconds has been reached.
CHIP error [TID 111267] [1710401410.000322247] : Uncollected Backend->Events: 0x5f4c8463bfa0 memCopy AssocCmdList 0x0
CHIP error [TID 111267] [1710401410.000383959] : Uncollected Backend->Events: 0x5f4c8463c090 memCopy AssocCmdList 0x0
CHIP error [TID 111267] [1710401410.000448316] : Uncollected Backend->Events: 0x5f4c84760da0 memCopy AssocCmdList 0x0
CHIP warning [TID 111264] [1710401410.000811568] : CHIPEventLevel0 objects still exist at the time of EventPool destruction
CHIP warning [TID 111264] [1710401410.000994071] : CHIPEventLevel0 objects still exist at the time of EventPool destruction
CHIP warning [TID 111264] [1710401410.001137852] : CHIPEventLevel0 objects still exist at the time of EventPool destruction
CHIP warning [TID 111264] [1710401410.001315387] : Backend->Events still exist at the time of Context destruction...

This is with the latest main branch. I am not sure if the warning at the beginning has something to do with it. Is there something I can do to get rid of these error messages?

The CUDA code I use is:

#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <math.h>
#include <cuda_runtime.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));  // float variants; plain abs() may resolve to the integer version
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

pvelesko commented 4 months ago

I have a PR with a fix for this, but it needs more testing. The error is benign, so you can ignore it and set CHIP_LOGLEVEL=crit to hide the messages.

karlwessel commented 4 months ago

Ok, thank you!

I was a bit annoyed by the few additional seconds the application needs when exiting (waiting for the events I guess). But if that is going to be fixed I can live with it until then, no problem.

pvelesko commented 4 months ago

There is no extra waiting happening right now so that won't get fixed. Which test case are you talking about?

karlwessel commented 4 months ago

Never mind then. There are tons of possible reasons for the lag I noticed on my side of the code. I have to check that first :).
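
A rough way to check where such a lag comes from is to time each phase of the host program separately. The following is only a minimal sketch using plain std::chrono. The helper names `time_phase_ms` and `fake_phase` are hypothetical and not part of chipStar or CUDA, and the sleep stands in for real work such as the cudaMemcpy calls or the kernel launch followed by cudaDeviceSynchronize().

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: runs `phase` once and returns its wall-clock
// duration in milliseconds, measured with a monotonic clock.
double time_phase_ms(const std::function<void()>& phase) {
    const auto start = std::chrono::steady_clock::now();
    phase();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in for a real phase of the program (assumption: in the saxpy
// example this would wrap the copies, the kernel launch, or the frees).
// Here it only sleeps for the requested number of milliseconds.
void fake_phase(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
```

Wrapping each phase in a lambda, for example `time_phase_ms([] { fake_phase(20); })`, and printing the results shows which part dominates. Time spent after main returns, such as runtime teardown, is not covered by this; it would only show up as the difference between the sum of the phases and the wall-clock time of the whole process.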

karlwessel commented 4 months ago

Thank you for the fix!

CHIP-SPV / chipStar, issue #801