NVlabs / NVBit

NVBit execution error when compiled with CUDA 11.1.105 #37

Closed amrelhelw closed 3 years ago

amrelhelw commented 3 years ago

I tried to compile and run NVBit 1.5.2 / 1.5.1 with CUDA 11.1. Compilation succeeds, but when I launch any program (for example, vectorAdd in test-apps) with the instr_count.so tool, I receive the following error:

Cuda error in function '(vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n))' file 'vectoradd.cu' in line 81 : initialization error.

I am using a Quadro P5000.

ovilla commented 3 years ago

I assume vectorAdd runs without problems without NVBit, right? Which driver version are you using? We are going to try and see if we can reproduce on our side but we are not sure yet when we can get to it. Any other information that can help us reproduce is really appreciated.

amrelhelw commented 3 years ago

Thanks for your quick reply. vectorAdd does not run without NVBit. I think this is related to the CUDA version. I am using NVCC V11.1.105. The same code runs successfully when I compile with NVCC V10.0.130, but then I have to use an older version of NVBit (v1.1).

I copied and pasted vectoradd.cu (from the test-apps folder in NVBit). The error produced is an initialization error at CUDA_SAFECALL. The compile command is:

nvcc -arch=sm_61 -O3 --maxrregcount=2 --ptxas-options -v vectorAdd.cu -o VECADD

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_SAFECALL(call)                                                 \
{                                                                       \
    call;                                                               \
    cudaError err = cudaGetLastError();                                 \
    if (cudaSuccess != err) {                                           \
        fprintf(                                                        \
            stderr,                                                     \
            "Cuda error in function '%s' file '%s' in line %i : %s.\n", \
            #call, __FILE__, __LINE__, cudaGetErrorString(err));        \
        fflush(stderr);                                                 \
        exit(EXIT_FAILURE);                                             \
    }                                                                   \
}

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n) {
    // Get our global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n) c[id] = a[id] + b[id];
}

int main(int argc, char *argv[]) {
    // Size of vectors
    int n = 100000;
    if (argc > 1) n = atoi(argv[1]);

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n * sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double *)malloc(bytes);
    h_b = (double *)malloc(bytes);
    h_c = (double *)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    int i;
    // Initialize vectors on host
    for (i = 0; i < n; i++) {
        h_a[i] = sin(i) * sin(i);
        h_b[i] = cos(i) * cos(i);
        h_c[i] = 0;
    }

    // Copy host vectors to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n / blockSize);

    // Execute the kernel
    CUDA_SAFECALL((vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n)));

    // Copy array back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Sum up vector c and print result divided by n, this should equal 1 within
    // error
    double sum = 0;
    for (i = 0; i < n; i++) sum += h_c[i];
    printf("Final sum = %f; sum/n = %f (should be ~1)\n", sum, sum / n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

ovilla commented 3 years ago

If vectorAdd does not work without NVBit when compiled with CUDA V11.1.105, then there is something wrong with the system.

Also, compiling an NVBit tool requires CUDA >= 10.2, but the target application itself can be compiled with a different CUDA version, as long as it is less than or equal to the version used to compile the NVBit tool.
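The two version constraints in this comment can be written down as a small check. This is only a sketch: `cuda_version` and `nvbit_pairing_ok` are hypothetical helper names, not part of NVBit or the CUDA toolkit, and the integer encoding (1000 * major + 10 * minor, as in the `CUDA_VERSION` macro) is the only thing taken from CUDA itself. It is host-only C++, so it also compiles inside a `.cu` file.

```cpp
#include <cassert>

// CUDA encodes a toolkit version as 1000 * major + 10 * minor,
// so 11.1 becomes 11010 and 10.2 becomes 10020.
inline int cuda_version(int major, int minor) {
    assert(major >= 0 && minor >= 0 && minor < 100);
    return 1000 * major + 10 * minor;
}

// Hypothetical helper (not an NVBit API): true when the tool was built
// with CUDA >= 10.2 and the target application was built with a CUDA
// version no newer than the one used for the tool.
inline bool nvbit_pairing_ok(int tool_cuda, int app_cuda) {
    const int min_tool_cuda = cuda_version(10, 2);
    return tool_cuda >= min_tool_cuda && app_cuda <= tool_cuda;
}
```

Under these assumptions, an application built with 10.0 paired with a tool built with 11.1 passes the check, while a tool built with 10.0 does not.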

For instance, have you tried compiling vectorAdd (or any other target application) with NVCC V10.0.130 and then compiling the NVBit tool (instr_count.so, for instance) with NVCC V11.1.105? That should work in principle.

However, I am afraid something is not right on your system, given that you can't run vectorAdd without NVBit when it is compiled with CUDA V11.1.105. Have you tried CUDA V11.0 or any other V11 version?

amrelhelw commented 3 years ago

I checked my machine and I needed to update the GPU driver so that CUDA V11.1.105 could work properly. Everything is working now. Thanks.
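For anyone hitting the same error, a driver that is too old for the toolkit can usually be spotted by comparing the two integers reported by the CUDA runtime calls `cudaDriverGetVersion` and `cudaRuntimeGetVersion`. The sketch below covers only the comparison, in host-only C++. `CudaVersion`, `decode_cuda_version`, and `driver_covers_runtime` are hypothetical names introduced for illustration. The check is deliberately conservative: CUDA 11 minor-version compatibility can allow a slightly older driver, so treat a failed check as a hint, not a verdict.

```cpp
#include <cassert>

// Major/minor pair decoded from CUDA's integer version encoding.
struct CudaVersion {
    int major;
    int minor;
};

// CUDA reports versions as 1000 * major + 10 * minor (11.1 -> 11010).
inline CudaVersion decode_cuda_version(int encoded) {
    assert(encoded >= 0);
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// Hypothetical conservative check: the newest CUDA version the driver
// supports should be at least the runtime version the app was built with.
// A driver value of 0 means no driver was found, which also fails here.
inline bool driver_covers_runtime(int driver_encoded, int runtime_encoded) {
    return driver_encoded > 0 && driver_encoded >= runtime_encoded;
}

// In a .cu file the two inputs would come from the runtime API:
//   int drv = 0, rt = 0;
//   cudaDriverGetVersion(&drv);   // newest CUDA version the driver supports
//   cudaRuntimeGetVersion(&rt);   // version of the linked CUDA runtime
//   bool ok = driver_covers_runtime(drv, rt);
```

Running `nvidia-smi` and reading the CUDA version in its header gives the same driver-side number without writing any code.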