bryancatanzaro / damascene

Efficient, High Quality Image Contour Detection

CUDA Error: an illegal memory access was encountered build/parabola.cu 88 #2

Open prasannavk opened 9 years ago

prasannavk commented 9 years ago

Hi,

I am trying to reproduce the results from the paper on gPb with GPU. The runtime error I encounter while trying to generate the boundary image is attached at the end.

(1) I am running CUDA 7.0. Could that be an issue? Should I roll back to CUDA 6.0? In terms of hardware, I tried it on a Quadro K5000.

(2) The failing line in parabola.cu is:

// copy pixels
CUDA_SAFE_CALL( cudaMemcpy2DToArray(cuda_parabola_pixels, 0, 0, devPixels, border_width*sizeof(int), border_width*sizeof(int), border_height*norients, cudaMemcpyDeviceToDevice) );

As I keep removing these operations, similar errors occur further down in the code (a minimal sketch of this copy pattern appears after these questions).

(3) Is there any expectation about the size of the input image? Could an image that is too large overflow memory and cause an illegal access?

(4) I followed the build instructions in the README, added 'm' and 'stdc++' to the linker list in the SConstruct file for Ubuntu 14.04, and then ran the scons command. Is there any additional build step?

Any pointers regarding this would be helpful. Thanks!
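
For context, the failing call has to satisfy the usual cudaMemcpy2DToArray preconditions: the destination array must be at least border_width x (border_height*norients) elements of int, and the source buffer must hold at least spitch * height bytes. Below is a minimal, error-checked sketch of the same copy pattern; the sizes are made up for illustration, not the actual damascene dimensions. Note also that an illegal address raised by an earlier asynchronous kernel launch often only surfaces at the next runtime API call, so the line that reports the error is not necessarily the one at fault.

// Minimal sketch of the cudaMemcpy2DToArray pattern used in parabola.cu,
// with illustrative sizes (not the real damascene dimensions).
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do {                                               \
    cudaError_t err = (call);                                          \
    if (err != cudaSuccess) {                                          \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                cudaGetErrorString(err), __FILE__, __LINE__);          \
        exit(1);                                                       \
    }                                                                  \
} while (0)

int main() {
    const int border_width  = 353;   // illustrative
    const int border_height = 513;   // illustrative
    const int norients      = 8;     // illustrative

    // Destination CUDA array: at least border_width x (border_height*norients) ints.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
    cudaArray_t cuda_parabola_pixels;
    CHECK(cudaMallocArray(&cuda_parabola_pixels, &desc,
                          border_width, border_height * norients));

    // Source buffer: must hold at least spitch * height bytes.
    int* devPixels;
    size_t spitch = border_width * sizeof(int);
    CHECK(cudaMalloc(&devPixels, spitch * border_height * norients));

    // The copy from parabola.cu; the width argument is in bytes, the height in rows.
    CHECK(cudaMemcpy2DToArray(cuda_parabola_pixels, 0, 0, devPixels,
                              spitch, spitch,
                              border_height * norients,
                              cudaMemcpyDeviceToDevice));

    CHECK(cudaFree(devPixels));
    CHECK(cudaFreeArray(cuda_parabola_pixels));
    return 0;
}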

$ ./contour ~/Downloads/3096_roi.ppm 
Using cuda device 0: Quadro K5000
Processing: /home/user/Downloads/3096_roi.ppm, output in /home/user/Downloads/3096_roiPb.pgm and /home/user/Downloads/3096_roi.pb

 Eig 9 Tol 0.001000 Texton 1Image found: 481 x 321 pixels
Available 2765070336 bytes on GPU
>+< rgbUtoGrayF | 0.049504 | ms
Convolving
Beginning kmeans with 15 max iterations
    Changes: 172077
    Changes: 115379
    Changes: 52198
    Changes: 40522
    Changes: 27055
    Changes: 24549
    Changes: 23285
    Changes: 20741
    Changes: 19376
    Changes: 17479
    Changes: 15946
    Changes: 15384
    Changes: 15246
    Changes: 15009
    Changes: 14467
    15 iterations until termination
Kmeans completed
>+< texton | 262.094238 | ms
>+< rgbUtoLab3F | 0.931840 | ms
>+< normalizeLab | 0.058752 | ms
>+< mirrorImage | 0.769760 | ms
Beginning Local cues computation
CUDA Error: an illegal memory access was encountered build/parabola.cu 88
bryancatanzaro commented 9 years ago

Hi Prasanna - Unfortunately, we haven't had time to maintain this code. It worked with CUDA 2.0 and GPUs from 2009, but I have not gotten it running on anything modern. I'm sure there are bugs in the code, but just don't have time to find them. =(

I have seen this bug as well - just don't have a fix for it at this time.

prasannavk commented 9 years ago

Ok, good to know. Thanks!

hyenal commented 8 years ago

Did you try changing the SMVERSIONFLAGS in common.mk to sm_20? Support for the sm_12 architecture was dropped in CUDA 7.0, so that might be the reason you get the illegal memory access.
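
To confirm which sm_XX value the card actually needs before editing common.mk, a standalone check along these lines works (nothing here is damascene-specific; a Quadro K5000 should report 3.0 and a Tesla K20 3.5):

// Quick check of each GPU's compute capability, to pick a matching
// -arch=sm_XX / SMVERSIONFLAGS value.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d -> sm_%d%d\n",
               d, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}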

acherunilam commented 7 years ago

I've changed all instances of SMVERSIONFLAGS to sm_35 since I'm running it on a Tesla K20, but I still hit the same error:

Using cuda device 1: Tesla K20c
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

 Eig 9 Tol 0.001000 Texton 1Image found: 321 x 481 pixels
Available 246022144 bytes on GPU
>+< rgbUtoGrayF | 0.729000 | ms
Convolving
Beginning kmeans
    Changes: 162604
    Changes: 83239
    Changes: 53176
    Changes: 40912
    Changes: 33108
    Changes: 25228
    Changes: 22345
    Changes: 19411
    Changes: 1971686769
    Changes: -153835340
    9 iterations until termination
Kmeans completed
>+< texton | 375.548004 | ms
>+< rgbUtoLab3F | 1.990000 | ms
>+< normalizeLab | 0.015000 | ms
>+< mirrorImage | 1.276000 | ms
Beginning Local cues computation
CUDA error at parabola.cu:58 code=77(cudaErrorIllegalAddress) "cudaMemcpy2DToArray(cuda_parabola_pixels, 0, 0, devPixels, border_width*sizeof(int), border_width*sizeof(int), border_height*norients, cudaMemcpyDeviceToDevice)"

Any suggestions?

envp commented 7 years ago

I ran cuda-memcheck on the executable; here's the output if anyone else is looking at this. It seems that computeGradient makes invalid writes into __shared__ memory.

memerrors.txt

Edit: I found the issue. The computeGradient kernel was being launched with insufficient launch parameters. We fixed this by passing a third kernel launch parameter indicating the amount of shared memory to allocate for the kernel. To get it running on the K20c we had to change lines 633-637 in localcues/rotate.cu.

The original looked like this:

...
    computeGradient<48, 25, true, false><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientA);
    computeGradient<48, 25, true, true><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientB);
  } else {
    computeGradient<32, 32, false, false><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientA);
    computeGradient<32, 32, false, true><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientB);
...

This is to be changed to:

...
    // Internally computeGradient allocates 3 __shared__ float arrays,
    // each of which contains nthreads * UNROLL (48 * 4) elements
    size_t sharedMemorySize = 48 * 3 * UNROLL * sizeof(float);
    computeGradient<48, 25, true, false><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientA);
    computeGradient<48, 25, true, true><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientB);
  } else {
    // Internally computeGradient allocates 3 __shared__ float arrays,
    // each of which contains nthreads * UNROLL (32 * 4) elements
    size_t sharedMemorySize = 32 * 3 * UNROLL * sizeof(float);
    computeGradient<32, 32, false, false><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientA);
    computeGradient<32, 32, false, true><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientB);
...

The values could be factored out into named constants; they are written out explicitly above for the sake of clarity.
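
For anyone unfamiliar with the mechanism behind this fix: a kernel that declares its scratch space as extern __shared__ (which is presumably what computeGradient does) receives zero bytes of dynamic shared memory unless the third launch-configuration argument is supplied, and writes into that storage then land outside any allocation, which is exactly the kind of invalid __shared__ write cuda-memcheck reports. A minimal standalone illustration of the correct launch, not the damascene kernel itself:

// Standalone illustration of dynamic shared memory sizing (not the damascene
// computeGradient kernel). A kernel declaring extern __shared__ storage must be
// launched with the byte count as the third <<<>>> argument.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleWithSharedStaging(const float* in, float* out, int n) {
    extern __shared__ float stage[];                // sized by the launch configuration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    stage[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // invalid write if no size was passed
    __syncthreads();
    if (i < n) out[i] = 2.0f * stage[threadIdx.x];
}

int main() {
    const int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t sharedMemorySize = threads * sizeof(float);
    scaleWithSharedStaging<<<blocks, threads, sharedMemorySize>>>(in, out, n);

    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}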

References: [1] https://stackoverflow.com/questions/25500961/using-shared-memory-in-cuda-gives-memory-write-error