Open prasannavk opened 9 years ago
Hi Prasanna - Unfortunately, we haven't had time to maintain this code. It worked with CUDA 2.0 and GPUs from 2009, but I have not gotten it running on anything modern. I'm sure there are bugs in the code, but just don't have time to find them. =(
I have seen this bug as well - just don't have a fix for it at this time.
Ok, good to know. Thanks!
Did you try changing the SMVERSIONFLAGS in common.mk to sm_20?
Actually, support for the sm_12 architecture was dropped in CUDA 7.0, so that might be the reason you get the illegal memory access.
I've changed all instances of SMVERSIONFLAGS to sm_35, since I'm running it on a Tesla K20. But I too face the same error:
Using cuda device 1: Tesla K20c
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb
Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 246022144 bytes on GPU
>+< rgbUtoGrayF | 0.729000 | ms
Convolving
Beginning kmeans
Changes: 162604
Changes: 83239
Changes: 53176
Changes: 40912
Changes: 33108
Changes: 25228
Changes: 22345
Changes: 19411
Changes: 1971686769
Changes: -153835340
9 iterations until termination
Kmeans completed
>+< texton | 375.548004 | ms
>+< rgbUtoLab3F | 1.990000 | ms
>+< normalizeLab | 0.015000 | ms
>+< mirrorImage | 1.276000 | ms
Beginning Local cues computation
CUDA error at parabola.cu:58 code=77(cudaErrorIllegalAddress) "cudaMemcpy2DToArray(cuda_parabola_pixels, 0, 0, devPixels, border_width*sizeof(int), border_width*sizeof(int), border_height*norients, cudaMemcpyDeviceToDevice)"
Any suggestions?
I ran cuda-memcheck on the executable; here's the output if anyone else is looking at this. It seems that computeGradient makes invalid writes into __shared__ memory.
Edit: I found the issue. The computeGradient kernel was being launched with insufficient launch parameters: the third kernel launch argument, which specifies the amount of dynamic shared memory to allocate, was missing. We fixed this by passing that argument explicitly. To get it running on the K20c, we had to change lines 633-637 in localcues/rotate.cu.
The original looked like so:
...
computeGradient<48, 25, true, false><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientA);
computeGradient<48, 25, true, true><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientB);
} else {
computeGradient<32, 32, false, false><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientA);
computeGradient<32, 32, false, true><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientB);
...
This is to be changed to:
...
// Internally, computeGradient requires three __shared__ float arrays,
// each of which contains nthreads * UNROLL (48 * 4) elements
size_t sharedMemorySize = 48 * 3 * UNROLL * sizeof(float);
computeGradient<48, 25, true, false><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientA);
computeGradient<48, 25, true, true><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientB);
} else {
// Internally, computeGradient requires three __shared__ float arrays,
// each of which contains nthreads * UNROLL (32 * 4) elements
size_t sharedMemorySize = 32 * 3 * UNROLL * sizeof(float);
computeGradient<32, 32, false, false><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientA);
computeGradient<32, 32, false, true><<<gridDim, blockDim, sharedMemorySize>>>(width, height, width * height, border, rotatedWidth, leftNorm, rightNorm, 0, 0, devIntegrals, integralImagePitchInInts, devGradientB);
...
The constants may be refactored further; they are written out explicitly above for the sake of clarity.
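For background on why the third launch parameter matters: a kernel that declares an unsized `extern __shared__` array receives however many bytes are passed as the third `<<<grid, block, bytes>>>` argument, and if that argument is omitted it defaults to zero, so any write into the array is out of bounds. A minimal sketch of the pattern (an illustrative kernel, not the actual computeGradient code):

```cuda
#include <cstdio>

// A kernel using dynamically sized shared memory: the size of `tile`
// is not known at compile time and comes from the launch configuration.
__global__ void scaleByTwo(const float *in, float *out, int n) {
    extern __shared__ float tile[];  // sized by the 3rd <<<>>> argument
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = 2.0f * tile[threadIdx.x];
    }
}

int main() {
    const int n = 256, threads = 64;
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    size_t sharedBytes = threads * sizeof(float);  // one float per thread
    // Omitting `sharedBytes` here would leave `tile` zero-sized, and the
    // writes into it would be illegal, as with the original
    // computeGradient launch.
    scaleByTwo<<<n / threads, threads, sharedBytes>>>(dIn, dOut, n);
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Note that statically sized `__shared__` arrays (e.g. `__shared__ float tile[64];`) do not need the third argument; only the `extern __shared__` form does, which is why the missing parameter only bites kernels written this way.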
References: [1] https://stackoverflow.com/questions/25500961/using-shared-memory-in-cuda-gives-memory-write-error
Hi,
I am trying to reproduce the gPb-on-GPU results from the paper. The runtime error I encounter while trying to generate the boundary image is attached at the end.
(1) I am running CUDA 7.0. Could that be an issue? Should I roll back to CUDA 6.0? In terms of hardware, I tried it on a Quadro K5000.
(2) The failing line in parabola.cu is:
// copy pixels CUDA_SAFE_CALL( cudaMemcpy2DToArray(cuda_parabola_pixels, 0, 0, devPixels, border_width*sizeof(int), border_width*sizeof(int), border_height*norients, cudaMemcpyDeviceToDevice) );
As I keep removing these operations, similar errors occur further down in the code.
(3) Is there any expectation about the size of the input image that could cause memory to overflow and trigger an illegal access?
(4) I followed the build instructions in the README and added 'm' and 'stdc++' to the linker list in the SConstruct file for Ubuntu 14.04, then ran the scons command. Is there any additional build step?
Any pointers regarding this would be helpful. Thanks!
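Regarding (3): one way to rule out a simple out-of-memory condition is to query free device memory before the large allocations and compare it against a lower bound for the image. A sketch using the standard runtime call cudaMemGetInfo (not code from this repo; the real footprint depends on norients, borders, and intermediate buffers, so this is a lower bound only):

```cuda
#include <cstdio>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // Rough lower bound for a 321 x 481 image of ints; the pipeline
    // allocates far more than this across its intermediate buffers.
    size_t imagePixels = (size_t)321 * 481;
    size_t lowerBound = imagePixels * sizeof(int);

    printf("free: %zu / total: %zu bytes; image lower bound: %zu bytes\n",
           freeBytes, totalBytes, lowerBound);
    return 0;
}
```

For a 321 x 481 input and the ~246 MB reported free in the log above, exhaustion seems unlikely; the shared-memory launch fix described earlier in the thread is the more probable cause of the cudaErrorIllegalAddress.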