amueller / damascene-python-and-matlab-bindings

Python and MATLAB bindings for the Damascene CUDA implementation of gPb

python bindings #1

Open trsaunders opened 11 years ago

trsaunders commented 11 years ago

Hi Andreas,

with your latest changes I can get the lib to compile on Ubuntu 12.10 / CUDA 5 with no changes other than the makefile paths - great!

but I'm unsure how to use the Python bindings with CMake. Could you provide some instructions, please?

here's what I did:

~ mkdir build
~ cd build
~ cmake ..
~ make shared = 1
~ cd ../bindings
~ LD_LIBRARY_PATH="../build:$LD_LIBRARY_PATH" python example.py
Available 658432000out of 1073283072 bytes on GPU
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMalloc((void**)p_devGrey, imageSize)invalid device ordinal
[1]    5268 abort (core dumped)  LD_LIBRARY_PATH="../build:$LD_LIBRARY_PATH" python example.py

any tips?

thanks!

amueller commented 11 years ago

I'm happy that finally changing to CMake and getting rid of the cutils did help someone :) For some reason it looks like the example used device 3 by default. I guess that was the problem. I just changed that to 0.
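
In case it helps anyone else hitting the same error, here is a minimal sketch of what the fix amounts to (illustrative only - this is not the code in example.py or the bindings, just the runtime calls involved):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("%d CUDA device(s) found\n", deviceCount);

    // Asking for a device index >= deviceCount is exactly what produces
    // the "invalid device ordinal" error quoted above.
    cudaError_t err = cudaSetDevice(0);   // the example defaulted to device 3 before
    if (err != cudaSuccess) {
        printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}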

trsaunders commented 11 years ago

thanks for that - now it gets to the local cues calculation and dies! Something seems to go wrong with CUDA, because it ends up thinking there are 0 out of 0 bytes of memory available. Before I start delving into the code, can you confirm that the way I'm building it is correct?

thanks for the help!

amueller commented 11 years ago

I'm not sure you need the shared=1. It looks good to me. Usually as long as it builds it's fine - getting it to build is usually the hard part. Did you change the device number / check out the latest code?

trsaunders commented 11 years ago

yes, with the latest code it completed the kmeans part but died in parabola.cu. I see there are .64.cu versions of each .cu file - should these be used on a 64-bit system?

here's what happens when I try to run it (I added the final memory-available printout before it crashes):

➜  bindings git:(master) ✗ LD_LIBRARY_PATH="../build:$LD_LIBRARY_PATH" python example.py          
Available 721772544out of 1073283072 bytes on GPU
Convolving
Beginning kmeans
    Changes: 159813
    Changes: 78972
    Changes: 55780
    Changes: 43382
    Changes: 35724
    Changes: 32213
    Changes: 29152
    Changes: 25482
    Changes: 23540
    Changes: 21584
    Changes: 19722
    Changes: 17707
    Changes: 15382
    Changes: 13716
    Changes: 11906
    15 iterations until termination
Kmeans completed
Beginning Local cues computation
Available 0out of 0 bytes on GPU
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMemcpy2DToArray(cuda_parabola_pixels, 0, 0, devPixels, border_width*sizeof(int), border_width*sizeof(int), border_height*norients, cudaMemcpyDeviceToDevice)unspecified launch failure
[1]    7585 abort (core dumped)  LD_LIBRARY_PATH="../build:$LD_LIBRARY_PATH" python example.py
trsaunders commented 11 years ago

It seems that this call in rotate.cu::dispatchGradient_64() is killing CUDA:

cudaMemcpyToSymbol(rectangleOffsets, hostRectangleOffsets, sizeof(int) * 8);

I added some prints:

  cuMemGetInfo(&availableMemory,&totalMemory );
  printf("before MemCpyToSymbol: %d\n", availableMemory);

  CUDA_SAFE_CALL(cudaMemcpyToSymbol(rectangleOffsets, hostRectangleOffsets, sizeof(int) * 8));

  cuMemGetInfo(&availableMemory,&totalMemory );
  printf("after: %d\n", availableMemory);

And this is the output:

before MemCpyToSymbol: 432480256
after: 0

So it seems that cudaMemcpyToSymbol goes wrong somehow and CUDA_SAFE_CALL does not catch it. Probably some nasty CUDA problem. Lovely!

amueller commented 11 years ago

Hm, maybe you don't have enough memory? Or maybe I forgot to call a thread sync to catch the error... Sorry, I currently don't have time to investigate (and the CUDA code isn't mine anyway ;)
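
For what it's worth, kernel launches are asynchronous, so an error from an earlier kernel can get reported at the next API call that touches the device - which would fit the failure surfacing at the cudaMemcpyToSymbol. A rough, self-contained sketch of the kind of checking I mean (not code from this repo):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; any of the damascene kernels could play this role.
__global__ void dummyKernel(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

int main() {
    int *devOut = 0;
    cudaMalloc((void**)&devOut, 32 * sizeof(int));

    dummyKernel<<<1, 32>>>(devOut);

    // The launch returns immediately. Without these two checks, an
    // "unspecified launch failure" only shows up at some later,
    // unrelated call (e.g. a cudaMemcpyToSymbol).
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        printf("kernel failed: %s / %s\n",
               cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
        return 1;
    }

    cudaFree(devOut);
    return 0;
}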

trsaunders commented 11 years ago

I also don't have time to investigate, but unfortunately I need it, so I have no choice! :) It's a strange one: totalMemory also becomes 0, so it's not just the device running out of memory. You're right that it isn't your code causing problems - I'll report back if I find a fix, as it might help someone else.

amueller commented 11 years ago

Thanks :)

trsaunders commented 11 years ago

ahh I love digging into cuda kernels.

Anyway, I found a problem in the kernel rotate.cu::computeGradient

it defines several shared variables:

  __shared__ float aHistogram[nthreads*UNROLL];
  __shared__ float bHistogram[nthreads*UNROLL];
  __shared__ float temp[nthreads*UNROLL];

this is then called:

computeGradient<48, 25, true, false><<<gridDim, blockDim>>>(width, height, width * height, border, rotatedWidth, topNorm, bottomNorm, kernelRadius, kernelLength, devIntegrals, integralImagePitchInInts, devGradientA);

so each of those shared arrays ends up being of size 48*UNROLL = 192

aHistogram and bHistogram seem to be indexed correctly, but temp is not. Take for example this chunk of code from the kernel:

    int bin = threadIdx.y * nthreads + threadIdx.x + kernelRadius;
// ...
    if (threadIdx.x < nthreads - blockDim.x) {
      temp[bin + blockDim.x] = 0;
    }

Running with the image in example.py, I end up with the kernel being called with grid = [321, 1, 1] and block = [32, 4, 1], nthreads = 48, kernelRadius = 6.

the if-condition then becomes if (threadIdx.x < 48 - 32 = 16),

so the maximum threadIdx.x entering that branch is 15. The maximum value of bin is then 3*48 + 15 + 6 = 165, and when the kernel tries to write temp[bin + blockDim.x] = temp[165 + 32] = temp[197], it runs past the end of the 192-element array and crashes.

This bug is in the original damascene code - I'll try to find the correct array size and fix it.
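
To double-check the arithmetic, here it is as a small host-side program (the constants just mirror the launch configuration above; UNROLL = 4 is implied by 48*UNROLL = 192):

#include <cstdio>

int main() {
    const int nthreads = 48, UNROLL = 4;        // temp has 48*4 = 192 elements
    const int blockDimX = 32, blockDimY = 4;    // block = [32, 4, 1]
    const int kernelRadius = 6;

    const int tempSize = nthreads * UNROLL;     // 192

    // Worst case inside the guarded branch:
    //   threadIdx.y = blockDimY - 1, threadIdx.x = nthreads - blockDimX - 1
    const int maxBin = (blockDimY - 1) * nthreads
                     + (nthreads - blockDimX - 1) + kernelRadius;   // 165
    const int maxIndex = maxBin + blockDimX;                        // 197

    printf("temp size = %d, worst-case index written = %d\n", tempSize, maxIndex);
    // 197 > 191, so the write runs off the end; for this configuration temp
    // would need at least blockDimY*nthreads + kernelRadius = 198 elements.
    return 0;
}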

amueller commented 11 years ago

Yeah, figuring out other people's indices is the best!

Maybe try to get in touch with the original authors. Last time I found a problem they responded quite quickly.

trsaunders commented 11 years ago

now it seems to work nicely :)

My card only has 1GB of memory so I'm limited to images of around 0.25MP - have you tried with any larger images on a card with more memory? I'm trying to decide if a 4GB card could handle 720p images...

[Screenshot from 2013-02-25 11:35:01]
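
A crude back-of-the-envelope, assuming memory use scales roughly linearly with pixel count: 720p is 1280*720 ≈ 0.92 MP, i.e. about 3.7x the ~0.25 MP that fits in 1 GB here, so on the order of 3.5-4 GB - under that assumption a 4 GB card looks borderline at best.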

amueller commented 11 years ago

Sorry, haven't tried any larger images.

research2010 commented 11 years ago

Thank you, Mr. Andreas Mueller and Mr. Thomas Saunders. I also met this problem, and as Thomas Saunders has analysed, I just changed __shared__ float temp[nthreads*UNROLL]; into __shared__ float temp[nthreads*UNROLL+nbins]; and then the problem was gone. But I am not sure if it is right.
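
If the 25 in the template instantiation quoted earlier is nbins (an assumption - the thread doesn't say), that change gives temp 48*4 + 25 = 217 elements, comfortably above the worst-case index of 197 worked out above; the tight bound for that launch configuration would be blockDim.y*nthreads + kernelRadius = 198, so the fix looks safe for this configuration even if it is a bit more than strictly needed.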

samarth-robo commented 9 years ago

@trsaunders : I get the same bug as you, do you mind sharing the changes you made to fix it? Also, do you remember the CUDA toolkit/SDK and ACML versions you used to link against the code? Thanks!