Update CUDA runtime version to 8.0

acherunilam commented 7 years ago

I've migrated all API calls to the new CUDA SDK, and fixed the illegal memory access issue (#2). The program runs successfully on Ubuntu 14.04 with a Tesla K20 GPU. The run time is about 2 seconds when given the default Polynesia image.

For it to work, I had to create two symlinks: a file at ./lib/libblas.so pointing to /usr/lib/libblas.so.3, since BLAS wasn't being detected, and a directory at ./lib/acml pointing to the location of the uncompressed ACML package. I also had to set the dynamic library path so that ACML is detected, by adding export LD_LIBRARY_PATH="$HOME/acml5.3.1/ifort64/lib/:$LD_LIBRARY_PATH" to my bashrc.
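For anyone reproducing this setup, the steps above could be sketched roughly as follows. This is only a sketch, run from the repository root, and it assumes ACML was uncompressed to $HOME/acml5.3.1 (inferred from the library path above; the ACML_DIR variable is just a name chosen for this example). Adjust the paths if your BLAS or ACML live elsewhere.

```shell
# Assumption: ACML was uncompressed here; change this to your own location.
ACML_DIR="$HOME/acml5.3.1"

# Make sure the lib directory exists (harmless if it is already there).
mkdir -p lib

# Point ./lib/libblas.so at the system BLAS, since BLAS was not detected otherwise.
ln -sfn /usr/lib/libblas.so.3 lib/libblas.so

# Point ./lib/acml at the uncompressed ACML package.
ln -sfn "$ACML_DIR" lib/acml

# Let the dynamic loader find ACML in the current shell.
# Add the same line to ~/.bashrc to make it permanent.
export LD_LIBRARY_PATH="$ACML_DIR/ifort64/lib/:${LD_LIBRARY_PATH:-}"
```

The -f and -n flags only make the commands safe to re-run if the links already exist; a plain ln -s works the first time.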

abcherun@gpus5:~/dev/damascene<update_cuda_runtime>$ ./bin/linux/release/damascene damascene/polynesia.ppm
Using cuda device 1: Tesla K20c
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 246022144 bytes on GPU
>+< rgbUtoGrayF | 0.733000 | ms
Convolving
Beginning kmeans
    Changes: 189059
    Changes: 82140
    Changes: 53169
    Changes: 40551
    Changes: 32645
    Changes: 25978
    Changes: 23274
    Changes: 19205
    Changes: -215195205
    8 iterations until termination
Kmeans completed
>+< texton | 330.005005 | ms
>+< rgbUtoLab3F | 1.980000 | ms
>+< normalizeLab | 0.015000 | ms
>+< mirrorImage | 0.988000 | ms
Beginning Local cues computation
>+<     Bgsmooth: | 13.919000 | ms
>+<     Bg: | 89.497002 | ms
>+<     Cgsmooth: | 28.606001 | ms
>+<     Cga: | 101.626999 | ms
>+<     Cgsmooth: | 28.462000 | ms
>+<     Cgb: | 100.380997 | ms
>+<     Tgsmooth: | 26.927000 | ms
>+<     Tg: | 79.556999 | ms
Completed Local cues
localcues time: 0.378023 seconds
>+< localcues | 378.028992 | ms
>+< combine | 1.131000 | ms

Max time: 0.001299 seconds
Oriented Max time: 0.000520 seconds
Solve time: 0.001848 seconds
>+< nonmaxsupression | 3.734000 | ms
Intervening contour completed
>+< intervene | 16.999001 | ms
Available 160432128 bytes on GPU
Can fit 220 iterations on GPU
lanczos iteration: 0
lanczos iteration: 100
lanczos iteration: 200
Screened Eigenvalues:
8.746726e-08 1.220966e-04 2.492128e-04 6.079139e-04 1.203148e-03 1.603140e-03 2.338310e-03 3.313430e-03 4.052723e-03
Eigenvalue: 8 has too large a residual 2.778898e-03
lanczos iteration: 300
lanczos iteration: 400
Screened Eigenvalues:
8.522160e-08 4.413837e-05 1.205699e-04 2.042214e-04 2.480132e-04 3.003473e-04 4.872549e-04 6.106148e-04 7.674522e-04
Eigenvalue: 8 has too large a residual 1.742319e-03
lanczos iteration: 500
lanczos iteration: 600
lanczos iteration: 700
lanczos iteration: 800
lanczos iteration: 900
Screened Eigenvalues:
6.426722e-08 4.418454e-05 1.360216e-04 2.063825e-04 2.494206e-04 2.955018e-04 4.003884e-04 4.659711e-04 5.760831e-04
Converged
nIterations = 1000
lanczos Iterations : 1.226173 seconds
Eigenvector calculation: 194.000000 microseconds
>+< generalizedEigensolve | 1243.982056 | ms
>+< spectralPb | 4.027000 | ms
>+< StartCalcGpb | 1.020000 | ms
Skeletonizing ...
    Iteration = 1, Image changed
    Iteration = 2, Image changed
    Iteration = 3, Image unchanged
CUDA Status : no error
>+< PostProcess | 4.072000 | ms
>+< Computation time: | 1.986825 | seconds

Do remember to squash the commits before merging :)

bryancatanzaro commented 7 years ago

This is awesome, thanks for doing this. Let me look at your work here... BTW - why do you want to squash the commits before merging?

hao-lh commented 7 years ago

Great work. Here are some concerns:

Do we need to upgrade to the newest CUDA 8.0 platform and test on it?
From the results, most of the time is spent in generalizedEigensolve, localcues, and texton. Can we work on these and see whether there is still room for improvement? See Bryan's original paper: Efficient, High-Quality Image Contour Detection.

acherunilam commented 7 years ago

@bryancatanzaro I just thought these changes would be better represented in the log if they were presented as one single "Updated CUDA runtime to 8.0" rather than 16 separate "Updated \<module1>", "Updated \<module2>", etc. Most projects do squash before merging afaik, but it's up to the maintainer of the repo.

EDIT: Changed 7.5 to 8.0
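To illustrate what squashing means here, the following is a minimal sketch in a throwaway repository. The directory name squash-demo, the module file names, and the commit identity are all made up for the example; only the branch name and the final commit message mirror this pull request. It shows one way to do it (git merge --squash), not necessarily how the maintainer would merge.

```shell
# Throwaway repository; "squash-demo" and the file names are hypothetical.
git init -q squash-demo
cd squash-demo
git config user.name "Demo"
git config user.email "demo@example.com"

# One commit on the base branch.
echo "base" > README
git add README
git commit -q -m "Initial commit"
base=$(git rev-parse --abbrev-ref HEAD)

# A feature branch with several small commits, one per module.
git checkout -q -b update_cuda_runtime
for module in module1 module2 module3; do
    echo "updated" > "$module.cu"
    git add "$module.cu"
    git commit -q -m "Updated $module"
done

# Squash: stage the combined changes on the base branch, then commit once.
git checkout -q "$base"
git merge -q --squash update_cuda_runtime
git commit -q -m "Update CUDA runtime version to 8.0"

# Print the number of commits on the base branch after the squash.
git rev-list --count HEAD
cd ..
```

After the squash, the base branch log shows the initial commit plus a single commit carrying all of the module changes, instead of one entry per module.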

acherunilam commented 7 years ago

@hao-lh Correction from my side - this code is compatible with runtime version 8.0, not 7.5. I shall fix the title of the pull request.

As for the scope for improvement, I thought this repo implemented everything that was discussed in "Efficient, High-Quality Image Contour Detection" by Catanzaro et al. Is there any specific optimization that you're referring to?

hao-lh commented 7 years ago

@adithyabenny Most of Bryan's code was written more than five years ago. Since parallel computing and CUDA have been evolving rapidly in recent years, I was wondering whether there are methods that would give better performance. No offense meant to Bryan's original algorithm or your work; I just want this code to run faster :)

pkuCactus commented 6 years ago

Hi @adithyabenny, I used the code you committed and still encounter cudaErrorIllegalAddress. The error message is CUDA error at lanczos.cu:217 code=77(cudaErrorIllegalAddress) "cudaMemcpy(devVector, d_aVectorQQ, nPixels * sizeof(float), cudaMemcpyDeviceToDevice)". Could you help me? I'm using a Titan X on Ubuntu 14.04, and I downloaded ACML 5.3.0. Thanks a lot.

pkuCactus commented 6 years ago

And here is the complete output:

./bin/linux/release/damascene damascene/polynesia.ppm
Using cuda device 2: GeForce GTX TITAN X
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 12672958464 bytes on GPU
>+< rgbUtoGrayF | 0.244000 | ms
Convolving
Beginning kmeans
    Changes: 150860
    Changes: 78580
    Changes: 50898
    Changes: 38726
    Changes: 30185
    Changes: 25232
    Changes: 21250
    Changes: 18425
    Changes: -179543699
    8 iterations until termination
Kmeans completed
>+< texton | 237.464996 | ms
>+< rgbUtoLab3F | 2.259000 | ms
>+< normalizeLab | 0.016000 | ms
>+< mirrorImage | 0.858000 | ms
Beginning Local cues computation
>+<     Bgsmooth: | 7.079000 | ms
>+<     Bg: | 35.658001 | ms
>+<     Cgsmooth: | 18.371000 | ms
>+<     Cga: | 44.307999 | ms
>+<     Cgsmooth: | 18.410000 | ms
>+<     Cgb: | 44.462002 | ms
>+<     Tgsmooth: | 17.982000 | ms
>+<     Tg: | 39.193001 | ms
Completed Local cues
localcues time: 0.178665 seconds
>+< localcues | 178.677994 | ms
>+< combine | 1.499000 | ms

Max time: 0.000406 seconds
Oriented Max time: 0.000509 seconds
Solve time: 0.000933 seconds
>+< nonmaxsupression | 6.005000 | ms
Intervening contour completed
>+< intervene | 7.725000 | ms
Available 12572688384 bytes on GPU
Can fit 18306 iterations on GPU
lanczos iteration: 0
CUDA error at lanczos.cu:217 code=77(cudaErrorIllegalAddress) "cudaMemcpy(devVector, d_aVectorQQ, nPixels * sizeof(float), cudaMemcpyDeviceToDevice)"

bryancatanzaro / damascene

Update CUDA runtime version to 8.0 #4