Closed malharjajoo closed 6 years ago
We were having issues that we thought had to do with precision, since the output was mostly right. It turns out I just had a very cheeky bug in my kernel, which meant that some pixels in the output weren't being computed correctly. Now that the bug is fixed, everything works fine. I saw that you posted on the other issue.
The reason why -cl-fp32-correctly-rounded-divide-sqrt seemed to fix our problems was that the bug I mentioned involved a float division. Improving the accuracy of that division made the bug less likely to trigger. All of the operations in Gaussian Blur are double-precision ops, so the flag shouldn't change anything there (note fp32 versus double, which is fp64).
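To get a feel for the fp32 versus fp64 distinction, here is a minimal Python sketch that emulates single-precision division on the CPU by rounding through the struct module. It is only an illustration: the helper names to_f32, div_f32 and div_f64 are made up for this example, and it says nothing about how a particular GPU rounds its divisions.

```python
import struct


def to_f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def div_f32(a, b):
    """Emulate an fp32 division: round the inputs, then round the quotient."""
    return to_f32(to_f32(a) / to_f32(b))


def div_f64(a, b):
    """Plain double-precision division, as Python does natively."""
    return a / b


if __name__ == "__main__":
    # The two results differ in the low digits; a comparison or a rounding
    # step that sits close to a boundary can flip because of that gap.
    print(repr(div_f32(1.0, 3.0)))
    print(repr(div_f64(1.0, 3.0)))
```

If a kernel only ever divides doubles, an fp32-specific flag has nothing to act on, which is consistent with the point above.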
You most probably have a bug in your code, as did I :)
Echoing what @guigzzz suggests, I would look closely at the values that each implementation produces. One suggestion is to dump the pixels before rounding from both the GPU and the CPU, along with the sum of the coefficients. You can do this with printf from inside the kernel.
If you then do the same on the CPU, you can compare the two directly (e.g. visually).
For example, if you printed out:
SOMETHINGUNIQUE, x, y, pixel, coeffSum
for every pixel, you could then grep out the lines you're interested in by searching for SOMETHINGUNIQUE, and convert them to CSV. You can then import that into a pivot table or MATLAB, and do a direct comparison at each point. Subtracting the software coeffs from the hardware coeffs might well tell you something.
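If a spreadsheet or MATLAB is not to hand, the same comparison can be sketched in a few lines of Python. This assumes both programs print lines in exactly the SOMETHINGUNIQUE, x, y, pixel, coeffSum layout above; the function names parse_dump and compare and the tolerance tol are made up for this example.

```python
TAG = "SOMETHINGUNIQUE"  # marker from the thread; any unique string works


def parse_dump(text):
    """Return {(x, y): (pixel, coeff_sum)} from the lines tagged with TAG."""
    rows = {}
    for line in text.splitlines():
        if not line.startswith(TAG):
            continue  # skip unrelated program output
        _, x, y, pixel, coeff_sum = [f.strip() for f in line.split(",")]
        rows[(int(x), int(y))] = (float(pixel), float(coeff_sum))
    return rows


def compare(cpu_text, gpu_text, tol=1e-6):
    """List (x, y, pixel_diff, coeff_diff) where GPU minus CPU exceeds tol."""
    cpu, gpu = parse_dump(cpu_text), parse_dump(gpu_text)
    mismatches = []
    # Only pixels present in both dumps can be compared.
    for key in sorted(cpu.keys() & gpu.keys()):
        pixel_diff = gpu[key][0] - cpu[key][0]
        coeff_diff = gpu[key][1] - cpu[key][1]
        if abs(pixel_diff) > tol or abs(coeff_diff) > tol:
            mismatches.append((key[0], key[1], pixel_diff, coeff_diff))
    return mismatches


if __name__ == "__main__":
    # cpu.log and gpu.log are hypothetical file names holding each program's output.
    with open("cpu.log") as f_cpu, open("gpu.log") as f_gpu:
        for x, y, dp, dc in compare(f_cpu.read(), f_gpu.read()):
            print(f"({x}, {y}): pixel diff {dp:+g}, coeffSum diff {dc:+g}")
```

A coeffSum difference with no matching pixel difference, or the reverse, is a useful hint about which part of the kernel to inspect first.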
If you need to go further, you could dump:
x, y, dx, dy, coeff
though you'll want to use a relatively small image.
Hi,
Thank you everyone. I managed to fix it for now.
Hi,
I am aware that there have been a few issues on this topic previously, but they don't seem to reach a conclusion. I have tried the OpenCL compilation flags mentioned in #35, but they don't seem to improve the results.
I have a sequential implementation (that works perfectly) and thought a similar approach might work on the GPU, but the reference/CPU and GPU results differ by a lot (>> 2).
Am I missing something really obvious?