yuchen-w opened this issue 9 years ago
Interestingly, I've also noticed that step_world_v3 runs significantly faster (>10x) on my CPU than the original step_world function. Is this also supposed to happen?
Regarding the differences: what sort of differences are they, and how big? If they are of the order of 10^-7 or so, it could be down to differences in the ordering of the floating-point instructions. A way to check is to put in test-cases with exactly representable inputs and constants (e.g. `make_world 10 0.125 | step_world 0.125 100`; think small binary powers), and check that the output is still exact.
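The reasoning behind exactly representable values can be sketched in a few lines of Python (illustrative only; the coursework itself is shell-driven C++/OpenCL). 0.125 is a small binary power, so arithmetic on it incurs no rounding, whereas a value like 0.1 is rounded at every operation and therefore becomes sensitive to instruction ordering:

```python
# 0.125 = 2^-3 has an exact binary representation; 0.1 does not.
print((0.125).hex())  # 0x1.0000000000000p-3 (exact)
print((0.1).hex())    # 0x1.999999999999ap-4 (already rounded)

# Summing a non-representable value accumulates rounding error,
# so any reordering of the additions can change the result:
print(sum([0.1] * 10) == 1.0)    # False
# Sums of small binary powers stay exact under any ordering:
print(sum([0.125] * 8) == 1.0)   # True
```

With inputs like these, the OpenCL and reference outputs should match bit-for-bit, not just to within a tolerance.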
Regarding the second: it is not guaranteed, but yes, hopefully the software OpenCL provider is faster than the original software. The Intel provider will hopefully be doing some SIMD optimisations, as well as using multiple threads, which could result in a 10x speed-up. It is sometimes possible that the software OpenCL provider is faster than a GPU, especially if the kernel has not been tuned for GPU-friendly operation.
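As a rough illustration of why this kind of kernel parallelises so well, here is a hypothetical four-neighbour diffusion update in plain Python (NOT the coursework's exact step_world kernel; the weighting and boundary handling differ). Every output cell reads only the previous state, so a software OpenCL provider can vectorise along rows with SIMD and split rows across threads:

```python
def step_world_sketch(world, alpha, dt):
    # Hypothetical four-neighbour diffusion step: each output cell is
    # computed from the *previous* state only, so all cells are
    # independent -- the structure SIMD and multi-threading exploit.
    h, w = len(world), len(world[0])
    out = [row[:] for row in world]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (world[y - 1][x] + world[y + 1][x]
                   + world[y][x - 1] + world[y][x + 1]
                   - 4.0 * world[y][x])
            out[y][x] = world[y][x] + alpha * dt * lap
    return out
```

A uniform field is a useful sanity check: the neighbour sum cancels, so the state should be unchanged after a step.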
Just checked it with `make_world 10 0.125 | step_world 0.125 100`.
I'm getting for the original function:
```
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 0.00000000
0.00000000 0.78210109 0.78014606 0.77879262 0.77604353 0.76561636 0.72710687 0.60089213 0.57189912 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.26333833 0.25718218 0.00000000
0.00000000 0.00000019 0.00000149 0.00001212 0.00007913 0.00027144 0.00000000 0.09489445 0.09444368 0.00000000
0.00000000 0.00000004 0.00000017 0.00000000 0.00027144 0.00137941 0.00738860 0.02939555 0.03399836 0.00000000
0.00000000 0.00000001 0.00000002 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
```
my step_world_v5 would give:
```
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 0.00000000
0.00000000 0.77876627 0.77679271 0.77544880 0.77275264 0.76247615 0.72428840 0.59851813 0.56974041 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.26088086 0.25479814 0.00000000
0.00000000 0.00000017 0.00000137 0.00001131 0.00007459 0.00025802 0.00000000 0.09336054 0.09290666 0.00000000
0.00000000 0.00000003 0.00000015 0.00000000 0.00025802 0.00132408 0.00714971 0.02868212 0.03315769 0.00000000
0.00000000 0.00000000 0.00000002 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
```
I seem to be off by one on the red and blue colour channels:
My Mac lets me choose between two devices:

```
Found 2 devices
Device 0 : Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz
Device 1 : Iris
```
When I run it on device 0, my tests pass as expected. On device 1 there are more than 10^-7 units of error.
Is this similar to what you have, @yuchen-w ?
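For what it's worth, the 10^-7 check can be automated with a small comparison script. This is a sketch under the assumption that the worlds are dumped as whitespace-separated text like the outputs pasted above; `compare_worlds` is a made-up helper, not part of the coursework:

```python
def compare_worlds(text_a, text_b, tol=1e-7):
    # Parse two whitespace-separated world dumps and report the largest
    # absolute per-cell difference, plus whether it is within tolerance.
    a = [float(v) for v in text_a.split()]
    b = [float(v) for v in text_b.split()]
    if len(a) != len(b):
        raise ValueError("world sizes differ")
    worst = max(abs(x - y) for x, y in zip(a, b))
    return worst, worst <= tol
```

You could feed it the captured stdout of the reference `step_world` and an OpenCL version run on the same input, once per device.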
@darioml Yes, that is quite similar to what I had for my step_world_v3.
Although once I'd progressed past v3, the error started manifesting on the CPU too. Both the CPU and the GPU would return the same result, though.
I ran the same functions as @yuchen-w:

```
./make_world 10 0.125 | ./step_world_v5_kernel 0.125 100 | ./render_world dump2.bmp
```

and

```
./make_world 10 0.125 | ./step_world 0.125 100 | ./render_world dump1.bmp
```

The differences between the two bitmaps are shown below; some of the pixels in my blue channel are also off by one:
```
res(:,:,1) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

res(:,:,2) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

res(:,:,3) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 1 1 0
0 1 0 1 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
```
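The per-channel diff above can be reproduced without MATLAB. A minimal sketch, assuming the two bitmaps have already been decoded into nested `[row][col][channel]` lists (`channel_diff` is a hypothetical helper, and the BMP decoding itself is left out):

```python
def channel_diff(img_a, img_b, channel):
    # Absolute per-pixel difference for one colour channel of two
    # equally-sized images given as [row][col][channel] nested lists.
    return [[abs(pa[channel] - pb[channel])
             for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]
```

Running it once per channel (0, 1, 2) gives the three matrices shown above, making off-by-one pixels easy to spot.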
If I am not mistaken, CPUs and GPUs have different precisions. Running the functions on the CPU should give pretty good precision, close to what Dr Thomas mentioned. However, I do get larger errors when running on a GPU, whether that is Intel integrated graphics or an NVIDIA device. That said, the differences in values that @yuchen-w got seem too large for that number of steps to be caused by precision errors.
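The single- vs double-precision point can be illustrated without any GPU. The sketch below rounds each intermediate result through IEEE-754 single precision via the `struct` module (Python floats are otherwise 64-bit), showing that the drift between the two grows with the step count, while typically staying far smaller than the errors reported above:

```python
import struct

def to_f32(x):
    # Round a Python (64-bit) float through IEEE-754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

acc32, acc64 = 0.0, 0.0
for _ in range(100):
    acc32 = to_f32(acc32 + to_f32(0.01))  # single-precision accumulation
    acc64 = acc64 + 0.01                  # double-precision accumulation
drift = abs(acc32 - acc64)
print(drift)  # small but nonzero, and it grows with the number of steps
```

A pure precision mismatch between devices would look like this slow drift, not the order-10^-3 differences in the dumps above.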
I've just encountered a weird bug where the output that step_world_v3 generates is mostly, but not 100%, the same as the output generated by the original step_world. This only occurs if I select my NVIDIA card instead of my Intel CPU.
Can anyone else reproduce this in their code?