HPCE / hpce-2018-cw3


v3 spec makes confusing performance analysis #89


Norbo11 commented 5 years ago

After implementing V3 we reach the "Is it fast?" section which does some initial analysis on the performance.

First we are told to run:

time (bin/make_world | bin/step_world 0.1 1 > /dev/null)
time (bin/make_world | bin/$USER/step_world_v3_opencl 0.1 1 > /dev/null)

On my machine, the OpenCL version is indeed a little slower. This makes sense: we're only taking 1 step, so the realised speedup is not enough to offset the overhead of initialising OpenCL.
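One way to check this (a sketch; the step count of 1000 is an arbitrary choice, assuming the second argument is the step count as in the commands above) is to amortise the fixed setup cost over many steps and see whether the ordering flips:

time (bin/make_world | bin/step_world 0.1 1000 > /dev/null)
time (bin/make_world | bin/$USER/step_world_v3_opencl 0.1 1000 > /dev/null)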

Then we are told that there is an extra overhead due to the formatting of world data. We are told to run:

time (bin/make_world 1000 0.1   0  > /dev/null)   # text format
time (bin/make_world 1000 0.1   1  > /dev/null)   # binary format

And indeed we see that the binary format is, of course, a lot quicker to produce. However, the spec then says that "I would recommend using the binary format when not debugging, as otherwise your improvements in speed will be swamped by conversions." Surely this claim is incorrect. The conversion only happens once, at the start of each run; once our StepWorld function is invoked, no file reading is involved. So all we're really doing is shaving ~1 second off the total, which is significant only for small-ish runs (1 second of conversion is under 2% of a 60-second run, but half of a 2-second run).
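For example (a sketch reusing the /tmp/world.bin file that appears in the spec's later commands; the step count of 100 is arbitrary), the conversion can be taken off the timed path entirely by generating the binary world once up front:

bin/make_world 1000 0.1 1 > /tmp/world.bin                        # pay the conversion cost once
time (cat /tmp/world.bin | bin/step_world 0.1 100 1 > /dev/null)  # time only the stepping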

The more confusing part comes next. We are told to run the following commands:

time (cat /tmp/world.bin | bin/step_world 0.1 0  1 > /dev/null) 
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 0  1 > /dev/null)
time (cat /tmp/world.bin | bin/step_world 0.1 1  1 > /dev/null) 
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 1  1 > /dev/null)

This would allow us to compare the time taken to execute 1 vs 0 steps in both versions of the program, thus computing the "marginal cost of each frame". I disagree with this for the following reasons:

Therefore, how can the spec claim that "the GPU time per frame will be similar to or, more likely, quite a bit slower than the original CPU"? It may be that our OpenCL implementation isn't fully optimised yet (inefficient memory accesses, etc.), but on my machine it is very far from being slower.

jjd06 commented 5 years ago

This should feed nicely into our discussion of critical work next week, particularly Amdahl's and Gustafson's laws.

step_world may only convert the data once, but can that work be parallelised? Even if it can, will you parallelise it? If not, it will become the bottleneck as you accelerate the rest of your program. Further complicating matters is how the text/binary conversion code scales relative to the (parallel) remainder: which will behave better as the problem size increases?
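To make the Amdahl point concrete: if a serial fraction $s$ of the runtime is the conversion and the remaining $1-s$ is sped up by a factor $p$, then

$$\text{speedup} = \frac{1}{s + \frac{1-s}{p}} \le \frac{1}{s}$$

so no matter how large $p$ becomes, the un-parallelised conversion caps the overall speedup at $1/s$.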

For the "marginal cost," the OpenCL setup cost is incurred even when running for zero steps, so the difference between

time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 0 1 > /dev/null)
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 1 1 > /dev/null)

won't include that setup cost. It's a bit surprising that an unoptimised GPU implementation (particularly one with back-and-forth memory transfers per kernel invocation) is that much faster than the CPU version. Are you sure you haven't raced ahead with optimising the GPU version, or run Prime95 while timing the CPU version?
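Concretely (a sketch; the step count of 100 is an arbitrary choice to reduce timing noise), the subtraction looks like:

time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 0   1 > /dev/null)
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 100 1 > /dev/null)
# marginal cost per frame ~ (T_100 - T_0) / 100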

ashleydavies commented 5 years ago

FWIW, my v3 GPU version is also faster without optimisations (17 vs 58 seconds), and my CPU is under low load outside of the application.

jjd06 commented 5 years ago

Interesting... apparently relative speeds/bandwidths have changed quite a bit since the spec was written!