TD13/PlaneWaveTurbulence performance

antoniusf commented 3 years ago

Hi,

I've just re-run some benchmarks for various magnetic fields, including TD13/PlaneWaveTurbulence, and something came up that I'd forgotten about before the merge. To make things short, there's a compiler flag you can enable that makes the optimized (SIMD) implementation about 25% faster, by my benchmarks.

To go a bit more into detail, there are certain processor instructions that not all processors support, because they are extensions to the core instruction set that have been added over the past few years. The optimized implementation itself uses some of these extensions (most notably, one called AVX), which allows it to operate on multiple numbers at the same time (this is also known as SIMD). This extension is not optional (for the optimized version) because I have had to manually tell the compiler which of the precise instructions to use, in order to make it work.

However, there is a second instruction set extension (this one is called FMA), which is simple enough that the compiler can apply it automatically. All that FMA does is allow the compiler to merge one multiplication and one addition together into a single instruction that is just as fast (in terms of throughput) as either just the multiply or the add. Doing a quick count, this replacement can be done in nine places in the inner TD13 loop, which is a significant portion of the overall amount of instructions in that loop. In other words, this kind of speedup does make sense.

the data

Benchmarks were obtained using the same method described in our paper (see page 7 in the preprint). In short, we do a simple test simulation (only PropCK with a few ten thousand particles, a trajectory length limit, and constant step size so all trajectories have the same number of steps), time it using the time utility and a best-of-3 system, and then divide the runtime by the total number of steps done to obtain the average amount of time per step. The table below lists these times (all in µs), along with the time needed by the sped-up code relative to the baseline, for different numbers of wavemodes. Note that we would expect the differences to be more pronounced for larger numbers of wavemodes, since for those, the time spent in the interpolation routine increasingly dominates the time spent everywhere else. (Though more involved effects might be at play here, too.)

modes	t_s, no FMA [µs]	t_s, FMA [µs]	relative
5	0.52	0.48	93.1%
10	0.56	0.51	90.7%
20	0.67	0.59	87.6%
40	0.86	0.73	84.4%
120	1.64	1.31	79.5%
360	4.00	3.04	76.1%
1200	12.37	9.17	74.1%
6000	59.86	45.30	75.7%

why not just add the flag by default?

As I said above, the instruction set extensions (like FMA and AVX) aren't supported by all processors. Requiring the use of AVX already restricts the set of processors that can make use of the optimized code, and requiring FMA as well would restrict it a bit further. (If you're on linux, you can check whether your cpu supports these extensions with the lscpu command; it should be listed in the flags section. I'd recommend using grep to make finding the relevant extension easier, like so: lscpu | grep avx to check for AVX support, or with fma for FMA support.)

conclusion

So, I guess my question would be, does it make sense to think about finding a solution for this? Like, maybe an extra configuration option in cmake, or something? On the one hand, it feels weird to leave a 25% speedup on the table if it can be achieved so cheaply. On the other hand, these 25% apply only to the field lookup routine, not other parts of the code. They would likely become less important once interaction modules start taking up time each step. Also, it would be an extra option that users might just not activate at all, or that could otherwise trip them up.

I'd love to hear your thoughts on this, but I know you are all super busy and this probably isn't the most important thing on your plate right now. So, feel free to just ignore this for now and come back later (or not at all). I simply wanted to get this into the system so I wouldn't forget about it.

lukasmerten commented 3 years ago

For me an cmake option sounds like a good way to go. We can set this by default to not use this additional optimization. Of course we should comment on this option in the 3.2 paper as well as in the documentation.

antoniusf commented 3 years ago

Okay, thanks for your feedback. I'll put it on my todo.

lukasmerten commented 3 years ago

I assume this issue is solved by #324. Please reopen if I am mistaken.

antoniusf commented 3 years ago

It is, thank you.

CRPropa / CRPropa3