Closed ohannuks closed 6 years ago
It could give a small performance improvement, but I would not expect more than a few percent.
Last time I looked at the caches with CrayPat (not so long ago), the cache metrics were actually pretty good; the L1 hit rate was over 99%. As practice, Otto, you could run Vlasiator under CrayPat collecting the different HW counter groups, and report the results here.
Velocity space is blocked into groups of 64 cells, and we then collect the needed blocks into temporary arrays on which we do a lot of floating point operations. This is quite cache friendly. The less friendly place is updating the target values. That is more random access, but still pretty localized.
In velocity space we do everything in columns and reuse all of the data we load very well. In spatial space there is an inefficiency in memory handling: for each block we fetch its 4 neighbors, but we do not actually try to reuse them, we simply take the next block in the list (which is pretty random). This is where sorting could help the most, since the next block would then be close in space.
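To sketch what that sorting could look like (hypothetical illustration in Python; the real block list lives in Vlasiator's C++ mesh structures, and `morton3d` is not an existing function there), ordering the spatial block list by a Morton key makes consecutive blocks in the loop close in space, so the neighbors fetched for one block have a better chance of still being in cache for the next:

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of three block indices into one Morton key.
    Iterating blocks in key order walks them along a Z-order curve,
    so consecutive blocks are close in space."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)       # x bit -> key bit 3i
        key |= ((y >> i) & 1) << (3 * i + 1)   # y bit -> key bit 3i+1
        key |= ((z >> i) & 1) << (3 * i + 2)   # z bit -> key bit 3i+2
    return key

# Hypothetical block list: (x, y, z) indices in "pretty random" order.
blocks = [(3, 1, 0), (2, 0, 3), (1, 1, 1), (0, 0, 0)]
blocks.sort(key=lambda b: morton3d(*b))   # now spatially coherent
```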
This comes to mind: https://github.com/fmihpc/sfcpp
Especially with AMR on the horizon, the block numbering scheme will anyway not be amenable to this sort of curve. So let's leave this on the "good idea, but no" pile.
Which blocks are you referring to?
Actually, thinking about this more, and assuming this is about the velocity blocks of one cell: I vaguely remember that I tried this back when @sandroos was leading code development. Ordering the velocity blocks in the vector by SFC order instead of the regular order didn't help.
Yes, unfortunately the Vlasov solvers are not memory bandwidth bound.
Out of curiosity mostly, if we are not BW bound, then what are we bound by?
Yeah on current hardware shouldn't there be like > 1000 flop/byte to be CPU bound?
Well, if we had a roofline model expert, one could analyze where the solvers are ;) Probably not too good-looking there, I would assume. There are for sure things that could be improved in the throughput of the solvers, avoiding stalls on various resources etc. There are also a lot of flops.
Not quite, at around 10 flops per byte you can be compute bound. But that is still 80 flops per double. It would be interesting to do roofline analysis for the solvers, I was just wondering if you knew already where they are :)
Currently we sort the blocks in velocity space before acceleration/translation to reduce cache misses.
One of the known ways to reduce cache misses is to use space-filling curves. This is implemented in most modern mesh libraries ( http://www.sci.utah.edu/publications/Dub2014a/Dubey_JPDC2014.pdf ) and there are loads of different algorithms for implementing it: Hilbert with fast sorting, Morton, and Peano-Hilbert at least seem to be widely used.
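For illustration, a minimal 2D Hilbert index-to-coordinate mapping (the standard bit-manipulation algorithm, sketched in Python rather than Vlasiator's C++) shows the locality property these curves provide: consecutive indices always map to adjacent cells.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a 2D Hilbert curve of the given order to (x, y).
    The grid is 2**order cells per side; consecutive d values land on
    neighboring cells, which is what makes the ordering cache friendly."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate/flip the quadrant when needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Every step along the curve moves to a face-adjacent cell.
pts = [hilbert_d2xy(2, d) for d in range(16)]
```

Morton (Z-order) keys are cheaper to compute (pure bit interleaving) but occasionally jump far in space; Hilbert never does, at the cost of the rotation bookkeeping above.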
Whether space-filling curves in velocity space help Vlasiator's performance is still an open question and should be investigated. The first step would be to use performance analysis tools (PAPI, CrayPat, ...) to analyze cache performance and determine whether we have any room for improvement.
If we need to improve the performance, the next step would be to look at different space-filling curves and choose the one that fits our needs (the algorithm should be fast and it should work with unstructured meshes). We could also look for existing implementations.