fmihpc / vlasiator

Vlasiator - ten letters you can count on
https://www.helsinki.fi/en/researchgroups/vlasiator

Investigate space-filling curves #119

Closed ohannuks closed 6 years ago

ohannuks commented 9 years ago

Currently we sort the blocks in velocity space before acceleration/translation to reduce cache misses.

One of the known ways to reduce cache misses is to use space-filling curves. These are implemented in most modern mesh libraries ( http://www.sci.utah.edu/publications/Dub2014a/Dubey_JPDC2014.pdf ) and there are plenty of different algorithms for implementing them: Hilbert with fast sorting, Morton, and Peano-Hilbert at least seem to be widely used.

Whether space-filling curves in velocity space help with performance in Vlasiator is still an open question and should be investigated. The first step would be to use performance analysis tools (PAPI, CrayPat, ...) to analyze cache performance and determine whether we have any room for improvement.

If we need to improve the performance, the next step would be to look at different space-filling curves and choose the one that fits our needs (the algorithm should be fast and it should work with unstructured meshes). We could also look for existing implementations.

galfthan commented 9 years ago

It could give a small performance improvement, but I would not expect more than a few percent.

Last time I looked at the caches with CrayPat (not so long ago), the cache metrics were actually pretty good: the L1 hit rate was over 99%. As practice, Otto, you could run Vlasiator under CrayPat capturing the different HW counter groups, and report the results here.

Velocity space is blocked into groups of 64 cells, and we then collect the needed blocks into temporary arrays on which we do a lot of floating point operations. This is quite cache friendly. The less cache-friendly place is updating the target values; that is more random access, but still pretty localized.

In velocity space we do everything in columns and reuse all of the data we load very well. In spatial space there is an inefficiency in memory handling: for each block we fetch its four neighbors, but we do not actually try to reuse them; we simply move on to the next block in the list (which is in a fairly random order). This is where sorting could help the most, since the next block would then be close in space.

iljah commented 9 years ago

> unstructured meshes). We could also look for existing software implementations.

This comes to mind: https://github.com/fmihpc/sfcpp

ursg commented 6 years ago

Especially with AMR on the horizon, the block numbering scheme will in any case not be amenable to this sort of curve. So let's leave this on the "good idea, but no" pile.

iljah commented 6 years ago

Which blocks are you referring to?

iljah commented 6 years ago

Actually, thinking about this more, and assuming this is about the velocity blocks of one cell, I vaguely remember that I tried this back when @sandroos was leading code development. Ordering the velocity blocks in the vector by SFC order instead of the regular order didn't help.

galfthan commented 6 years ago

Yes, unfortunately Vlasov solvers are not memory bandwidth bound.

tkoskela commented 6 years ago

Out of curiosity mostly, if we are not BW bound, then what are we bound by?

iljah commented 6 years ago

Yeah, on current hardware shouldn't there be like >1000 flop/byte to be CPU bound?

galfthan commented 6 years ago

Well, if we had a roofline model expert, one could analyze where the solvers are ;) Probably not too good-looking there, I would assume. There are for sure things that could be improved in the throughput of the solvers, such as avoiding stalls on various resources. There are also a lot of flops.

tkoskela commented 6 years ago

Not quite: at around 10 flops per byte you can already be compute bound. But that is still 80 flops per double. It would be interesting to do a roofline analysis of the solvers; I was just wondering if you knew already where they are :)