GeoStat-Framework / GSTools-Core

A Rust implementation of the core algorithms of GSTools.
GNU Lesser General Public License v3.0

RFC: Specialise summator_incompr for two- and three-dimensional problems #8

Closed adamreichold closed 2 years ago

adamreichold commented 2 years ago

Recognising that the inner-most loops often work on short two- or three-dimensional vectors, specialising the algorithm and, more importantly, the memory layout for those cases seems to be beneficial, most likely by improving cache locality (by removing the large stride from the input arrays) and by enabling auto-vectorization (due to the statically known alignment and loop bounds).
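As a purely illustrative sketch of the layout difference (not the actual code of this PR, and assuming the usual `(dim, n)` shape of the input arrays), compare strided access into an ndarray with iteration over fixed-size vectors:

```rust
use ndarray::Array2;

// Generic layout: the components of one vector are `n` elements apart,
// so every inner iteration jumps a whole row of the matrix.
fn sum_vectors_strided(pos: &Array2<f64>) -> [f64; 3] {
    debug_assert_eq!(pos.dim().0, 3);
    let n = pos.dim().1;
    let mut acc = [0.0; 3];
    for i in 0..n {
        for d in 0..3 {
            acc[d] += pos[[d, i]];
        }
    }
    acc
}

// Specialised layout: the three components of one vector are adjacent in
// memory and the loop bound is a compile-time constant, which makes the
// loop easy for the compiler to unroll and auto-vectorize.
fn sum_vectors_short(pos: &[[f64; 3]]) -> [f64; 3] {
    let mut acc = [0.0; 3];
    for v in pos {
        for d in 0..3 {
            acc[d] += v[d];
        }
    }
    acc
}
```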

For a baseline x86-64 build (so up to SSE2), this yields:

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.1s.
field summate incompr   time:   [705.88 ms 709.73 ms 716.60 ms]                                
                        change: [-29.646% -28.890% -27.982%] (p = 0.00 < 0.05)
                        Performance has improved.

whereas using -Ctarget-cpu=native on my Zen+ machine (so up to AVX2) yields

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.4s.
field summate incompr   time:   [740.73 ms 741.09 ms 741.44 ms]                                
                        change: [-28.009% -27.920% -27.852%] (p = 0.00 < 0.05)
                        Performance has improved.

showing one unfortunate issue with auto-vectorization: enabling the wider vector registers makes the compiler work harder to actually fill them, which is often not recouped by the speed-ups of the operations themselves.

What I find interesting to note is that even the additional cost of allocating separate long vectors to hold the samples and positions, as well as to provide the result as an array, is offset by the improvements in processing speed.

Finally, I also tried using the short vector types from various math- and game-oriented crates like vek, glam and nalgebra, which however did not yield any improvements over the bespoke ShortVec<N> type added here, as their manual vectorization efforts are often limited to the f32-based types more relevant to computer graphics. Hence I chose to avoid the additional dependency and made use of having full control over the API of `ShortVec`.
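For reference, a minimal sketch of what such a fixed-dimension vector type can look like with const generics; the names and the exact set of operations are illustrative and need not match the `ShortVec<N>` actually added in this PR:

```rust
/// Minimal, hypothetical sketch of a fixed-dimension vector type;
/// the real ShortVec<N> of this PR may differ in API and trait impls.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ShortVec<const N: usize>([f64; N]);

impl<const N: usize> ShortVec<N> {
    pub fn dot(&self, other: &Self) -> f64 {
        // Statically known loop bound: trivial for the compiler to unroll.
        let mut acc = 0.0;
        for d in 0..N {
            acc += self.0[d] * other.0[d];
        }
        acc
    }

    pub fn scaled(&self, factor: f64) -> Self {
        let mut out = self.0;
        for x in &mut out {
            *x *= factor;
        }
        Self(out)
    }
}

fn main() {
    let e1 = ShortVec::<3>([1.0, 0.0, 0.0]);
    let e2 = ShortVec::<3>([0.0, 1.0, 0.0]);
    assert_eq!(e1.dot(&e2), 0.0);
    assert_eq!(e1.scaled(2.0), ShortVec::<3>([2.0, 0.0, 0.0]));
}
```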

adamreichold commented 2 years ago

Does it make sense to create the fn inner and fn inner_short_vec via a macro? It seems like an obvious candidate, but for the time being, I cannot tell if things get much messier with a macro in this case.

I am not sure I understand: their bodies are different, so I am not sure which repetitive part of the code a macro could reduce? Do you mean the function signatures? If so, I would suggest avoiding that, as function signatures are the most important kind of contract in Rust programs. For example, they are not influenced by type inference, so that all type checking is function-local.

If this technique is applied on a larger scale, it might make more sense to rein in the boilerplate. But for this PR here, I think explicitness is better, cf. https://users.rust-lang.org/t/rust-koans/2408/4

I know of the crate smallvec. Did you compare that one too against your own implementation?

I would say that this is something different: the smallvec crate provides an alternative to Vec that uses the so-called small buffer optimization, i.e. up to some compile-time constant, items are allocated inline, and only when the capacity grows beyond that does the data structure spill to the heap.

So the "vec" in "smallvec" has the meaning of multiple elements laid out contiguously in memory, whereas the "vec" in the ShortVec here is that of a vector space, i.e. relating to the mathematical operations acting on those elements.

Using SmallVec as a backing store for ShortVec would not really help, as the dimension is statically known, but SmallVec is about degrading gracefully when the capacity surpasses the compile-time expectation. Using SmallVec to generally store the vectors in the pos and cov_samples arrays would almost surely not be an improvement when they have to spill, because cache locality would be even worse compared to strided matrix access.
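To make that difference concrete, here is a tiny, hypothetical illustration of the small buffer optimization using the smallvec crate (the inline capacity of 3 is an arbitrary choice for the example):

```rust
use smallvec::{smallvec, SmallVec};

fn main() {
    // Up to 3 elements are stored inline in the SmallVec itself,
    // without any heap allocation.
    let mut v: SmallVec<[f64; 3]> = smallvec![1.0, 2.0, 3.0];
    assert!(!v.spilled());

    // A fourth element exceeds the inline capacity, so the data "spills"
    // to the heap: graceful degradation rather than a fixed dimension.
    v.push(4.0);
    assert!(v.spilled());
}
```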

Is there a reason you prefer .dim().0 over .shape()[0]?

Only stylistic ones, insofar as in matrix-meddling code I prefer the indexing to refer to indexing into matrices, and I therefore like that using tuples to access the individual dimensions within a multi-index separates that action visually. Also, tuples won't decay to slices like arrays do, and they used to be easier to pattern match. Finally, in the one-dimensional case, the tuple indexing is completely unnecessary.
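For concreteness, both accessors on an ndarray Array2 (a hypothetical example, not code from this PR):

```rust
use ndarray::Array2;

fn main() {
    let a: Array2<f64> = Array2::zeros((2, 5));

    // .dim() returns a tuple whose arity matches the dimensionality,
    // so a multi-index is accessed per component and never decays.
    let (rows, cols) = a.dim();
    assert_eq!((rows, cols), (2, 5));

    // .shape() returns a slice of lengths instead.
    assert_eq!(a.shape(), &[2, 5]);
    assert_eq!(a.shape()[0], rows);
}
```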

Apart from that, yeah, another ~25% speed up!!

Indeed, but I have to say that the "RFC" moniker is sincere, i.e. I am not sure the speed up is worth the additional maintenance cost as there are now not one but two implementations of the summator_incompr function.

LSchueler commented 2 years ago

Does it make sense to create the fn inner and fn inner_short_vec via a macro? It seems like an obvious candidate, but for the time being, I cannot tell if things get much messier with a macro in this case.

I am not sure I understand: their bodies are different, so I am not sure which repetitive part of the code a macro could reduce? Do you mean the function signatures? If so, I would suggest avoiding that, as function signatures are the most important kind of contract in Rust programs. For example, they are not influenced by type inference, so that all type checking is function-local.

If this technique is applied on a larger scale, it might make more sense to rein in the boilerplate. But for this PR here, I think explicitness is better, cf. https://users.rust-lang.org/t/rust-koans/2408/4

:-) The koan is nice. No, I actually meant the function bodies. But I am really tired, and I just had a second look at the function bodies and realised that they are indeed pretty different. My bad.

I know of the crate smallvec. Did you compare that one too against your own implementation?

I would say that this is something different: the smallvec crate provides an alternative to Vec that uses the so-called small buffer optimization, i.e. up to some compile-time constant, items are allocated inline, and only when the capacity grows beyond that does the data structure spill to the heap.

So the "vec" in "smallvec" has the meaning of multiple elements laid out contiguously in memory, whereas the "vec" in the ShortVec here is that of a vector space, i.e. relating to the mathematical operations acting on those elements.

Using SmallVec as a backing store for ShortVec would not really help, as the dimension is statically known, but SmallVec is about degrading gracefully when the capacity surpasses the compile-time expectation. Using SmallVec to generally store the vectors in the pos and cov_samples arrays would almost surely not be an improvement when they have to spill, because cache locality would be even worse compared to strided matrix access.

Again, I think I mixed up some things, maybe due to sleep deprivation. I'm sorry for wasting your time!

Is there a reason you prefer .dim().0 over .shape()[0]?

Only stylistic ones, insofar as in matrix-meddling code I prefer the indexing to refer to indexing into matrices, and I therefore like that using tuples to access the individual dimensions within a multi-index separates that action visually. Also, tuples won't decay to slices like arrays do, and they used to be easier to pattern match. Finally, in the one-dimensional case, the tuple indexing is completely unnecessary.

:+1:

Apart from that, yeah, another ~25% speed up!!

Indeed, but I have to say that the "RFC" moniker is sincere, i.e. I am not sure the speed up is worth the additional maintenance cost as there are now not one but two implementations of the summator_incompr function.

At least I got that right ;-) It is a tough one. Your current Rust implementation is already sooo much faster than the Cython implementation, but 25% is a pretty serious improvement still. Another thing which just came to my mind, and should have come to my mind the first time I replied ;-) : we only support 2d and 3d fields for summate_incompr, and I honestly don't see any applications for higher dimensions. So maybe a solution is to keep the ShortVec<N> but throw away the inner function.

I'm going to think about this again, when I can think properly again.

LSchueler commented 2 years ago

Just to get this out of the way: Have you tried transposing cov_samples and/or pos before running the loops?

If so, and your current solution is the fastest, then I vote for keeping ShortVec and your implementation, but dropping the general case for pos.dim().0 != 2 && pos.dim().0 != 3, and also for including this as an assertion (if you are busy, I can add that later). By the way, I'm not even sure if pos.dim().0 > 3 is mathematically well-defined. For pos.dim().0 == 1 it doesn't make any sense.

I think I am not 100% convinced by the name ShortVec, but something like CacheLocalVec sounds very clunky and is also probably not 100% accurate. Maybe you have a better idea? Otherwise, we'll keep it like it is.

In case you like documenting your code, maybe it's a good idea to start the documentation of GSTools-Core with ShortVec, as it is the only module which is not exposed publicly. In case you're not so much into documentation, I'll put it onto my todo list. But as you are the author of the module, I'm going to ask you first ;-)

adamreichold commented 2 years ago

Just to get this out of the way: Have you tried transposing cov_samples and/or pos before running the loops?

Just transposing using reversed_axes does not change the underlying memory accesses, because ndarray will just return the same data with the layout switched from row-major to column-major. But I did also try actually transposing, i.e. collecting the data into a new array with the transposed shape but still in row-major order, and this did not result in significant improvements.
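To spell out the difference between the two kinds of transposing mentioned here (a hypothetical snippet, not the benchmark code):

```rust
use ndarray::array;

fn main() {
    let a = array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];

    // Transposing via a view only swaps the strides; the underlying
    // memory, and hence the access pattern, stays exactly the same.
    let view = a.t();
    assert_eq!(view.shape(), &[3, 2]);
    assert!(!view.is_standard_layout());

    // Actually materialising the transpose copies the data into a new
    // row-major array, i.e. the "collecting into a new array" above.
    let owned = view.as_standard_layout().into_owned();
    assert!(owned.is_standard_layout());
    assert_eq!(owned.shape(), &[3, 2]);
}
```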

In case you like documenting your code

"like" is a strong word, but yes, I will add the documentation for that module. (Just not today.)

If so, and your current solution is the fastest, then I vote for keeping ShortVec and your implementation, but dropping the general case for pos.dim().0 != 2 && pos.dim().0 != 3, and also for including this as an assertion (if you are busy, I can add that later).

I will remove that code path tomorrow after rebasing this and fixing up the documentation.
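A minimal sketch of the kind of assertion suggested above, with hypothetical wording for the message (the actual check in summator_incompr may look different):

```rust
use ndarray::ArrayView2;

// Hypothetical helper: reject dimensions other than 2 and 3 up front.
fn assert_supported_dim(pos: ArrayView2<'_, f64>) {
    let dim = pos.dim().0;
    assert!(
        dim == 2 || dim == 3,
        "incompressible field summation is only implemented for 2 and 3 dimensions, got {}",
        dim
    );
}
```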

I think I am not 100% convinced by the name ShortVec, but something like CacheLocalVec sounds very clunky and is also probably not 100% accurate. Maybe you have a better idea?

ShortVec was the best I could come up with, but I do think naming is important, so I am interested in due process for finding a better name. I think cache locality is only a consequence here though, just as the ability to vectorize is. Both are a consequence of the fixed dimension, which I would therefore suggest emphasising. Maybe VecN or ConstDimVec?

Of course, as N could be arbitrarily large, the vectors do not need to be "short" at all, so that part of the name is questionable. But then again, the claim that dot is faster using this approach compared with transposing the input matrix in memory is probably only true for short vectors; as N grows, I would expect the transposed version to actually be faster. So maybe "short" should stay part of the name in some way or another after all?

(For example, ndarray's dot will check for contiguous layout of both operands for every invocation and then call an eight-fold unrolled loop to do the actual computation which is at least 4 operands more than we actually have.)

adamreichold commented 2 years ago

Removed the generic implementation and added some documentation for the ShortVec type. Its name is still up for debate though...

LSchueler commented 2 years ago

The fact that ShortVec has a fixed dimension becomes clear pretty quickly from the first line of the newly added documentation (thanks for that!). And as its main usage is for small dimensions, maybe ShortVec is a good name after all. If you are fine with it, I would keep the name as it is and merge your PR. In case we come up with a better name, we can still easily replace it, as it is a non-public module.

adamreichold commented 2 years ago

Sounds good to me.