fmihpc / vlasiator

Vlasiator - ten letters you can count on
https://www.helsinki.fi/en/researchgroups/vlasiator
Other
48 stars 39 forks source link

Ionosphere solver threading/performance #639

Open ykempf opened 2 years ago

ykempf commented 2 years ago

We have a high iteration count over many thousand ionosphere grid vertices (called nodes in the code), and each iteration involves several summations of values spanning several orders of magnitude. If we run twice the same setup, plain "random order" threaded summation yields differences of order 1e-5 straight away even with our usage of Real being double, and if we run these two instances further we quickly have "macroscopic" differences in values of interest, e.g. the cross polar cap potential (that is percent-level discrepancies). This was hampering debugging significantly, hence my effort to yield reproducible summations/solutions of the ionosphere.

I came up with the scheme that can be followed with this first summation, https://github.com/fmihpc/vlasiator/blob/dev/sysboundary/ionosphere.cpp#L2458-L2503, where each thread stores the values to be summed in a multisetthereby providing sorted summation order, before a #critical section where each thread inserts into a global set, and then a #single summation over the whole, thereby sorted multiset. In summations with both positive and negative numbers, there's two sets and we sum them separately in opposite order before summing the overall positive and negative sums.

This yielded what I was after, at a significant performance penalty. In my test with 9920 ionosphere nodes, running 20000 iterations, it took 2239 seconds over one day to solve the ionosphere 24 times (once every 4 s), i.e. on average 93 s each call. The solver exits if it converges to 1e-6 but it did only once, typical convergence rates being a few times 1e-6 to a few times 1e-5. Arguably this would also be good enough, so we could shorten by relaxing the tolerance. But the largest factor is definitely this cumbersome summation mechanism.

Any discussion/suggestions welcome. Given the cumbersomeness of this algorithm, having a different implementation to offer a cfg option of "good" vs. "bad" summation quality would make this part of the code even less readable I fear... Would summing vectors ordered by node ID be enough, and pöh about summation order/precision loss?

markusbattarbee commented 2 years ago

Oof, that is a lot of insertions into sorted sets, and especially as you finally insert all of them into one big sorted set, it is really expensive twice over. I would recommend one of the following:

1) Each thread gathers an unsorted set (or just a vector), then after it's gathered, the thread sorts it, and sums it together. Then finally the sums of the threads are summed.

2) Use frexp (https://cplusplus.com/reference/cmath/frexp/) to extract the exponent of each value. Use that as an index for an array of iSolverReal (one for each thread), adding the value to the index entry. Sum over threads, re-create actual iSolverReal values, and sum them.