GeoStat-Framework / GSTools-Core

A Rust implementation of the core algorithms of GSTools.
GNU Lesser General Public License v3.0

What are typical problem sizes? #6

Closed adamreichold closed 2 years ago

adamreichold commented 2 years ago

Considering my experience benchmarking the summator and krige functions so far and finding two significantly different problem sizes in the benchmark input generation script, I wonder what typical problem sizes in the real world are?

This is important insofar as the best strategy for parallelizing the algorithms depends on the problem size. For example, using the larger of the two problem sizes, not parallelizing the second loop level in the summator_incompr function and parallelizing it in the krige functions both improve performance, as indicated in the commit messages.

Using larger problem sizes does increase the benchmark runtime significantly, but using Criterion's --save-baseline and --baseline features, as well as reducing the --sample-size during iterative development, would make this feasible IMHO, provided the larger problems are actually the more realistic ones.
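For iterative development, the sample count can also be pinned in the benchmark definition itself rather than on the command line; a minimal sketch (the benchmark name matches the output below, but the body is a stand-in, not the actual gstools-core call):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_summate(c: &mut Criterion) {
    // Stand-in body; the real benchmark would call into gstools-core here.
    c.bench_function("field summate incompr", |b| b.iter(|| ()));
}

criterion_group! {
    name = benches;
    // Fewer samples than Criterion's default of 100 keep large inputs feasible;
    // the same effect is available ad hoc via --sample-size, and
    // --save-baseline/--baseline preserve results between runs.
    config = Criterion::default().sample_size(10);
    targets = bench_summate
}
criterion_main!(benches);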

Further, if problem size is expected to vary significantly, it might be preferable to always limit parallelization to the outermost loops as a conservative choice that should always yield acceptable performance.
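As an illustration of that conservative strategy, a sketch with hypothetical names (plain slices instead of the actual ndarray-based signatures): only the outermost loop becomes a parallel iterator, the inner one stays serial.

use rayon::prelude::*;

// Only the outermost loop is parallelized; the inner loop over modes stays
// serial, so the number of tasks scales with the larger problem dimension
// while per-task overhead stays fixed.
fn summate_outer_only(positions: &[[f64; 3]], modes: &[[f64; 3]]) -> Vec<f64> {
    positions
        .par_iter()
        .map(|pos| {
            modes
                .iter()
                .map(|mode| {
                    // Toy per-mode contribution standing in for the real kernel.
                    let phase: f64 = pos.iter().zip(mode).map(|(p, m)| p * m).sum();
                    phase.cos()
                })
                .sum()
        })
        .collect()
}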

adamreichold commented 2 years ago

As a more literary thought: It strikes me that a memory-safe, and thereby concurrency-safe, language lets one spend mental energy on interesting problems like "how much parallelism is actually beneficial for my typical problem sizes?" instead of investing the same energy into merely achieving parallelism without mistakes.

LSchueler commented 2 years ago

As a more literary thought: It strikes me that a memory-safe, and thereby concurrency-safe, language lets one spend mental energy on interesting problems like "how much parallelism is actually beneficial for my typical problem sizes?" instead of investing the same energy into merely achieving parallelism without mistakes.

That's a nice thought! Although, in a perfect world, I think there shouldn't be any performance penalties for inserting Rayon parallel iterators, as they actually decide how to divide the data.

LSchueler commented 2 years ago

I just stumbled upon a small project where I actually did some Kriging. It had 103 data points (len of cond) and I did the interpolation on a 2d mesh with 500 by 500 grid cells. But Sebastian will soon also join the discussion.

adamreichold commented 2 years ago

Although, in a perfect world, I think there shouldn't be any performance penalties for inserting Rayon parallel iterators, as they actually decide how to divide the data.

Well they do, but some amount of overhead is often unavoidable: even if Rayon's heuristic decides not to parallelize at all, just making that decision has a runtime cost. Furthermore, parallel algorithms often need intermediate buffers to avoid synchronizing writes, which is an overhead paid even by effectively single-threaded instantiations.

Another thing to note is that while Rayon does make the decision, it does not know the workload, i.e. whether it is beneficial to put every item into a separate task or only batches of 1000. For indexed parallel iterators this can be controlled via with_min_len, but ndarray's Zip seems to be limited to geometric splits (i.e. it has a split but no split_at method like ArrayBase and friends). Not sure yet how this can be worked around...
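For comparison, this is what with_min_len looks like on a plain indexed parallel iterator (slices rather than ndarray types, since Zip does not expose this):

use rayon::prelude::*;

fn sum_of_squares(data: &[f64]) -> f64 {
    data.par_iter()
        // Splitting stops once a task would hold fewer than 1000 elements,
        // so cheap per-item work is processed in batches.
        .with_min_len(1000)
        .map(|x| x * x)
        .sum()
}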

adamreichold commented 2 years ago

Not sure yet how this can be worked around...

So instead of doing nothing like I am supposed to at the moment, I banged my head against this a few times. I was able to further improve the performance of the summator_incompr function by 25% by switching the loop nesting so that the k_2 computation is not redundantly repeated, which is the last commit on this branch. However, while the slowdown is not as bad as before, parallelizing this implementation is still slower than the serial version.
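Schematically, the nesting switch amounts to the following (scalar stand-ins for the ndarray views, not the actual implementation): k_2 depends only on the mode, so making the mode loop the outer one computes it once per mode instead of once per (mode, position) pair.

fn summed(modes: &[f64], positions: &[f64]) -> Vec<f64> {
    let mut sums = vec![0.0; positions.len()];
    for &mode in modes {
        // Loop-invariant with respect to the positions: computed once per mode.
        let k_2 = 1.0 / mode;
        for (sum, &pos) in sums.iter_mut().zip(positions) {
            *sum += (mode * pos).cos() * k_2;
        }
    }
    sums
}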

I think the main culprit here is https://github.com/rust-ndarray/ndarray/blob/566177b2e4d4556f841eebe78cafdfb1632235bf/src/zip/mod.rs#L878 which is used to determine whether a Zip can be further split. This leads to a large number of intermediate temporary arrays being allocated: e.g. with 1_000 modes, on average 500 temporaries are allocated on my 8-hardware-thread CPU. (This was significantly worse before switching the nesting order, yielding millions of temporaries and hence allocations, even though each one was just a 3x1 array instead of a 3x100_000 array.)
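The number of temporaries can be observed directly; a small self-contained sketch (independent of ndarray) that counts how often a parallel fold invokes its identity closure, which is exactly where the temporary accumulator would be allocated:

use rayon::prelude::*;
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let accumulators = AtomicUsize::new(0);

    let sum: u64 = (0..1_000u64)
        .into_par_iter()
        .fold(
            || {
                // One fresh accumulator per split; with ndarray's Zip this
                // would be a freshly allocated temporary array.
                accumulators.fetch_add(1, Ordering::Relaxed);
                0u64
            },
            |acc, i| acc + i,
        )
        .sum::<u64>();

    println!("sum = {sum}, accumulators created: {}", accumulators.load(Ordering::Relaxed));
}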

If I use the slightly contrived version

cov_samples
    .axis_iter(Axis(1))
    .into_par_iter()
    .zip(z1.axis_iter(Axis(0)))
    .zip(z2.axis_iter(Axis(0)))
    .with_min_len(125)
...

to get an IndexedParallelIterator and thereby access to with_min_len, and set the optimal block size of 1_000 / 8 = 125 for my system, I do get measurable speed-ups

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.0s.
field summate incompr   time:   [987.01 ms 990.54 ms 994.57 ms]                                
                        change: [-8.0567% -7.6046% -7.1657%] (p = 0.00 < 0.05)
                        Performance has improved.

but I feel both the implementation and the fine-tuning are wrong. I am not sure how to best fix this, whether in Rayon (should it be possible to limit the number of jobs a Fold creates?) or in ndarray (should NdProducer: IntoIterator and thereby Zip: IndexedParallelIterator?).
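One way to avoid hard-coding the 125 might be to derive the block size from the observed thread count, e.g. (a sketch of the idea, not what the crate currently does):

use rayon::current_num_threads;

// Aim for roughly one batch per worker thread: with 1_000 modes on an
// 8-thread machine this reproduces the 125 used above.
fn min_len_for(mode_no: usize) -> usize {
    (mode_no / current_num_threads().max(1)).max(1)
}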

adamreichold commented 2 years ago

I am not sure how to best fix this

https://github.com/rayon-rs/rayon/pull/857 improves things considerably, and I opened https://github.com/rust-ndarray/ndarray/pull/1081 which would allow limiting the splitting depending on workload characteristics.

Using only the patch to ndarray and the following parallel implementation

Zip::from(cov_samples.columns())
    .and(z1)
    .and(z2)
    .into_par_iter()
    // Batch at least 100 modes per task instead of splitting all the way down.
    .with_min_len(100)
    .fold(
        // Each task starts from its own zero-initialized field, avoiding
        // synchronized writes.
        || Array2::<f64>::zeros(pos.dim()),
        |mut summed_modes, (cov_samples, z1, z2)| {
            // k_2 depends only on the mode, so it is computed once per mode.
            let k_2 = cov_samples[0] / cov_samples.dot(&cov_samples);

            Zip::from(pos.columns())
                .and(summed_modes.columns_mut())
                .par_for_each(|pos, mut summed_modes| {
                    let phase = cov_samples.dot(&pos);
                    let z12 = z1 * phase.cos() + z2 * phase.sin();

                    Zip::from(&mut summed_modes)
                        .and(&e1)
                        .and(cov_samples)
                        .for_each(|sum, e1, cs| {
                            *sum += (*e1 - cs * k_2) * z12;
                        });
                });

            summed_modes
        },
    )
    // Sum the per-task fields into the final result.
    .reduce_with(|mut lhs, rhs| {
        lhs += &rhs;
        lhs
    })
    .unwrap()

I get small but consistent speed-ups, e.g.

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.1s.
field summate incompr   time:   [1.0061 s 1.0096 s 1.0120 s]                                   
                        change: [-6.3574% -5.9467% -5.7063%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild

LSchueler commented 2 years ago

Those are some really cool insights into Rayon! Somehow summator_incompr is a tough nut, right?

adamreichold commented 2 years ago

Somehow summator_incompr is a tough nut, right?

Actually, I think this is more of an ecosystem "issue": Rayon makes it almost trivial to parallelize algorithms once libraries like ndarray provide the necessary plumbing for their data types. Most of the time this is more than enough to achieve useful speed-ups, but remembering how the same algorithms would be handled using e.g. manual threading or even OpenMP, it would be rather surprising if some tuning were not beneficial, or sometimes even required, to achieve speed-ups. Rayon is often described as "magic" by developers first making use of it. Alas, there is no such thing, and when the defaults do not work out, one has to look deeper into the stack.

Since I do not know how long it will take to land https://github.com/rust-ndarray/ndarray/pull/1081 (if it lands at all), I think I will commit the "iterate along a single axis" trick as a workaround for now. I think we should be in a position to merge this afterwards? The problems you looked up are as large as or larger than the benchmarks now, and all algorithms (except variograms, which are not yet index-free) would benefit from parallelisation.

adamreichold commented 2 years ago

Just for old times' sake:

I just checked a publication of mine, where I used a very old C++ implementation of the summator_incompr function. I used a 2d grid of 4600 x 1800 ~ 8*10^6 cells, which I would consider a larger problem size (for a 2d problem). I also actually used 6400 modes (len of z1, z2), which is probably a bit much (nowadays I always use exactly 1000 modes). Running one calculation took 48 minutes on JURECA.

diff --git a/benches/gen_benchmark_inputs.py b/benches/gen_benchmark_inputs.py
index 142581d..8485d5f 100755
--- a/benches/gen_benchmark_inputs.py
+++ b/benches/gen_benchmark_inputs.py
@@ -8,8 +8,8 @@ from gstools.field.generator import RandMeth, IncomprRandMeth

 def gen_field_summate(path, seed):
-    pos_no = 100_000
-    mode_no = 1_000
+    pos_no = 4600 * 1800
+    mode_no = 6400
     x = np.linspace(0.0, 10.0, pos_no)
     y = np.linspace(-5.0, 5.0, pos_no)
     z = np.linspace(-6.0, 8.0, pos_no)
> /usr/bin/time ./target/release/examples/incompr
4151.12user 19.38system 8:53.53elapsed 781%CPU (0avgtext+0avgdata 6405948maxresident)k
0inputs+0outputs (0major+3149538minor)pagefaults 0swaps