Implementing multicore simulations in #212 means one can throw many CPUs at a simulation, which can drastically reduce the runtime.
However, the code contains a number of inefficiencies. This is clearly highlighted by the atrocious scaling of the 1.5 layer benchmark: above ~10 cores the simulation actually takes longer as more cores are added.
Issues to be addressed:
There are too many exchange calls. Many of these can be removed by making better use of the halo regions. Once per time step we should update the halos for `u`, `v`, `h`, and `eta`. All the other exchange calls are extraneous and can be replaced by improved use of the halos. Subroutines that contain exchange calls to remove:
- [x] `evaluate_zeta`
- [x] `evaluate_dhdt`
- [x] `evaluate_dudt`
- [x] `evaluate_dvdt`
- [x] `evaluate_b_iso`
- [x] `evaluate_b_RedGrav`
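The once-per-step pattern could look something like the sketch below. This is a NumPy stand-in for the Fortran code (the function name `update_halos` and a halo width of 1 are assumptions, and a single periodic tile stands in for the MPI neighbours):

```python
import numpy as np

HALO = 1  # assumed halo width in grid points

def update_halos(*fields):
    """Fill each field's halo columns once per time step.
    A periodic single tile stands in for the real MPI exchange:
    the left halo gets the rightmost interior column, and vice versa."""
    for f in fields:
        f[:, :HALO] = f[:, -2 * HALO:-HALO]   # fill left halo
        f[:, -HALO:] = f[:, HALO:2 * HALO]    # fill right halo

# one halo update for u, v, h, eta per time step; the evaluate_*
# subroutines then read the halos instead of calling exchange themselves
u, v, eta = (np.zeros((4, 6)) for _ in range(3))
h = np.ones((4, 6))
h[:, 1] = 2.0                  # marker in h's leftmost interior column
update_halos(u, v, h, eta)
# h's right halo column now holds the marker value 2.0
```

After this single update, every tendency routine sees consistent halo values, which is exactly why removing their individual exchange calls should be safe once the halos are kept up to date.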
The exchange calls themselves are also inefficient. Currently the entire array is exchanged between all the cores because it was conceptually easier, and I wanted to get something that worked. In reality, we only need to exchange the halo regions between the core that owns them and the core(s) that need them. By using point-to-point `MPI_SEND` and `MPI_RECV` we should be able to cut down the time spent communicating, giving a boost to both performance and scaling.
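A rough back-of-the-envelope comparison shows how much data this saves. The grid size, core count, and 1-D decomposition below are illustrative assumptions, not the benchmark's actual configuration:

```python
# Illustrative numbers: 480x480 grid, 16 cores, 1-D decomposition, halo width 1
nx, ny, halo, ncores = 480, 480, 1, 16

# current scheme: every core exchanges the full array with every other core
full_exchange = ncores * (ncores - 1) * nx * ny

# point-to-point scheme: each core sends only a halo strip to each of its
# (at most 2) neighbours in a 1-D decomposition
halo_exchange = ncores * 2 * halo * ny

print(full_exchange // halo_exchange)  # → 3600
```

Even with these toy numbers the all-to-all full-array exchange moves thousands of times more data than halo-only point-to-point messages, and the gap widens as the core count grows.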
A number of subroutines create array temporaries because they are being passed discontiguous sections of arrays. Addressing this can apparently improve both performance and scaling.
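The same contiguity issue can be demonstrated with NumPy as a stand-in for the Fortran behaviour: when a strided array section is passed to a routine that expects contiguous data, the compiler (or runtime) has to allocate and fill a temporary copy.

```python
import numpy as np

a = np.zeros((512, 512), order='F')  # Fortran (column-major) memory layout

col = a[:, 3]   # one column: adjacent in memory in column-major order -> a view
row = a[3, :]   # one row: strided in memory -> passing it to a routine that
                # expects contiguous storage forces a temporary copy

print(col.flags['C_CONTIGUOUS'], row.flags['C_CONTIGUOUS'])  # True False
```

In the Fortran code the fix is usually to pass whole arrays (or contiguous sections) and index inside the subroutine, rather than handing strided sections across the call boundary.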
I think the easiest path forwards is to make better use of the halos and eliminate the extraneous exchange calls. For a quick sanity check I removed the calls to see what would happen; the scaling performance improves, but the test suite fails because there are slight differences at the tile edges.