Closed kburns closed 10 months ago
@lecoanet any hesitation to merge?
This looks really good to me!
Have we done any performance testing of this at scale? That is my hesitation. I feel like we have tested lots of things that seem like they should make things faster, but actually slowed things down, so it would be good to know. But I don't have a good performance test. I guess we could use a higher-resolution version of some of the 3D test problems?
On 64 cores on Flatiron, here are the runtimes of the internally heated convection example in the ball. Resolution is (128, 64, 96) and it's about 1000 iterations.
master: 182.6s new: 180.9s
I tested two problems on 2048 Pleiades ivybridge cores:
3D cartesian convection/stably stratified simulation, Nx=Ny=256, matched Chebyshev with Nz=512+256. Ran for 100 iterations (after 10 iteration startup). I ran a couple of times and found the timing appeared to be reproducible.
master: 66.2 sec PR: 66.8 sec
3D shell example problem, Nphi=1024, Ntheta=512, Nr=512. Ran for 100 iterations (after 10 iteration startup). Here are the run times running it twice:
master: 197.0, 197.1 PR: 195.7, 197.0
So for larger problems, the change seems to affect the run time by at most ~1% in either direction, so I'm happy with the performance.
I've moved the head of master back to 5a8bbaf because the merge here included updates for Cython that subsequently broke. That's now a separate branch, which I'll rebase and merge into this one once Cython is fixed.
This PR takes advantage of the fact that our preconditioners are now semi-orthonormal, so their pseudoinverses are trivial to compute and apply. This means we no longer need to store non-right-preconditioned versions of the matrices for dot products during timestepping. We can therefore just store fully preconditioned versions, and apply the various preconditioners during the gather and scatter calls for each subproblem. This really simplifies the timestepping and other solver code.
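To illustrate why semi-orthonormality makes the pseudoinverse trivial: for a matrix P with orthonormal columns, PᴴP = I, so the Moore-Penrose pseudoinverse is just the conjugate transpose. This is a minimal NumPy sketch of that property, not the PR's actual preconditioner code:

```python
import numpy as np

# A matrix with orthonormal columns (semi-orthonormal), built via QR
# purely for illustration -- not the actual Dedalus preconditioners.
rng = np.random.default_rng(42)
P, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # P is 6x3, P.T @ P = I

# Since P.T @ P = I, the pseudoinverse of P is simply its transpose.
assert np.allclose(P.T @ P, np.eye(3))
assert np.allclose(np.linalg.pinv(P), P.T)

# Applying the pseudoinverse is then a cheap transpose-multiply,
# with no factorization or explicit pinv computation needed.
x = rng.standard_normal(6)
assert np.allclose(P.T @ x, np.linalg.pinv(P) @ x)
```

This is what makes it cheap to apply the preconditioners on the fly during gather/scatter rather than storing non-preconditioned copies of the matrices.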
The PR also includes some changes to reduce allocations during the timestepping loop. In particular, views into the field variables are cached for each subsystem to speed up the gather/scatter steps. The only remaining allocations in the timestepping, of either buffers or new array views, should be inside the F evaluation or in the matsolver itself.
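The view-caching idea can be sketched as follows. This is a hypothetical toy (the `Subsystem` class and `gather` signature are invented for illustration), but it shows the pattern: create the NumPy views once at setup, then copy through them in the hot loop so no new arrays are allocated per step:

```python
import numpy as np

class Subsystem:
    """Toy subsystem that caches a view into the field data at setup."""

    def __init__(self, field_data, sl):
        # The view is created once here; it aliases field_data, so it
        # stays in sync with the fields without per-step allocation.
        self.view = field_data[sl]

    def gather(self, out):
        # Copy into a preallocated buffer; no new arrays are created.
        np.copyto(out, self.view)

# Setup: field data and one subsystem covering a slice of it.
data = np.arange(12.0)
ss = Subsystem(data, slice(3, 6))
buf = np.empty(3)  # preallocated once, reused every timestep

ss.gather(buf)
assert np.allclose(buf, [3.0, 4.0, 5.0])

# Because the cached view aliases the field data, later updates to the
# fields are visible through it automatically.
data[3] = 99.0
ss.gather(buf)
assert buf[0] == 99.0
```

The point is that everything allocation-shaped (views, buffers) is hoisted out of the loop, matching the claim that only the F evaluation and the matsolver allocate during timestepping.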