DedalusProject / dedalus

A flexible framework for solving PDEs with modern spectral methods.
http://dedalus-project.org/
GNU General Public License v3.0

Fully precondition matrices #269

Closed kburns closed 10 months ago

kburns commented 11 months ago

This PR takes advantage of the fact that our preconditioners are now semi-orthonormal, so their pseudoinverses are trivial to compute and apply. This means we no longer need to store non-right-preconditioned versions of the matrices for dot products during timestepping. We can therefore just store fully preconditioned versions, and apply the various preconditioners during the gather and scatter calls for each subproblem. This really simplifies the timestepping and other solver code.
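The key fact here can be sketched numerically: for a matrix with orthonormal columns ("semi-orthonormal" in the sense above), the Moore-Penrose pseudoinverse reduces to the transpose, so applying it costs no more than a matrix-vector product. A minimal illustration with numpy (the matrix here is a random stand-in, not an actual Dedalus preconditioner):

```python
import numpy as np

# Illustration only: a matrix P with orthonormal columns satisfies
# P.T @ P = I, so its Moore-Penrose pseudoinverse is simply P.T.
rng = np.random.default_rng(0)

# Build a tall matrix with orthonormal columns via reduced QR.
P, _ = np.linalg.qr(rng.standard_normal((5, 3)))

# pinv(P) = (P.T @ P)^-1 @ P.T = P.T when the columns are orthonormal.
print(np.allclose(np.linalg.pinv(P), P.T))  # True
```

This is why no separate non-preconditioned matrices need to be stored: undoing the preconditioner is just a transpose application.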

The PR also includes some changes to reduce allocations during the timestepping loop. In particular, views into the field variables are cached for each subsystem to speed up the gather/scatter steps. The only allocations in the timestepping, of either buffers or new array views, should just be inside the F evaluation or by the matsolver itself.
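The view-caching idea can be sketched as follows. This is a schematic, not Dedalus's actual solver code: the slices and the `gather` helper are hypothetical, and the point is only that slicing an array allocates a new view object each time, so computing the views once outside the loop removes those per-iteration allocations.

```python
import numpy as np

# Hypothetical state vector partitioned among three "subsystems".
state = np.zeros(100)
subsystem_slices = [slice(0, 30), slice(30, 60), slice(60, 100)]

# Cache the view objects once; each state[s] would otherwise allocate
# a fresh ndarray view object on every timestep.
cached_views = [state[s] for s in subsystem_slices]

def gather(views, buffer):
    """Copy each cached view into a contiguous work buffer, allocation-free."""
    offset = 0
    for v in views:
        buffer[offset:offset + v.size] = v
        offset += v.size

# The views track in-place updates to the underlying state.
state[:] = np.arange(100)
work = np.empty(100)
gather(cached_views, work)
print(np.array_equal(work, state))  # True
```

Because the views alias the underlying array, they stay current as the state is updated in place, so the cached objects never go stale.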

kburns commented 10 months ago

@lecoanet any hesitation to merge?

jsoishi commented 10 months ago

This looks really good to me!

lecoanet commented 10 months ago

Have we done any performance testing of this at scale? That is my hesitation. I feel like we have tested lots of things that seem like they should make things faster, but actually slowed things down, so it would be good to know. But I don't have a good performance test. I guess we could use a higher-resolution version of some of the 3D test problems?

kburns commented 10 months ago

On 64 cores on Flatiron, here are the runtimes of the internally heated convection example in the ball. Resolution is (128, 64, 96) and it's about 1000 iterations.

master: 182.6s new: 180.9s

lecoanet commented 10 months ago

I tested two problems on 2048 Pleiades ivybridge cores:

3D cartesian convection/stably stratified simulation, Nx=Ny=256, matched Chebyshev with Nz=512+256. Ran for 100 iterations (after 10 iteration startup). I ran a couple of times and found the timing appeared to be reproducible.

master: 66.2 sec PR: 66.8 sec

3D shell example problem, Nphi=1024, Ntheta=512, Nr=512. Ran for 100 iterations (after 10 iteration startup). Here are the run times running it twice:

master: 197.0, 197.1 PR: 195.7, 197.0

So for larger problems, it seems like this may hurt or help by ~1% of the run time, so I'm happy with the performance.

kburns commented 10 months ago

I've moved the head of master back to 5a8bbaf because the merge here included updates to Cython which subsequently quickly broke. Those updates are now on a separate branch, which I'll rebase and merge into this once Cython is fixed.