Slow simulation NS2D over a biperiodic space?

kburns commented 6 years ago

Original report (archived issue) by Pierre Augier (Bitbucket: paugier).

The original report had attachments: ns2d_rot.py, dedalus.cfg

Hi,

I am Pierre Augier, one of the developers of fluidsim (https://bitbucket.org/fluiddyn/fluidsim).

In order to compare our performance with other pseudo-spectral CFD codes, we tried Dedalus.

I suspect that we are not doing it the right way (see our script https://bitbucket.org/fluiddyn/fluidsim/src/master/bench/dedalus/ns2d_rot.py) because Dedalus is much (approximately 30 times on my computer) slower than the other codes.

Since we are going to include these comparisons in an article, we would like to get the best out of Dedalus. Is there something that I can do to get better performance with Dedalus for this very simple case (NS2D over a biperiodic space, 10 time steps)?

I tried with 512**2 and 1024**2 and I got similar results.

kburns commented 6 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

Hi Pierre,

Thanks for your interest in comparing Dedalus to fluidsim -- it looks like a great project.

It looks like the version of the script in your repository is currently throwing a singular matrix error because the gauge of the streamfunction isn't specified, so I've modified the equations a bit to set it to zero (script attached). With those changes, and the default Dedalus settings, I'm seeing a baseline time running the script serially with n=256 of T0 = 4.39 seconds on my laptop. There are three major improvements I'd recommend making:

1) Dedalus lazily constructs the required transform and transpose plans the first time they are required, which is typically during the first timestep. This means the first timestep should usually be considered as a startup cost, and not indicative of the simulation speed. If I simply copy your main loop to run 10 startup iterations, and then time the following 10, I get a time of T1 = 4.25 seconds. Note this startup cost should become less important at higher resolutions, but maybe more important in parallel (due to transpose planning).

2) The most important thing for improving performance is to set the "STORE_LU" option to True in the Dedalus configuration file. This will store and re-use the LU factorization of the LHS matrices when the timestep is unchanged from the previous iteration. It is currently off by default (which we should probably change), because the LU factorization library wrapped in Scipy can have an enormous memory footprint, and we were leaning towards stability over speed for the default settings. Changing this flag, I get a time of T2 = 1.38 seconds.

3) Finally, I noticed you're using the RK443 timestepper. This is a 4-stage 3rd order Runge-Kutta method, which will be evaluating the RHS expressions and solving the LHS matrices 4 times per iteration. If you're using the same method for other codes, that's ok, but otherwise it's probably most fair to pick timesteppers with the same number of solves per iteration. A good substitute might be SBDF3, which is a 3rd order multistep method that only uses one solve per iteration. Switching to SBDF3, I get a time of T3 = 0.38 seconds.
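
The startup/timing split from point 1 can be sketched generically as follows. This is a minimal sketch, not the attached script: `time_steps` and `LazyStepper` are hypothetical names, and the dummy stepper only stands in for a solver whose first call carries a one-off planning cost.

```python
import time


def time_steps(step, n_startup=10, n_timed=10):
    """Run n_startup untimed calls to step(), then time n_timed further calls."""
    for _ in range(n_startup):
        step()  # absorbs lazy plan construction and other one-off costs
    start = time.perf_counter()
    for _ in range(n_timed):
        step()
    return time.perf_counter() - start


class LazyStepper:
    """Hypothetical stand-in for a solver that builds its plans on first use."""

    def __init__(self):
        self.plans_built = False
        self.calls = 0

    def __call__(self):
        if not self.plans_built:
            time.sleep(0.05)  # pretend that plan construction is expensive
            self.plans_built = True
        self.calls += 1


stepper = LazyStepper()
elapsed = time_steps(stepper, n_startup=10, n_timed=10)
print(f"Timed loop: {elapsed:.6f} s for 10 steps ({stepper.calls} calls in total)")
```

With a real solver, `step` would wrap whatever call advances the simulation by one iteration.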

I'd also point out that Dedalus doesn't implement any fully explicit timesteppers -- they are all IMEX schemes, which may make comparisons to fully explicit codes a little tricky, since you're trading off speed-per-iteration for stability with larger timesteps. From our previous comparisons, we very roughly expect Dedalus to be 2-4x slower than other implicitly-timestepped Fourier pseudospectral codes -- I think it's fair to say that our focus so far has been optimizing for bounded domains with Chebyshev methods.

Best, -Keaton

kburns commented 6 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

set attachment to "ns2d_rot.py"

ns2d_rot.py, updated to set streamfunction gauge, use SBDF3, and separate startup loops from timing loops.

kburns commented 6 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

set attachment to "dedalus.cfg"

Configuration file with STORE_LU set to True.

kburns commented 6 years ago

Original comment by Pierre Augier (Bitbucket: paugier).

Hi Keaton,

Thank you for your nice answer.

For simplicity and to be fair with all codes, I will simply compare the elapsed time for 10 RK4 time steps. All codes implement a Runge-Kutta 4 scheme and it is often a good and simple choice for real life simulations.

For the considered case (NS2D, Fourier-Fourier, RK4), Dedalus is indeed quite slow (~15 times slower than fluidsim). Of course I'm going to point out that Dedalus is very versatile and that it has been more optimized for bounded domains with Chebyshev methods.

kburns commented 6 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

Hi Pierre,

I'm not sure it's the right comparison -- if I understand correctly, the other codes are implementing the classic 4-stage explicit RK4 method, correct? Our RK443 is NOT this scheme. It is a 4-stage, third-order mixed implicit-explicit scheme described in Ascher et al. 1997. Classic RK4 is fully explicit, whereas RK443 performs four implicit matrix solves per iteration. They are very different schemes with very different stability properties, with the IMEX scheme allowing for much larger timestep sizes in practice.
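
To illustrate the stability difference between the two families, here is a toy comparison on the stiff linear equation dy/dt = -lam*y. It is only a sketch: the implicit side uses backward Euler as the simplest possible implicit solve, not the RK443 scheme, and the function names and parameter values are made up for the illustration.

```python
def rk4_step(y, dt, lam):
    """One classic explicit 4-stage RK4 step for dy/dt = -lam * y."""

    def f(v):
        return -lam * v

    k1 = f(y)
    k2 = f(y + 0.5 * dt * k1)
    k3 = f(y + 0.5 * dt * k2)
    k4 = f(y + dt * k3)
    return y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def implicit_step(y, dt, lam):
    """One backward Euler step: solve (1 + dt*lam) * y_new = y."""
    return y / (1 + dt * lam)


# dt*lam = 10 lies well outside the stability region of explicit RK4
lam, dt, n_steps = 1000.0, 0.01, 10
y_explicit = y_implicit = 1.0
for _ in range(n_steps):
    y_explicit = rk4_step(y_explicit, dt, lam)
    y_implicit = implicit_step(y_implicit, dt, lam)

print(f"explicit RK4:   |y| = {abs(y_explicit):.3e}")
print(f"backward Euler: |y| = {abs(y_implicit):.3e}")
```

The exact solution decays, yet the explicit iterate grows at this step size while the implicit one decays. In an IMEX scheme only the stiff linear terms get the implicit treatment, and the nonlinear terms are still advanced explicitly.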

Since the codes do not implement comparable methods, perhaps a better test of performance is to compute the time necessary to compute a particular solution within a given accuracy, allowing for different timesteps between different integrators? We'd be happy to help set this up if you're interested.

Best, -Keaton

kburns commented 6 years ago

Original comment by Pierre Augier (Bitbucket: paugier).

Ok I understand your point. Does Dedalus not also implement the classic RK4 method? Or the classical RK2 method?

I can't download the article (Elsevier) so I can't really study this RK443 scheme. Are the equations summarized in the documentation of Dedalus or in another open document that I could get? How do you choose the value of the time step for this scheme? Is it based on a CFL coefficient?

Note that the linear terms are treated fully implicitly in some of the other codes (exact integration).

Time stepping is a complicated subject (and there is also the issue of phase shifting which changes everything!), so it is not simple to compare the performance of different schemes. This is why I would prefer to compare the raw performance of the codes with a standard and simple time stepping method.

kburns commented 6 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

Hi Pierre, sorry for the delay, I was wrapping up my thesis and then took some time off! Currently, we just implement IMEX schemes, so no fully explicit methods or exponential-explicit methods, since these aren't practical for the matrices that come from Chebyshev discretizations. The tableaus of the implemented schemes are listed in the timesteppers.py module, and the general forms of both the IMEX RK schemes and the IMEX multistep schemes are given in the class docstrings there.

For a fluid simulation, the timestep is usually based on a CFL coefficient when the viscous terms are integrated implicitly. In practice we find that the maximum stable safety factor can vary by a substantial amount for different integrators depending on the equation set, which is why we took the approach of just implementing a range of options and letting the user test and pick the best option for their specific equations.
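
As a rough illustration of a CFL-based timestep, the step is limited by the time the flow needs to cross one grid cell, scaled by a safety factor. This is a generic sketch, not Dedalus's own implementation; `cfl_timestep` and its arguments are hypothetical.

```python
def cfl_timestep(u_max, v_max, dx, dy, safety=0.5, max_dt=float("inf")):
    """Advective CFL timestep: safety / (|u|/dx + |v|/dy), capped at max_dt."""
    crossing_rate = abs(u_max) / dx + abs(v_max) / dy  # cell crossings per unit time
    if crossing_rate == 0:
        return max_dt  # fluid at rest: no advective limit
    return min(max_dt, safety / crossing_rate)


# Example: unit velocities on a grid with spacing 0.1 in both directions
dt = cfl_timestep(u_max=1.0, v_max=1.0, dx=0.1, dy=0.1, safety=0.5)
print(dt)
```

The `safety` argument is the factor whose largest stable value depends on the integrator and the equation set, so it is the quantity to tune when testing different timesteppers.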

We've thought a bit about implementing some exponential timesteppers which should speed things up for constant-coefficient, fully-Fourier problems, but haven't gotten around to this yet since we're all primarily using Chebyshev discretizations in our research. This would also be a welcome pull-request if anyone reading would like to take a crack at it!

kburns commented 6 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

Hi Pierre, another big thing to check -- are you trying to compare to other spectral codes using 512 x 512 dealiased modes or a 512 x 512 grid? In Dedalus, the "resolution" of the bases corresponds to dealiased modes, and dealiasing is done by padding the modes by 3/2 before transforming, so these Dedalus simulations correspond to a grid size of 768 x 768. If you're comparing to other codes which start with a 512 x 512 grid and apply a 2/3 truncation to dealias, then the right comparison would be to set the Dedalus basis resolution to 341, and the dealias keyword to 512/341 to end up on a 512 x 512 grid.
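
The arithmetic behind these numbers can be checked with a few lines of Python. The helper names are hypothetical and this is only a sketch of the bookkeeping, not of how Dedalus sizes its grids internally.

```python
from fractions import Fraction


def grid_size_from_modes(n_modes, dealias=Fraction(3, 2)):
    """Grid points used for transforms when n_modes are padded by `dealias`."""
    return int(n_modes * dealias)


def modes_from_grid(n_grid, truncation=Fraction(2, 3)):
    """Modes kept when an n_grid-point grid is dealiased by 2/3 truncation."""
    return int(n_grid * truncation)


print(grid_size_from_modes(512))               # 768: padded grid for 512 modes
n_modes = modes_from_grid(512)                 # 341 modes kept on a 512 grid
dealias = Fraction(512, n_modes)               # padding factor 512/341
print(n_modes, float(dealias))
print(grid_size_from_modes(n_modes, dealias))  # 512: back on the original grid
```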

kburns commented 5 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

Ok I took a closer look at the script, and noticed that there are also big improvements we can make to the problem formulation. In Dedalus, only Chebyshev problems need to be reduced to first order, but higher-order derivatives are fine with Fourier bases. This means all of the diagnostic equations here can actually be replaced with substitution rules relating rot, u, and v to psi. Making these changes also speeds up the code quite a bit, in addition to compensating for the different dealiasing strategies. Current timings on my laptop look like:

FluidSim:
512^2 grid: 0.56 sec
1024^2 grid: 2.76 sec

Old Dedalus script:
512^2 modes: 5.73 sec
1024^2 modes: 26.93 sec

Updated Dedalus script:
512^2 grid: 1.19 sec
1024^2 grid: 6.78 sec

I'll post this over on the FluidSim issue as well.
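
To see why no auxiliary first-order variables are needed with Fourier bases: each derivative acts on a Fourier mode as multiplication by i*k, so rot, u, and v follow algebraically from psi. The sketch below shows this for a single mode. It is not the Dedalus script; `velocity_and_vorticity` is a hypothetical name, and the sign convention (u = dy(psi), v = -dx(psi), rot = dx(v) - dy(u)) is an assumption that may differ from the one used in ns2d_rot.py.

```python
def velocity_and_vorticity(psi_hat, kx, ky):
    """Return (u_hat, v_hat, rot_hat) for one Fourier mode of psi.

    Assumed convention: u = dy(psi), v = -dx(psi), rot = dx(v) - dy(u).
    """
    u_hat = 1j * ky * psi_hat    # dy acts as multiplication by i*ky
    v_hat = -1j * kx * psi_hat   # dx acts as multiplication by i*kx
    rot_hat = 1j * kx * v_hat - 1j * ky * u_hat
    return u_hat, v_hat, rot_hat


u_hat, v_hat, rot_hat = velocity_and_vorticity(psi_hat=1.0 + 0.0j, kx=1, ky=2)
# With this convention rot_hat equals (kx**2 + ky**2) * psi_hat
print(u_hat, v_hat, rot_hat)
```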

kburns commented 5 years ago

Original comment by Pierre Augier (Bitbucket: paugier).

I confirm the nice improvement for Dedalus! Here are my measurements:

augier3pi@meige8pcpa79:~/Dev/fluidsim/bench/dedalus$ time python ns2d_rot_faster.py
2018-10-24 10:17:24,379 pencil 0/1 INFO :: Building pencil matrix 1/171 (~1%) Elapsed: 0s, Remaining: 38s, Rate: 4.5e+00/s
2018-10-24 10:17:28,216 pencil 0/1 INFO :: Building pencil matrix 18/171 (~11%) Elapsed: 4s, Remaining: 35s, Rate: 4.4e+00/s
2018-10-24 10:17:32,193 pencil 0/1 INFO :: Building pencil matrix 36/171 (~21%) Elapsed: 8s, Remaining: 30s, Rate: 4.5e+00/s
2018-10-24 10:17:34,197 pencil 0/1 INFO :: Building pencil matrix 45/171 (~26%) Elapsed: 10s, Remaining: 28s, Rate: 4.5e+00/s
2018-10-24 10:17:36,202 pencil 0/1 INFO :: Building pencil matrix 54/171 (~32%) Elapsed: 12s, Remaining: 26s, Rate: 4.5e+00/s
2018-10-24 10:17:40,211 pencil 0/1 INFO :: Building pencil matrix 72/171 (~42%) Elapsed: 16s, Remaining: 22s, Rate: 4.5e+00/s
2018-10-24 10:17:44,248 pencil 0/1 INFO :: Building pencil matrix 90/171 (~53%) Elapsed: 20s, Remaining: 18s, Rate: 4.5e+00/s
2018-10-24 10:17:48,339 pencil 0/1 INFO :: Building pencil matrix 108/171 (~63%) Elapsed: 24s, Remaining: 14s, Rate: 4.5e+00/s
2018-10-24 10:17:52,393 pencil 0/1 INFO :: Building pencil matrix 126/171 (~74%) Elapsed: 28s, Remaining: 10s, Rate: 4.5e+00/s
2018-10-24 10:17:54,160 pencil 0/1 INFO :: Building pencil matrix 134/171 (~78%) Elapsed: 30s, Remaining: 8s, Rate: 4.5e+00/s
2018-10-24 10:17:56,405 pencil 0/1 INFO :: Building pencil matrix 144/171 (~84%) Elapsed: 32s, Remaining: 6s, Rate: 4.5e+00/s
2018-10-24 10:18:00,536 pencil 0/1 INFO :: Building pencil matrix 162/171 (~95%) Elapsed: 36s, Remaining: 2s, Rate: 4.5e+00/s
2018-10-24 10:18:02,561 pencil 0/1 INFO :: Building pencil matrix 171/171 (~100%) Elapsed: 38s, Remaining: 0s, Rate: 4.5e+00/s
Starting startup loop...
Run time for startup loop: 2.106695
Starting main time loop...
Run time for main loop: 1.607797

real    0m43.280s
user    0m42.668s
sys     0m0.820s

augier3pi@meige8pcpa79:~/Dev/fluidsim/bench/dedalus$ time fluidsim-bench 512 -d 2 -s ns2d -it 10
nh = (512, 512); Lh = (8, 8)
running a benchmark simulation... done.
10 time steps computed in 0.51 s
results benchmarks saved in
/tmp/fluidsim_bench/result_bench_ns2d_512x512_np=1_default_2018-10-24_10-33-2516769.json

Cleaning up simulation.

real    0m2.603s
user    0m2.132s
sys     0m0.600s

kburns commented 5 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

Great, thanks for taking another look!

kburns commented 5 years ago

Original comment by Keaton Burns (Bitbucket: kburns).

changed state from "new" to "resolved"

DedalusProject / dedalus

Slow simulation NS2D over a biperiodic space? #38