boutproject / BOUT-dev

BOUT++: Plasma fluid finite-difference simulation code in curvilinear coordinate systems
http://boutproject.github.io/
GNU Lesser General Public License v3.0

Performance regression in next #2470

ZedThree closed this issue 2 years ago

ZedThree commented 2 years ago

I noticed some quite alarming performance regressions between master and next, around a factor 5 slow down. I've looked at the conduction and blob2d examples. Built with CMake -DCMAKE_BUILD_TYPE=Release -DCHECK=0.

Same number of iterations, but rhs massively slower. Probably going to need to bisect to find out when this happened.

If other people could verify this too, that would be useful.

bshanahan commented 2 years ago

Can confirm for blob2d, but not for conduction, which runs about the same (same RHS; sometimes slightly slower, sometimes slightly faster depending on the timestep).

dschwoerer commented 2 years ago

Might it be related that the hermes-2 model run has 99.8% of its double-precision instructions performed as scalar rather than SIMD?

d7919 commented 2 years ago

> Might it be related that the hermes-2 model run has 99.8% of its double-precision instructions performed as scalar rather than SIMD?

That's an interesting observation. The BOUT_FOR loops are meant to help things vectorise, so if this isn't happening that's a good place to start looking.

Very little has changed here between master and next, so probably not that.
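For context, the loop shape that BOUT_FOR aims for can be sketched as follows. This is an illustrative stand-alone example, not the BOUT++ API: a flat pass over contiguous doubles with no per-point function calls, which the compiler can auto-vectorise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: a contiguous, data-parallel loop body with no calls
// per point is a straightforward candidate for compiler auto-vectorisation.
std::vector<double> axpy_flat(double a, const std::vector<double>& x,
                              const std::vector<double>& y) {
  std::vector<double> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = a * x[i] + y[i]; // contiguous loads/stores: SIMD-friendly
  }
  return out;
}
```

If a loop like this ends up mostly scalar, something in the body (an opaque call, aliasing, or non-contiguous access) is usually blocking vectorisation.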

d7919 commented 2 years ago

For blob2d it looks like the difference is coming from Delp2. For reference, this model uses raw DDZ, bracket with the simple method (using VDDX, DDX, DDZ, VDDZ), and Delp2. With Delp2 removed, master gave me 1m59s and next gave me 2m8s, so I'm considering these to be effectively the same here.

d7919 commented 2 years ago

OK, I think I've tracked this down. Delp2 uses Laplacian::tridagCoefs with the overload in which we don't specify kwave. In master we have kwave = jz * 2.0 * PI / coords->zlength(); whilst in next we have kwave = jz * 2.0 * PI / coords->zlength()(jx,jy);. It seems that indexing into zlength incurs a significant penalty here. I've confirmed that replacing this in next with the scalar zlength value from master recovers the performance of master.
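The cost pattern can be sketched with a toy example. ToyField2D below is a hypothetical stand-in for a Field2D (not the real BOUT++ class); the point is that a per-point lookup like zlength()(jx, jy) sitting inside the innermost z loop is invariant in jz, so it repeats loop-invariant work that could be hoisted (or cached) once per (jx, jy):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2D field indexed by (jx, jy); illustrative only.
struct ToyField2D {
  int nx, ny;
  std::vector<double> data;
  ToyField2D(int nx_, int ny_, double v)
      : nx(nx_), ny(ny_), data(static_cast<std::size_t>(nx_) * ny_, v) {}
  double operator()(int jx, int jy) const {
    return data[static_cast<std::size_t>(jx) * ny + jy];
  }
};

constexpr double PI = 3.141592653589793;

// Pattern resembling next: the zlength lookup repeats for every jz.
double kwave_sum_inner_lookup(const ToyField2D& zlength, int nz) {
  double sum = 0.0;
  for (int jx = 0; jx < zlength.nx; ++jx)
    for (int jy = 0; jy < zlength.ny; ++jy)
      for (int jz = 0; jz < nz; ++jz)
        sum += jz * 2.0 * PI / zlength(jx, jy); // repeated invariant work
  return sum;
}

// Hoisted pattern: the (jx, jy) lookup happens once per column of z points.
double kwave_sum_hoisted(const ToyField2D& zlength, int nz) {
  double sum = 0.0;
  for (int jx = 0; jx < zlength.nx; ++jx)
    for (int jy = 0; jy < zlength.ny; ++jy) {
      const double inv_zlength = 1.0 / zlength(jx, jy); // hoisted out of jz
      for (int jz = 0; jz < nz; ++jz)
        sum += jz * 2.0 * PI * inv_zlength;
    }
  return sum;
}
```

In the real code the penalty is worse than a plain array lookup, because (as discussed below in the thread) each zlength() call also constructs a temporary Field2D.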

d7919 commented 2 years ago

zlength() in next returns a Field2D, although without 3D metrics the calculation is scalar, so an implicit cast/construction from scalar to Field2D must be happening. Maybe this is somewhat slow? Worth noting that this happens at the bottom of a loop over {x, y, z}, so it is done a lot.

Is there a reason not to precompute/cache zlength?

d7919 commented 2 years ago

Sorry, I forgot that dz is now a Field2D/3D in next, so this is always a Field2D * BoutReal multiplication rather than a conversion.

d7919 commented 2 years ago

Hacking in the following recovers the performance of master on next (caching is only implemented for the 2D-metric branch).

  FieldMetric dx, dy, dz; ///< Mesh spacing in x, y and z
  Field2D zlength_cache;
  Field2D zlength() {
#if BOUT_USE_METRIC_3D
    // With a 3D metric, dz can vary in z, so sum dz along z on each call
    Field2D result(0., localmesh);
    BOUT_FOR_SERIAL(i, dz.getRegion("RGN_ALL")) { result[i] += dz[i]; }
    return result;
#else
    // With a 2D metric, dz is constant in z: compute once and cache
    static bool visited = false;
    if (!visited) {
      visited = true;
      zlength_cache = dz * nz;
    }
    return zlength_cache;
#endif
  } ///< Length of the Z domain. Used for FFTs

dschwoerer commented 2 years ago

The reason not to pre-compute is that dz might get normalised; always recomputing ensures this stays consistent.

As this is a performance issue, we could either have the calling function cache the result, or just use the cached value and, with CHECK >= 3/4, recompute and compare.
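That "use the cache, but verify it at high check levels" idea could look roughly like the sketch below. CachedZlength and CHECK_LEVEL are illustrative names, not the BOUT++ interface, and the cached quantity is reduced to a scalar for brevity:

```cpp
#include <cassert>
#include <optional>

// Illustrative check-level switch; BOUT++'s real CHECK macro differs.
#ifndef CHECK_LEVEL
#define CHECK_LEVEL 3
#endif

class CachedZlength {
public:
  CachedZlength(double dz, int nz) : dz_(dz), nz_(nz) {}

  // A change to dz (e.g. normalisation) must invalidate the cache.
  void setDz(double dz) {
    dz_ = dz;
    cache_.reset();
  }

  double zlength() {
    if (!cache_) {
      cache_ = compute(); // fill the cache on first use
    }
#if CHECK_LEVEL >= 3
    assert(*cache_ == compute()); // expensive consistency check
#endif
    return *cache_;
  }

private:
  double compute() const { return dz_ * nz_; }
  double dz_;
  int nz_;
  std::optional<double> cache_;
};
```

With this shape, production builds (CHECK=0) pay only the cached read, while debug builds catch any path that mutates dz without going through the setter.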

d7919 commented 2 years ago

Thanks, that makes sense.

It sounds like making caching opt-in might be reasonable (favouring correctness over performance), but that does then require the user to be aware of it. There may be a few things we wish to cache; could we at some point mark the coordinates as "frozen/fixed" and then calculate the cached values? Apologies if this isn't possible/sensible, it has been a while since I've written a model.

dschwoerer commented 2 years ago

I am not sure about caching being opt-in. If the user is aware of it, it should always be enabled; on the other hand, for new users it looks like a huge performance regression that we should try to avoid.

We could calculate this after the user init function (something like mesh->doCache(), which in turn calls coordinates->doCache()). After that point I think it is very unlikely anything will be changed, and if it is, the user has to call doCache again. If we want, with CHECK=4 we could verify that the cached values are correct. I think we should advocate somewhere to always run a model with CHECK=3/4 for a few time steps (with the full grid?), and only then switch to CHECK=0 for production runs.
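A minimal sketch of the doCache() idea, assuming the object is frozen after the user init function. All names here are hypothetical, not the actual Mesh/Coordinates interface, and the cached quantity is again a scalar for brevity:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical coordinates object: mutable until doCache(), frozen after.
class ToyCoordinates {
public:
  ToyCoordinates(double dz, int nz) : dz_(dz), nz_(nz) {}

  // Mutation is only allowed before doCache() freezes the object.
  void setDz(double dz) {
    if (frozen_) {
      throw std::logic_error("Coordinates modified after doCache()");
    }
    dz_ = dz;
  }

  // Called after PhysicsModel::init(): compute cached quantities, then freeze.
  void doCache() {
    zlength_cache_ = dz_ * nz_;
    frozen_ = true;
  }

  // Cheap cached read after freezing; recompute before that.
  double zlength() const { return frozen_ ? zlength_cache_ : dz_ * nz_; }

private:
  double dz_;
  int nz_;
  double zlength_cache_ = 0.0;
  bool frozen_ = false;
};
```

Throwing on post-freeze mutation is one possible policy; silently unfreezing and requiring an explicit second doCache() call, as suggested above, would also work.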

johnomotani commented 2 years ago

I'd expect that the Coordinates should almost always be fixed after PhysicsModel::init() is done (anyone who wants time-varying geometry has a lot of other things to look after too!).

There is PhysicsModel::postInit(), which is called after init(), so we could potentially mark Coordinates as 'fixed' there? Currently PhysicsModel::postInit() is protected rather than private, so someone could in principle override it in a derived class, but looking at the method body I don't think we expect that ever to happen: it sets up the dump and restart files, and the modelMonitor.

d7919 commented 2 years ago

Yes, I think this sort of approach sounds good. In the past I've played about with caching 1/dx etc. for use in derivatives; I think this leads to a small speed-up but does increase memory use.

I guess we have to consider the fact that we support multiple meshes and coordinates, so whilst we can freeze the default mesh/coords we may not be able to do this automatically in general, and may need to provide some user-facing functionality for it. As multi-mesh is an advanced feature, I think it's ok to expect the user to be aware of such things. (I'm not an expert on this either, so maybe it's fine.)

d7919 commented 2 years ago

The conduction example doesn't use Delp2, so hopefully the reported degradation there is just jitter, as suggested by the conflicting reports.

We use zlength in a similar way in the free shiftZ function in Field3D. Most other uses seem to go through getUniform, which is then cached.

bendudson commented 2 years ago

Thanks @ZedThree for spotting and @d7919 for finding the issue! It would be nice to try and make a lot more of this machinery immutable, so that updates were performed by replacing objects and caching could be done more safely. We've talked about things like only calculating the Christoffel symbols on demand, for example.

One option might be to clear caches when the user calls mesh->geometry(), as they should already do this after normalising or changing metrics.

ZedThree commented 2 years ago

Invalidating any caches in Coordinates::geometry is probably the best idea for now. When we come to refactor Coordinates completely, we should use getters and setters for everything so we will know exactly what needs invalidating/recalculating at the correct time.


It looks like caching zlength is necessary but not sufficient. For me, it brings the next slowdown to a factor 2x, rather than factor 5x.

I added a timer around each term in blob2d, here's the results with master, next, and next + zlength caching:

| Timer name | master | next | next + zlength |
| --- | --- | --- | --- |
| run | 44.6049 | 263.635606 | 115.787194 |
| rhs | 28.11 | 214.622938 | 79.1557525 |
| invert | 10.6976 | 9.63476772 | 13.1874872 |
| io | 1.45747 | 1.586343 | 1.63256172 |
| comms | 1.22491 | 1.49394616 | 1.74851247 |
| n bracket | 4.84925 | 6.65349723 | 6.5611175 |
| n ddz | 1.08181 | 1.59610546 | 1.57796023 |
| n delp2 | 2.59303 | 93.8961676 | 23.995005 |
| n sheath | 0.15163 | 0.22712784 | 0.240881384 |
| omega bracket | 3.58667 | 5.37294135 | 5.52158618 |
| omega ddz | 1.09066 | 1.56167886 | 1.66319487 |
| omega delp2 | 2.39409 | 93.4746835 | 23.8720075 |
| omega sheath | 0.09619 | 0.13800992 | 0.155987478 |

Even with caching zlength, delp2 is an order of magnitude slower! Bracket and ddz are also significantly slower.

@d7919 Did you see a better improvement than this with caching zlength? There's a fair amount of jitter in my results here, but it definitely looks like there's something else going on too.

d7919 commented 2 years ago

@ZedThree when I used caching it seemed to pretty much perfectly recover the performance of master, although I didn't do this level of timing.

I did also have to build next as a static lib rather than shared. But that's the only other thing I changed.

ZedThree commented 2 years ago

This is how I'm building the two branches:

cmake . -B build_next_opt -GNinja \
  -DCMAKE_BUILD_TYPE=Release -DCHECK=0 \
  -DBUILD_SHARED_LIBS=on \
  -DBOUT_USE_PETSC=ON -DPETSC_DIR:PATH=/home/peter/Tools/petsc/installed/3.15.0-openmpi4 \
  -DBOUT_BUILD_EXAMPLES=on

and running like:

mpirun -np 16 ./blob2d -d delta_1 -time_report:show=true input:error_on_unused_options=false

I do only have 8 physical cores, so that is plausibly an issue?

d7919 commented 2 years ago

I'm doing something like

cmake . -B build -DCMAKE_BUILD_TYPE=Release -DCHECK=0 -DBUILD_SHARED_LIBS=off

I then run the same thing with prefix etc. set in the blob2d dir. I then run with mpirun -np 32 --bind-to core ./build/blob2d -d delta_1. This on a machine with 32 physical cores.

ZedThree commented 2 years ago

Running with mpirun -np 8 --bind-to core gives me:

| Timer name | master | next | next + zlength |
| --- | --- | --- | --- |
| run | 41.173 | 300.705235 | 44.2209333 |
| rhs | 27.6943 | 269.642543 | 30.8883797 |
| invert | 9.07449 | 8.8451217 | 8.51863951 |
| io | 0.84887 | 1.02272549 | 0.919394656 |
| comms | 0.684314 | 0.783785886 | 0.645662398 |
| n bracket | 5.09911 | 6.51896736 | 4.79567161 |
| n ddz | 1.17155 | 1.52217002 | 1.09371776 |
| n delp2 | 2.94725 | 122.334548 | 5.19087748 |
| n sheath | 0.162111 | 0.24286286 | 0.14794967 |
| omega bracket | 4.20319 | 5.61457184 | 3.95868073 |
| omega ddz | 1.19902 | 1.54012749 | 1.12180734 |
| omega delp2 | 2.82852 | 121.62549 | 5.08194543 |
| omega sheath | 0.104803 | 0.158824281 | 0.084123548 |

Much better! I'm still a bit worried that we have lost some performance relative to 4.4, but it's possible that's in the noise. It certainly now looks like <10%.

d7919 commented 2 years ago

The Delp2 time still looks somewhat worse. I notice that halving the number of cores doesn't really change the delp2 time on master.

I've not timed things carefully enough to know if I saw the same behaviour with delp2 or not.

ZedThree commented 2 years ago

Those numbers are fairly robust; the jitter is maybe +/- 2s on the total run time, so 5%? The delp2 time is outside of that, while most other times are within it. It looks like the delp2 difference can explain pretty much all of the total difference.

Yes, it looks like hyperthreading makes no difference for master but is absolutely murdering delp2 in next.