Can confirm for blob2d, but not for conduction, which runs the same (same RHS, sometimes slightly slower, sometimes slightly faster depending on the timestep).
Might it be related that the hermes-2 model run performs 99.8% of its double-precision instructions as scalar rather than SIMD?
That's an interesting observation. The BOUT_FOR loops are meant to help things vectorise, so if this isn't happening that's a good place to start looking.
Very little has changed here between master and next, so probably not that.
For blob2d it looks like the difference is coming from `Delp2`. For reference, this model uses raw `DDZ`, `bracket` with the simple method (using `VDDX`, `DDX`, `DDZ`, `VDDZ`), and `Delp2`. Without `Delp2`, master gave me 1m59s and next gave me 2m8s, so I'm considering those effectively the same here.
OK, I think I've tracked this down. `Delp2` uses `Laplacian::tridagCoefs` with the overload in which we don't specify `kwave`. In master we have `kwave = jz * 2.0 * PI / coords->zlength();` whilst in next we have `kwave = jz * 2.0 * PI / coords->zlength()(jx,jy);`. It seems that indexing into zlength has a significant penalty here. I've confirmed that replacing this in next with the value of zlength from master recovers the performance of master.
`zlength` in next returns a `Field2D`, although without 3D metrics the calculation is scalar, so an implicit cast/construction from scalar to `Field2D` must be happening. Maybe this is somewhat slow? Worth noting that this happens at the bottom of a loop over {x,y,z}, so it is done a lot. Is there a reason not to precompute/cache zlength?
Sorry, I forgot that dz is now a Field2D/3D in next, so this is always a `Field2D * BoutReal` rather than a conversion.
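As a rough, self-contained illustration of why that matters (the `Field2DLike`/`CoordsLike` types below are hypothetical stand-ins, not the real BOUT++ classes), compare calling a `zlength()`-style function per mode inside the loop with hoisting it out:

```cpp
#include <cstddef>
#include <vector>

constexpr double PI = 3.141592653589793;

// Hypothetical stand-in for a 2D field: one value per (x, y) point.
struct Field2DLike {
  std::vector<double> data;
  explicit Field2DLike(std::size_t n, double value = 0.0) : data(n, value) {}
  double& operator[](std::size_t i) { return data[i]; }
  double operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical stand-in for Coordinates, where dz is now a per-point field.
struct CoordsLike {
  Field2DLike dz;
  int nz;
  // Builds and returns a whole new field on every call (dz * nz).
  Field2DLike zlength() const {
    Field2DLike result(dz.data.size());
    for (std::size_t i = 0; i < dz.data.size(); ++i) {
      result[i] = dz[i] * nz;
    }
    return result;
  }
};

// Slow pattern: a full temporary field is constructed for every mode jz,
// even though only a single element is ever read.
double sumKwaveSlow(const CoordsLike& coords, std::size_t point, int nmodes) {
  double sum = 0.0;
  for (int jz = 0; jz < nmodes; ++jz) {
    sum += jz * 2.0 * PI / coords.zlength()[point];
  }
  return sum;
}

// Hoisted pattern: evaluate the length once, reuse the scalar in the loop.
double sumKwaveFast(const CoordsLike& coords, std::size_t point, int nmodes) {
  const double zlen = coords.zlength()[point];
  double sum = 0.0;
  for (int jz = 0; jz < nmodes; ++jz) {
    sum += jz * 2.0 * PI / zlen;
  }
  return sum;
}
```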
Hacking in the following recovers the performance of master on next (caching only implemented for the 2D metric branch).
```cpp
FieldMetric dx, dy, dz; ///< Mesh spacing in x, y and z
Field2D zlength_cache;

/// Length of the Z domain. Used for FFTs
Field2D zlength() {
  static bool visited = false;
#if BOUT_USE_METRIC_3D
  Field2D result(0., localmesh);
  BOUT_FOR_SERIAL(i, dz.getRegion("RGN_ALL")) { result[i] += dz[i]; }
  return result;
#else
  if (!visited) {
    visited = true;
    zlength_cache = dz * nz;
  }
  return zlength_cache;
#endif
}
```
The reason not to pre-compute is that dz might get normalised; always recomputing it ensures this stays consistent.
As this is a performance issue, we could either have the calling function cache the result, or just use the cached value and, with `CHECK >= 3/4`, recompute and compare.
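A minimal sketch of that second option, assuming a hypothetical cached member and using a plain `assert` in place of BOUT++'s check machinery (these names are illustrative, not the real `Coordinates` API):

```cpp
#include <cassert>
#include <cmath>

// Illustrative check level; BOUT++'s CHECK macro would play this role.
#ifndef CHECK_LEVEL
#define CHECK_LEVEL 3
#endif

// Hypothetical cut-down Coordinates with a cached Z length.
struct CoordsCacheSketch {
  double dz = 0.1;            // stand-in for the (possibly renormalised) spacing
  int nz = 64;
  double zlength_cache = 0.0;
  bool cache_valid = false;

  double zlength() {
    if (!cache_valid) {
      zlength_cache = dz * nz; // the expensive computation, done once
      cache_valid = true;
    }
#if CHECK_LEVEL >= 3
    // At high check levels, recompute and compare, so a dz that was
    // normalised after caching is caught instead of silently used.
    assert(std::fabs(zlength_cache - dz * nz) < 1e-12);
#endif
    return zlength_cache;
  }
};
```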
Thanks, that makes sense.
It sounds like making caching opt-in might be reasonable (favouring correctness over performance), but that does then require the user to be aware of it. There may be a few things we wish to cache; could we at some point mark the coordinates as "frozen/fixed" and then calculate the cached values? Apologies if this isn't possible/sensible, it has been a while since I've written a model.
I am not sure about caching being opt-in. If the user is aware of it, it should always be enabled; on the other hand, for new users it would look like a huge performance regression, which we should try to avoid.
We could calculate this after the user init function (something like `mesh->doCache()`, which in turn calls `coordinates->doCache()`) - after that I think it is very unlikely that anything will be changed, and if it is, the user has to call `doCache()` again. If we want, with `CHECK=4` we could verify that the cached values are correct. I think we should advocate somewhere to always run a model with `CHECK=3/4` for a few time steps (with the full grid?), and only then switch to `CHECK=0` for production runs.
I'd expect that the `Coordinates` should almost always be fixed after `PhysicsModel::init()` is done (if someone wants time-varying geometry, they have a lot of other things to look after too!). There is `PhysicsModel::postInit()`, which is called after `init()`, so we could potentially mark `Coordinates` as 'fixed' in there? Currently `PhysicsModel::postInit()` is `protected` rather than `private`, so someone could in principle override it in a derived class, but looking at the method body I don't think we expect that ever to happen - it sets up the dump and restart files, and the `modelMonitor`.
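A rough sketch of how marking the coordinates as fixed could look, assuming hypothetical `setFixed()`/`setDz()` methods (none of these names are existing BOUT++ interface):

```cpp
#include <stdexcept>

// Hypothetical Coordinates-like class that can be frozen once init is done.
class FreezableCoordinates {
public:
  // Mutations are only allowed while the coordinates are not yet fixed.
  void setDz(double new_dz) {
    if (fixed_) {
      throw std::logic_error("Coordinates are fixed; modify them before postInit");
    }
    dz_ = new_dz;
  }

  // Would be called from a postInit-style hook, after the user's init().
  void setFixed() {
    zlength_cache_ = dz_ * nz_; // precompute anything worth caching
    fixed_ = true;
  }

  // Cheap after freezing; falls back to recomputing before that.
  double zlength() const { return fixed_ ? zlength_cache_ : dz_ * nz_; }

private:
  double dz_ = 1.0;
  int nz_ = 64;
  double zlength_cache_ = 0.0;
  bool fixed_ = false;
};
```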
Yes, I think this sort of approach sounds good. I think I've played about in the past with caching 1/dx etc. for use in derivatives; I think this leads to a small speed up but does increase memory use.
I guess we have to consider the fact that we support multiple meshes and coordinates, so whilst we can freeze the default mesh/coords we may not be able to do this automatically in general, and may need to provide some user-facing functionality for it. As multi-mesh is an advanced feature, I think it's OK to expect the user to be aware of such things (not an expert on this either, so maybe it's fine).
The conduction example doesn't use `Delp2`, so hopefully the reported degradation there is just jitter, as suggested by the conflicting reports on it.
We use zlength in a similar way in the free `shiftZ` function in `Field3D`. Most other uses seem to go through `getUniform`, which is then cached.
Thanks @ZedThree for spotting and @d7919 for finding the issue! It would be nice to try and make a lot more of this machinery immutable, so that updates were performed by replacing objects and caching could be done more safely. We've talked about things like only calculating the Christoffel symbols on demand, for example.
One option might be to clear caches when the user calls `mesh->geometry()`, as they should already do this after normalising or changing metrics.
Invalidating any caches in `Coordinates::geometry` is probably the best idea for now. When we come to refactor `Coordinates` completely, we should use getters and setters for everything, so we will know exactly what needs invalidating/recalculating at the correct time.
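As a minimal sketch of that getter/setter idea (illustrative names only, not the real `Coordinates` interface): the setter invalidates the derived cache and the getter recomputes it lazily on next use.

```cpp
#include <optional>

// Hypothetical Coordinates-like class where every mutation goes through a
// setter, so derived caches can be invalidated at exactly the right time.
class CoordinatesWithSetters {
public:
  void setDz(double new_dz) {
    dz_ = new_dz;
    zlength_cache_.reset(); // anything derived from dz is now stale
  }

  double dz() const { return dz_; }

  double zlength() {
    if (!zlength_cache_) {
      zlength_cache_ = dz_ * nz_; // recompute only after an invalidation
    }
    return *zlength_cache_;
  }

private:
  double dz_ = 1.0;
  int nz_ = 64;
  std::optional<double> zlength_cache_;
};
```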
It looks like caching `zlength` is necessary but not sufficient. For me, it brings the next slowdown to a factor of 2x, rather than a factor of 5x.
I added a timer around each term in blob2d; here are the results with master, next, and next + zlength caching:
Timer name | master | next | next + zlength |
---|---|---|---|
run | 44.6049 | 263.635606 | 115.787194 |
rhs | 28.11 | 214.622938 | 79.1557525 |
invert | 10.6976 | 9.63476772 | 13.1874872 |
io | 1.45747 | 1.586343 | 1.63256172 |
comms | 1.22491 | 1.49394616 | 1.74851247 |
n bracket | 4.84925 | 6.65349723 | 6.5611175 |
n ddz | 1.08181 | 1.59610546 | 1.57796023 |
n delp2 | 2.59303 | 93.8961676 | 23.995005 |
n sheath | 0.15163 | 0.22712784 | 0.240881384 |
omega bracket | 3.58667 | 5.37294135 | 5.52158618 |
omega ddz | 1.09066 | 1.56167886 | 1.66319487 |
omega delp2 | 2.39409 | 93.4746835 | 23.8720075 |
omega sheath | 0.09619 | 0.13800992 | 0.155987478 |
Even with caching zlength, delp2 is an order of magnitude slower! Bracket and ddz are also significantly slower.
@d7919 Did you see better improvement than this with caching zlength? There's definitely a fair amount of jitter in my results here, but it also definitely looks like there's something else going on too.
@ZedThree when I used caching it seemed to pretty much perfectly recover the performance of master, although I didn't do this level of timing.
I did also have to build next as a static lib rather than shared. But that's the only other thing I changed.
This is how I'm building the two branches:

```sh
cmake . -B build_next_opt -GNinja \
  -DCMAKE_BUILD_TYPE=Release -DCHECK=0 \
  -DBUILD_SHARED_LIBS=on \
  -DBOUT_USE_PETSC=ON -DPETSC_DIR:PATH=/home/peter/Tools/petsc/installed/3.15.0-openmpi4 \
  -DBOUT_BUILD_EXAMPLES=on
```
and running like:

```sh
mpirun -np 16 ./blob2d -d delta_1 -time_report:show=true input:error_on_unused_options=false
```
I do only have 8 physical cores, so that is plausibly an issue?
I'm doing something like

```sh
cmake . -B build -DCMAKE_BUILD_TYPE=Release -DCHECK=0 -DBUILD_SHARED_LIBS=off
```

I then run the same thing with prefix etc. set in the blob2d directory, and run with `mpirun -np 32 --bind-to core ./build/blob2d -d delta_1`. This is on a machine with 32 physical cores.
Running with `mpirun -np 8 --bind-to core` gives me:
Timer name | master | next | next + zlength |
---|---|---|---|
run | 41.173 | 300.705235 | 44.2209333 |
rhs | 27.6943 | 269.642543 | 30.8883797 |
invert | 9.07449 | 8.8451217 | 8.51863951 |
io | 0.84887 | 1.02272549 | 0.919394656 |
comms | 0.684314 | 0.783785886 | 0.645662398 |
n bracket | 5.09911 | 6.51896736 | 4.79567161 |
n ddz | 1.17155 | 1.52217002 | 1.09371776 |
n delp2 | 2.94725 | 122.334548 | 5.19087748 |
n sheath | 0.162111 | 0.24286286 | 0.14794967 |
omega bracket | 4.20319 | 5.61457184 | 3.95868073 |
omega ddz | 1.19902 | 1.54012749 | 1.12180734 |
omega delp2 | 2.82852 | 121.62549 | 5.08194543 |
omega sheath | 0.104803 | 0.158824281 | 0.084123548 |
Much better! I'm still a bit worried that we have lost a bit of performance over 4.4, but it's possible that it's in the noise; certainly it now looks like <10%.
The `Delp2` time still looks somewhat worse. I notice that halving the number of cores doesn't really change the delp2 time on master.
I've not timed things carefully enough to know if I saw the same behaviour with delp2 or not.
Those numbers are fairly robust; the jitter is maybe +/- 2 s on the total run time, so about 5%? The `delp2` time is outside of that, while most other times are within it. It looks like the `delp2` difference can explain pretty much all of the total difference.
Yes, it looks like hyperthreading makes no difference for master but is absolutely murdering `delp2` in next.
I noticed some quite alarming performance regressions between master and next, around a factor of 5 slowdown. I've looked at the conduction and blob2d examples, built with CMake `-DCMAKE_BUILD_TYPE=Release -DCHECK=0`. Same number of iterations, but the rhs is massively slower. Probably going to need to bisect to find out when this happened.
If other people could verify this too, that would be useful.