Open nrnhines opened 3 years ago
In lieu of directly calculating computation time used by each cell, it may be sufficient to know the number of instances and computation time of each mechanism (nrn_cur and nrn_solve) in use, as well as number of compartments and setup, gaussian elimination, update, and nonvint times. From that data, it my be possible to calculate a reasonable value for how much time a given distribution of cells will take.
This makes a lot of sense. It makes sense to start calculating load-imbalance. I would say it would be relatively low-overhead to add a few timers tracking the runtime of these compute routines and printing the load-(im)balance at the end.
There was some discussion on slack with Ivan and @pramodk about load balance for Ivan's full dentate gyrus model and Pramod showed some calliper profiling timings for 1000msec with coreneuron on 64 nodes of BB5 (each node has 40 CascadeLake cores and 384 GB DRAM). In part:
Path Min time/rank Max time/rank Avg time/rank Time %
simulation 8905.028183 8905.028699 8905.028382 95.284047
spike-exchange 364.348209 1557.676357 1157.063521 12.380611
spike-exchange 350.917333 1544.334754 1143.548880 12.236004
communication 41.933284 44.393275 43.104909 0.461224
imbalance 307.616220 1501.729722 1100.204786 11.772221
timestep 7346.060516 8540.169498 7747.328642 82.896628
I noted (barrier refers to spike-exchange imbalance):
Sorry I didn't notice that information was present. Looks like timestep and barrier are consistent since min,max timestep + max,min barrier
oc>7346.060516 + 1501.729722 8847.7902 oc>8540.169498 + 307.616220 8847.7857
But I do have to admit to some conceptual confusion on one point. Does the min barrier of 307 imply that that is the contribution of dynamic imbalance (since if one rank was always the slowest in terms of timestep, it would always have a barrier time of 0)?
A precise answer to that last question is still unclear to me :)
Also, there were detailed caliper timings for each mechanism, e.g.:
state-sKDR_Aradi 83.681012 181.048304 133.095886 1.424130
state-Na_PR 0.262366 0.849567 0.626852 0.006707
...
cur-sKDR_Aradi 219.977388 295.570776 248.885360 2.663080
cur-Na_PR 0.425234 0.662382 0.549710 0.005882
...
And it seemed to me that combining this with the instance counts of each mechanism would allow calculation (back in NEURON) of a pretty good proxy for calculating cell complexity and therefore an LPT (least processing time) cell distribution on threads and ranks. In particular because with a range of (instance count, time) for min, max, and average for each mechanism, one might have not just a time per instance but a function f(i) of time/per i instances that could take into account some otherwise vexing memory latency/bandwidth effects. (Speculative but possibly a nice paper there :)
The actual measure of load balance during simulation is an important statistic to gauge whether performance might be improved by better distribution of cells on mpi ranks. During a simulation, CoreNEURON should keep track of computation time used by each rank (as opposed to spike exchange time) and indicate the average and maximum computation time.
loadbalance = average/maximum
with a value of 1.0 being ideal.If possible, it would be very useful to determine the computation time used by each cell, since, with that data, one may be able to use the lpt (least processing time) algorithm to decide on a distribution of cells on ranks better than the default round-robin distribution and use the lpt distribution for subsequent simulation runs.