BlueBrain / CoreNeuron

Simulator optimized for large scale neural network simulations.
BSD 3-Clause "New" or "Revised" License
136 stars 39 forks source link

CoreNEURON calculation of load balance for mpi simulation. #676

Open nrnhines opened 3 years ago

nrnhines commented 3 years ago

The actual measure of load balance during simulation is an important statistic to gauge whether performance might be improved by better distribution of cells on mpi ranks. During a simulation, CoreNEURON should keep track of computation time used by each rank (as opposed to spike exchange time) and indicate the average and maximum computation time. loadbalance = average/maximum with a value of 1.0 being ideal.

If possible, it would be very useful to determine the computation time used by each cell, since, with that data, one may be able to use the lpt (least processing time) algorithm to decide on a distribution of cells on ranks better than the default round-robin distribution and use the lpt distribution for subsequent simulation runs.

nrnhines commented 3 years ago

In lieu of directly calculating computation time used by each cell, it may be sufficient to know the number of instances and computation time of each mechanism (nrn_cur and nrn_solve) in use, as well as number of compartments and setup, gaussian elimination, update, and nonvint times. From that data, it my be possible to calculate a reasonable value for how much time a given distribution of cells will take.

ohm314 commented 3 years ago

This makes a lot of sense. It makes sense to start calculating load-imbalance. I would say it would be relatively low-overhead to add a few timers tracking the runtime of these compute routines and printing the load-(im)balance at the end.

nrnhines commented 3 years ago

There was some discussion on slack with Ivan and @pramodk about load balance for Ivan's full dentate gyrus model and Pramod showed some calliper profiling timings for 1000msec with coreneuron on 64 nodes of BB5 (each node has 40 CascadeLake cores and 384 GB DRAM). In part:

Path                                   Min time/rank  Max time/rank   Avg time/rank Time %
    simulation                           8905.028183   8905.028699   8905.028382 95.284047
      spike-exchange                      364.348209   1557.676357   1157.063521 12.380611
        spike-exchange                    350.917333   1544.334754   1143.548880 12.236004
          communication                    41.933284     44.393275     43.104909  0.461224
          imbalance                       307.616220   1501.729722   1100.204786 11.772221
      timestep                           7346.060516   8540.169498   7747.328642 82.896628

I noted (barrier refers to spike-exchange imbalance):

Sorry I didn't notice that information was present. Looks like timestep and barrier are consistent since min,max timestep + max,min barrier

oc>7346.060516 + 1501.729722
8847.7902 
oc>8540.169498 + 307.616220
8847.7857

But I do have to admit to some conceptual confusion on one point. Does the min barrier of 307 imply that that is the contribution of dynamic imbalance (since if one rank was always the slowest in terms of timestep, it would always have a barrier time of 0)?

A precise answer to that last question is still unclear to me :)

Also, there were detailed caliper timings for each mechanism, e.g.:

          state-sKDR_Aradi                 83.681012    181.048304    133.095886  1.424130
          state-Na_PR                       0.262366      0.849567      0.626852  0.006707
 ...
          cur-sKDR_Aradi                  219.977388    295.570776    248.885360  2.663080
          cur-Na_PR                         0.425234      0.662382      0.549710  0.005882
...

And it seemed to me that combining this with the instance counts of each mechanism would allow calculation (back in NEURON) of a pretty good proxy for calculating cell complexity and therefore an LPT (least processing time) cell distribution on threads and ranks. In particular because with a range of (instance count, time) for min, max, and average for each mechanism, one might have not just a time per instance but a function f(i) of time/per i instances that could take into account some otherwise vexing memory latency/bandwidth effects. (Speculative but possibly a nice paper there :)