PPPLDeepLearning / plasma-python

PPPL deep learning disruption prediction package
http://tigress-web.princeton.edu/~alexeys/docs-web/html/

ETA calculation is inaccurate #55

Open felker opened 4 years ago

felker commented 4 years ago

Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs on Traverse):

[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]

The ETA in this example is clearly inaccurate (each epoch takes roughly 60 s). Specifically, there are two distinct issues:

  1. The ETA computed in the first step of any epoch is always inaccurate.
  2. For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically.

First step

For the first epoch in a given session, it reports a huge ETA: MPI_Model.num_so_far is still zero, so a work_so_far of 0 is passed to https://github.com/PPPLDeepLearning/plasma-python/blob/c82ba61e339882a5af10b1052edc0348e16119f4/plasma/models/mpi_runner.py#L613-L616 and the extrapolated total_time explodes.
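For reference, this is the usual failure mode of a linear-extrapolation ETA. The sketch below is not the actual mpi_runner.py code (the function name and signature are illustrative); it just reproduces how a fraction-done extrapolation blows up when work_so_far is at or near zero:

```python
def estimate_eta(t_elapsed, work_so_far, total_work):
    """Illustrative linear-extrapolation ETA (not the FRNN implementation).

    total_time is estimated as elapsed time divided by the fraction of
    work completed, so a near-zero work_so_far yields an enormous ETA
    (e.g. the ~4.7e8 s in the log above), and exactly zero would divide
    by zero.
    """
    fraction_done = work_so_far / total_work
    if fraction_done <= 0:
        return float("inf")  # first step: no basis for an estimate yet
    total_time = t_elapsed / fraction_done
    return total_time - t_elapsed
```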

For the first step of later epochs within a session, it reports a minuscule ETA, since num_so_far carries over from earlier epochs and already exceeds the per-epoch total (note the [1819.00/1789] counter below):

step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
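One way to address both first-step symptoms would be to track progress per epoch rather than per session. The helper below is a hypothetical sketch (EpochETA is not part of plasma-python); it resets at each epoch boundary the kind of counter that MPI_Model.num_so_far currently carries across epochs:

```python
import time


class EpochETA:
    """Hypothetical per-epoch ETA tracker (not part of plasma-python).

    Resetting at every epoch boundary keeps work_done within
    [0, total_work], avoiding both the huge session-start ETA and the
    minuscule later-epoch one seen above.
    """

    def __init__(self, total_work):
        self.total_work = total_work
        self.reset()

    def reset(self):
        # Call at the start of every epoch, not once per session.
        self.work_done = 0.0
        self.t_start = time.time()

    def update(self, examples_this_step):
        self.work_done += examples_this_step

    def eta(self):
        if self.work_done <= 0:
            return float("nan")  # first step: nothing to extrapolate yet
        elapsed = time.time() - self.t_start
        return elapsed * (self.total_work - self.work_done) / self.work_done
```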

Later steps in later epochs

E.g., here are the per-step ETAs for one later epoch (a possible smoothing fix is sketched after the list):


ETA: 0.55s
ETA: 22.14s
ETA: 27.98s
ETA: 31.63s
ETA: 35.88s
ETA: 38.45s
ETA: 34.89s
ETA: 36.21s
ETA: 35.35s
ETA: 35.56s
ETA: 36.04s
ETA: 35.88s
ETA: 35.33s
ETA: 34.49s
ETA: 34.73s
ETA: 34.29s
ETA: 34.13s
ETA: 33.51s
ETA: 33.16s
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45s
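The rise-then-fall pattern (ending negative) is consistent with an estimate built on session-cumulative progress and a running average of batch time, so the slow startup step keeps inflating it until enough fast batches dilute it. As an assumption rather than the package's method, an exponentially weighted per-batch time would converge within a few steps, and it cannot go negative, since it multiplies a positive smoothed batch time by the number of steps remaining:

```python
import time


class SmoothedETA:
    """Hypothetical ETA from an exponential moving average of batch time.

    Recent batches dominate (alpha sets the memory), so one slow startup
    step stops skewing the estimate after a few iterations, and the
    result is >= 0 by construction.
    """

    def __init__(self, total_steps, alpha=0.1):
        self.total_steps = total_steps
        self.alpha = alpha
        self.ema_sec_per_batch = None
        self.last_t = time.time()
        self.step = 0

    def tick(self):
        """Call once per completed step; returns the ETA in seconds."""
        now = time.time()
        dt = now - self.last_t
        self.last_t = now
        self.step += 1
        if self.ema_sec_per_batch is None:
            self.ema_sec_per_batch = dt
        else:
            self.ema_sec_per_batch = (
                self.alpha * dt + (1.0 - self.alpha) * self.ema_sec_per_batch
            )
        return self.ema_sec_per_batch * max(self.total_steps - self.step, 0)
```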