Open felker opened 4 years ago
Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):
[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]
The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:
For the first epoch in a given session, the first step reports a huge ETA because MPI_Model.num_so_far is zero, so a work_so_far of 0 is passed to https://github.com/PPPLDeepLearning/plasma-python/blob/c82ba61e339882a5af10b1052edc0348e16119f4/plasma/models/mpi_runner.py#L613-L616, which causes total_time to explode.
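The linked lines are not reproduced here, so the following is only a minimal sketch of the failure mode. It assumes the ETA is a linear extrapolation of elapsed time over the fraction of work done, with a small epsilon guarding the division. The function names, the eps parameter, and the guarded variant are hypothetical and are not taken from mpi_runner.py.

```python
def estimate_eta(elapsed_s, work_so_far, total_work, eps=1e-12):
    """Naive linear extrapolation (assumed form, not the FRNN code)."""
    fraction_done = work_so_far / total_work
    # With work_so_far == 0 the denominator collapses to eps,
    # so total_time becomes astronomically large.
    total_time = elapsed_s / max(fraction_done, eps)
    return total_time - elapsed_s


def estimate_eta_guarded(elapsed_s, work_so_far, total_work):
    """Hypothetical variant: report no ETA until some work is recorded."""
    if work_so_far <= 0 or total_work <= 0:
        return None
    fraction_done = work_so_far / total_work
    return elapsed_s / fraction_done - elapsed_s


if __name__ == "__main__":
    # First step of a session: nothing processed yet.
    print(estimate_eta(5.7374, 0, 1789))          # very large number
    print(estimate_eta_guarded(5.7374, 0, 1789))  # None
    # Halfway through: remaining time equals elapsed time.
    print(estimate_eta_guarded(30.0, 894.5, 1789))
```

Returning None and printing a placeholder for the first step is one possible design choice. Whether it suits FRNN depends on how the ETA string is formatted downstream.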
For later epochs within a session, it gives a minuscule ETA:
step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
For example, here are the per-step ETAs reported over the course of one later epoch (note that the final value is negative):
ETA: 0.55s ETA: 22.14 ETA: 27.98 ETA: 31.63 ETA: 35.88 ETA: 38.45 ETA: 34.89 ETA: 36.21 ETA: 35.35 ETA: 35.56 ETA: 36.04 ETA: 35.88 ETA: 35.33 ETA: 34.49 ETA: 34.73 ETA: 34.29 ETA: 34.13 ETA: 33.51 ETA: 33.16 … ETA: 1.35s ETA: 1.06s ETA: 0.67s ETA: 0.11s ETA: -0.45
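To make the anomalies in a sequence like this easier to spot, the following sketch parses ETA values out of a log string and flags two symptoms visible above: negative values, and values that rise from one step to the next. The regular expression, the function names, and the choice of checks are assumptions made for illustration and are not part of FRNN. The sample string is a shortened subset of the values quoted above.

```python
import re

# Matches "ETA: 0.55s", "ETA: 22.14", "ETA: -0.45" (trailing "s" optional).
ETA_PATTERN = re.compile(r"ETA:\s*(-?\d+(?:\.\d+)?)s?")


def parse_etas(log_text):
    """Extract ETA values (seconds) in the order they appear."""
    return [float(m) for m in ETA_PATTERN.findall(log_text)]


def suspicious_etas(values):
    """Return (index, value, reason) for each value that looks wrong."""
    flagged = []
    for i, v in enumerate(values):
        if v < 0:
            flagged.append((i, v, "negative"))
        elif i > 0 and v > values[i - 1]:
            # Within one epoch a sane ETA should not grow step over step.
            flagged.append((i, v, "increased"))
    return flagged


if __name__ == "__main__":
    sample = ("ETA: 0.55s ETA: 22.14 ETA: 27.98 "
              "ETA: 1.35s ETA: 0.11s ETA: -0.45")
    for index, value, reason in suspicious_etas(parse_etas(sample)):
        print(f"step {index}: ETA {value} ({reason})")
```

The "increased" check is deliberately strict, so small fluctuations mid-epoch would also be flagged. A tolerance could be added if only the jump after the first step is of interest.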