Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots

Observed on both TigerGPU and Traverse, on master before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs.

The premise of this issue is that:

The number of steps (iterations) per training epoch should be roughly constant across all epochs.

However, I am not entirely sure that this premise is correct. See below section.

Mini-batches are created by distributing the trimmed and resampled shot signals into chunks of LSTM length, typically length=128 ms when dt=0.001; this is the horizontal dimension of a min-batch.

The other, vertical dimension of a mini-batch is the local batch_size. Ideally, each shot is uniquely "owned" by a single GPU (or model replica) for nsteps = nchunks, which depends on the particular pulse length. This varies by 1-2 orders of magnitude, with the minimum shot length = 2*length + T_min_warning = 280 ms, typically. Ignoring any nuanced trimming of the processed shots (i.e. I floored the division into chunks):

d3d_training_pulse_length_histogram

[ ] Double check the trimming of resampled shots in order to have an integer number of chunks. Is it trimmed only at the beginning of the shot? How does conf['training']['paths'] = True affect this?

From the Methods appendix of the Nature paper:

... Because there is a persistent internal state between successive chunks in time, it is not possible to use more than one chunk from a given shot in a given mini-batch (chunks that are successive in the shot must also be presented to the RNN in successive mini-batches during training such that the internal state can persist correctly).

To train batchwise with a batch size of M, we need M independent (that is, stemming from different shots) time slices of equal length to feed to the GPU.

However, if effective_batch_size = N_GPU*batch_size is greater than the number of training shots (1734 shots, which is easy to exceed with 4 GPUs and batch_size=512, e.g.), then each step must involve some shots appearing twice in the overall batch. Even for smaller effective batch sizes, the batch generator must backfill the mini-batches with repeats at later steps in the epoch as the longer pulses require many more steps to process all the chunks in the shot.

[ ] Double check how open batch-indices are handled near the end of an epoch.

For a recent experiment with 1 GPU, batch_size=256, D3D 0D training, the final step of the first epoch is written to stdout as:

step: 143 [ETA: -0.42s] [1794.00/1789], loss: 1.51778 [1.34746] | walltime: 167.4610 | 4.66E+02 Examples/sec | 5.49E-01 sec/batch [96.4% calc., 3.6% sync.][batch = 256 = 256*1] [lr = 2.00E-05 = 2.00E-05*1]

In this example, 1794.00 is MPIModel.num_so_far, which is always printed out with fractional precision, but never shows anything other than integer values. Note, 1789 shots is more than in the original D3D training set due to a change in signals.py that I was messing around with.

[ ] This might be because the value actually is incremented by 1 when a shot's first chunk initially appears in a mini-batch index. If so, either remove the fractional precision in the output, or modify the variable so that it accurately computes num_chunks_so_far/num_chunks_total_this_shot.

By searching the stdout of FRNN with grep -nriB 1 "seconds" output.txt, I observe 143, 151, 264, 263, and 266 steps for the first 5 epochs.

For 1 GPU and batch_size=128: 416, 529, 529, 529, 531 steps.
For 1 GPU and batch_size=128 (restarted/loaded epoch 1 weights): 416, 529, 529, 528, 529, 531.
For 4 GPU and batch_size=128: 74, 75, 133, 132 steps In other words, for the default PRNG seed, the initial epochs within a training session shuffle mini-batches far closer to the optimal schedule than the later epochs. See below analysis.

This variation had not really been noticed in earlier training by @jnkh nor @ge-dong, since conf['training']['num_batches_minimum'] = 200 in their tests (as opposed to the default value of 20 in the repository's conf.yaml), which is much larger than the typical number of required steps for an epoch of 128ms chunks of our original D3D dataset and effective_batch_size=512

[ ] Rename num_batches_minimum to num_steps_minimum?

It is unclear if the above variable-step phenomenon was happening on older versions of FRNN and was masked by this parameter. However, I did confirm that the code has always been printing out .00 values for MPIModel.num_so_far.

I am not sure if this phenomenon has affected training accuracy at all.

Multiprocessor scheduling problem

The random shuffling of shots loaded into mini-batches is effectively the List Scheduling algorithm applied to a shuffled list of jobs (shots) j_i of variable sizes = nchunks_i. Each batch index in effective_batch_size is an independent, identical "worker". The multiprocessor scheduling problem seeks to find the optimal schedule of j_i assigned to each of the m workers in order to minimize the makespan = the earliest time at which all jobs are completed. Here, we have no inter-job precedence/dependencies, nor individual worker constraints. Still, this problem is strongly NP-hard, since the decision problem variant ("Does a feasible schedule S exist that satisfies f(S)<= k?" for a given threshold k) is NP-complete.

In this particular heuristic algorithm for the scheduling problem, each next job (shot) is assigned to the worker (batch index) which becomes empty soonest, given some arbitrary ordered input (training buffer). In the worst-case for the List Scheduling algorithm, the longest shot is loaded into a mini-batch last, and the makespan is maximized. Hence, it returns a makespan which within a factor of 2 - 1/m of the optimal value.

By contrast, the Longest Processing Time First Rule (LPT) algorithm first sorts the jobs according to non-increasing processing time (largest nchunks to smallest) and returns a makespan within a factor of 4/3-1/(3*m) of the optimal makespan.

Note, we are not trying to minimize the makespan or find the most efficient mini-batching strategy in FRNN, since we rely on the random shuffling to stabilize training. However, this analysis applied to our D3D chunks can give us some expectation on how much variability in steps/epoch is normal.

Here, I apply both algorithms to the D3D training set:

import numpy as np
import plasma
from plasma.primitives.shots import ShotList
import time
import sys
np.set_printoptions(threshold=sys.maxsize)

shot_list_path = '../processed_shotlists/d3d_0D/shot_lists_signal_group_250640798211266795112500621861190558178.npz'
data = np.load(shot_list_path, allow_pickle=True)
shot_list_train = data['shot_list_train'][()]
shot_list_train = ShotList(shot_list_train)
prepath='/Users/felker/processed_shots/signal_group_250640798211266795112500621861190558178/'
shot_list_train.shots[0].restore(prepath=prepath, light=False)

T_RNN = 128   # "length of LSTM" = truncated BPPT chunk
nchunks = timesteps/T_RNN
effective_batch_size = 1*128

# LPT
nchunks_sorted = np.sort(np.floor(nchunks))[::-1]
loads = np.zeros((effective_batch_size,))
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_sorted:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("LPT makespan = {}".format(loads.max()))
print("Optimal job schedule (makespan) >= {}".format(loads.max()/(4.0/3.0) - 1.0/(3.0*effective_batch_size)))

# List scheduling 
np.random.seed(int(time.time()))
nchunks_shuffled = np.floor(nchunks)
np.random.shuffle(nchunks_shuffled)
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_shuffled:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("Random List Scheduling makespan = {}".format(loads.max()))
print("Optimal job schedule (makespan) >= {}".format(loads.max()/(2.0 - 1.0/effective_batch_size)))

For effective_batch_size = 512:

LPT makespan = 132.0
Optimal job schedule (makespan) >= 98.99934895833333
Random List Scheduling makespan = 168.0
Optimal job schedule (makespan) >= 84.08211143695014

For effective_batch_size = 256:

LPT makespan = 256.0
Optimal job schedule (makespan) >= 191.99869791666666
Random List Scheduling makespan = 290.0
Optimal job schedule (makespan) >= 145.28375733855185

For effective_batch_size = 128:

LPT makespan = 505.0
Optimal job schedule (makespan) >= 378.7473958333333
Random List Scheduling makespan = 529.0
Optimal job schedule (makespan) >= 265.5372549019608

The latter two cases are in line with my observations (although these were computed from a slightly different training set; see above comment about changes to signals.py on Traverse). Therefore, this variability of nsteps/epoch might be expected, and not a bug.

PPPLDeepLearning / plasma-python

Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots #63

Multiprocessor scheduling problem