PPPLDeepLearning / plasma-python

PPPL deep learning disruption prediction package
http://tigress-web.princeton.edu/~alexeys/docs-web/html/

Batch iterator/stability of tests on Titan at very large number of GPUs #9

Closed ASvyatkovskiy closed 7 years ago

ASvyatkovskiy commented 7 years ago

This PR summarizes some work performed during Julian's visit.

  1. A set of modifications to the MPIModel class regarding the batch iterator. Our batch iterator is infinite - it contains a `while True` statement. Epochs are typically stopped when the global step variable (num_so_far) reaches the sample size. The modifications ensure the batch iterator is not reset at the end of an epoch; in short, the batch iterator is moved outside the epoch loop, and we keep a copy of the batch iterator object as a data member of the MPIModel class (in addition to the batch generator function):
    class MPIModel():
        def __init__(self,model,optimizer,comm,batch_iterator,batch_size,num_replicas=None,warmup_steps=1000,lr=0.01):
            ...
            self.batch_iterator_func = batch_iterator()
            self.batch_iterator = batch_iterator

As a result, num_so_far needs to be epoch-aware:

num_so_far ---> num_so_far - self.epoch*num_total
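To illustrate the idea, here is a minimal, self-contained sketch (with hypothetical names and toy data, not the actual plasma-python code) of an infinite batch iterator living outside the epoch loop, with epoch boundaries detected by the epoch-aware sample count:

```python
import numpy as np

def make_batch_iterator(X, y, batch_size):
    # Returns a generator function, mirroring the pattern of keeping both
    # the generator function and an instantiated iterator around.
    def batch_iterator():
        num_total = len(X)
        idx = 0
        while True:  # infinite: never reset between epochs
            stop = min(idx + batch_size, num_total)
            yield X[idx:stop], y[idx:stop]
            idx = stop % num_total  # wrap around at the end of the data
    return batch_iterator

# Toy data: 10 samples, batch size 4.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
it = make_batch_iterator(X, y, batch_size=4)()  # instantiate once, outside the epoch loop

num_total = len(X)
num_so_far = 0  # grows monotonically across epochs
epoch = 0
while epoch < 2:
    # Epoch-aware count: subtract epoch*num_total to get samples
    # consumed within the current epoch only.
    while (num_so_far - epoch * num_total) < num_total:
        xb, yb = next(it)
        num_so_far += len(xb)
    epoch += 1

print(num_so_far)  # 20: two full passes over the 10 samples
```

Because the iterator is never recreated, the partial position reached at the end of one epoch carries over into the next, which is the behavior the modification above preserves.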
  2. A set of modifications in the mpi_runner to improve the stability of tests on Titan with a very large number of GPUs - in the regime of data starvation (resulting in very few mini-batches per epoch):
    while (num_so_far-self.epoch*num_total) < num_total or num_batches_current < num_batches_minimum:
        # train the epoch
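A toy sketch of how this guard behaves (the constants here are made up for illustration; num_batches_minimum is the assumed tuning knob): when many GPUs split the data, each replica may see only one or two mini-batches per epoch, and the extra condition keeps the epoch loop running until a minimum batch count is reached.

```python
# Assumed toy values: a replica that sees only 8 samples per epoch.
num_total = 8
batch_size = 4
num_batches_minimum = 5  # force at least 5 mini-batches per "epoch"

num_so_far = 0
epoch = 0
num_batches_current = 0
# Same condition as in the mpi_runner change: keep training while the
# epoch-aware sample count is below num_total OR too few batches ran.
while (num_so_far - epoch * num_total) < num_total or num_batches_current < num_batches_minimum:
    # train on one mini-batch (stubbed out here)
    num_so_far += batch_size
    num_batches_current += 1

print(num_batches_current)  # 5: the minimum wins over the 2 batches that cover the data
```

Without the second condition the loop would exit after 2 batches, which in the data-starved regime leaves too few gradient updates per epoch for stable training.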