The usual, expected behaviour is that when training has finished, all resources (the GPU) are freed and the Docker container stops running. This does happen in most cases.
However, sometimes (randomly?) the resources are not freed. My rough guess is that this happens in about 1 in 5 runs, though I'm not sure. This is a potential problem for anyone using "pay-as-you-go" computing resources to train their model.
A recent example: I was running training with a modified lm_resnet_quantize_cifar10.py config file on the ilsvrc_2012 dataset, using the dataset_iterator with multigpu and prefetch (I'm not sure whether any of these are relevant to the issue).
An example of the output of a training run that failed to free resources is:
and an example of the output of a training run that did free resources successfully is:
Comparing the two, the first has only three `Done` printouts, while the second has four. That line of code is at https://github.com/blue-oil/blueoil/blob/002f5408d11fd2a93457c5e7bc8dc876ace13ce0/blueoil/cmd/train.py#L283, which comes after the final progbar update. So it seems likely that the problem is that one of the DatasetIterators did not close.
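To help narrow this down, something like the following might make the failure more visible. This is only a minimal sketch of the kind of diagnostic I have in mind, not blueoil's actual code; the `close_iterators` helper and the iterator names are hypothetical, and I'm only assuming each DatasetIterator exposes a `close()` method as the train.py code suggests.

```python
# Hypothetical diagnostic sketch (not the actual blueoil implementation):
# close each dataset iterator individually and log before/after, so the
# logs identify which iterator hangs instead of the run silently keeping
# the GPU and container alive.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def close_iterators(iterators):
    """Attempt to close every dataset iterator, logging each step.

    `iterators` is assumed to be a dict mapping a descriptive name
    (e.g. "train", "validation") to an object with a close() method.
    """
    for name, iterator in iterators.items():
        logger.info("Closing iterator %s ...", name)
        try:
            iterator.close()
        except Exception:
            # An exception here would at least show up in the logs rather
            # than leaving the container running with no explanation.
            logger.exception("Failed to close iterator %s", name)
        else:
            logger.info("Done (%s)", name)
```

With per-iterator logging like this, a run that fails to free resources would show exactly which iterator's close never completed, instead of just a missing `Done` line.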