blue-oil / blueoil

Bring Deep Learning to small devices
https://blueoil.org
Apache License 2.0

Resources sometimes not released after training #1227

Open joelN123 opened 4 years ago

joelN123 commented 4 years ago

The usual, expected behaviour is that when training finishes, all resources (GPU) are freed and the docker container stops running. This does happen in most cases.

However, sometimes (seemingly at random) the resources are not freed. I would guess this happens in roughly 1 in 5 runs, though I'm not sure. This is a potential problem for anyone using "pay-as-you-go" computing resources to train their model.

A recent example: I was running training with a modified lm_resnet_quantize_cifar10.py config file on the ilsvrc_2012 dataset, using the dataset_iterator with multigpu and prefetch (I'm not sure whether any of these are relevant to the issue); see the sketch below for how a prefetch worker can keep a process alive.
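As a rough illustration (this is not blueoil's actual implementation; the class and function names here are hypothetical), a prefetching iterator usually runs a background worker process that fills a queue while training consumes from it. If that worker is never terminated and joined, the Python interpreter, and therefore the container, cannot exit:

```python
import multiprocessing


def _fill_queue(dataset, queue):
    # Runs in the worker process; put() blocks while the queue is full.
    for batch in dataset:
        queue.put(batch)


class PrefetchIterator:
    """Illustrative prefetch wrapper around any iterable dataset."""

    def __init__(self, dataset, queue_size=8):
        self._queue = multiprocessing.Queue(maxsize=queue_size)
        self._worker = multiprocessing.Process(
            target=_fill_queue, args=(dataset, self._queue))
        self._worker.start()

    def __iter__(self):
        return self

    def __next__(self):
        return self._queue.get()

    def close(self):
        # Without this, the non-daemon worker outlives training and
        # keeps the process (and any GPU context it holds) alive.
        self._worker.terminate()
        self._worker.join()
```

If training exits without calling close(), the worker keeps the container running, which would match the symptom here.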

An example of the output of a training run that failed to free resources:

[1,1]<stderr>:  warnings.warn(str(msg))
[1,0]<stdout>:1600000/1600000 [==============================] - 122832s 77ms/step
[1,3]<stdout>:break
[1,2]<stdout>:break
[1,1]<stdout>:break
[1,3]<stdout>:Done
[1,2]<stdout>:Done
[1,2]<stdout>:Next step: blueoil convert -e my_model -p save.ckpt-1600000
[1,3]<stdout>:Next step: blueoil convert -e my_model -p save.ckpt-1600000
[1,1]<stdout>:Done
[1,1]<stdout>:Next step: blueoil convert -e my_model -p save.ckpt-1600000

and an example of the output of a training run that did free resources successfully:

[1,3]<stderr>:  warnings.warn(str(msg))
[1,0]<stdout>:1599999/1600000 [============================>.] - ETA: 0s[1,2]<stdout>:break
[1,0]<stdout>:1600000/1600000 [==============================] - 131360s 82ms/step
[1,2]<stdout>:Done
[1,0]<stdout>:break
[1,2]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000
[1,0]<stdout>:Done
[1,0]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000
[1,3]<stdout>:break
[1,1]<stdout>:break
[1,3]<stdout>:Done
[1,1]<stdout>:Done
[1,3]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000
[1,1]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000

Comparing the two, the first run has only three `Done` printouts (rank [1,0] never prints one), while the second has four. That line is printed at https://github.com/blue-oil/blueoil/blob/002f5408d11fd2a93457c5e7bc8dc876ace13ce0/blueoil/cmd/train.py#L283, which is after the final progbar update. So it seems likely that one of the DatasetIterators did not close, leaving its process alive; a possible defensive workaround is sketched below.
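If that is the cause, one workaround (a hypothetical sketch; `run_training`, `train_loop`, and `iterator` are placeholder names, not blueoil's actual API) would be to guarantee the iterator is closed no matter how training exits:

```python
import atexit


def run_training(iterator, train_loop):
    # Belt and braces: atexit covers exit paths that skip the finally
    # block; assumes iterator.close() is safe to call more than once.
    atexit.register(iterator.close)
    try:
        train_loop(iterator)
    finally:
        iterator.close()
```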