Open liubo0902 opened 5 years ago
There is a tiny bug in the function validation() within train.py. save_checkpoint() should only be called when save_to_disk is True.
True. This is a bug when using distributed computing: simultaneous writes from multiple processes corrupt the checkpoint file. The fix is as you suggested: save the checkpoint only on rank 0.
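For anyone hitting the same problem, here is a minimal sketch of the guard, not the actual code from train.py. The save_checkpoint() body below is a hypothetical stand-in for the repo's function, and maybe_save_checkpoint() and its rank argument are names assumed for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Hypothetical stand-in for the repo's save_checkpoint()."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def maybe_save_checkpoint(state, path, rank):
    """Write the checkpoint only from the rank-0 process.

    `rank` is assumed to be this process's index in the distributed job.
    Returns True if the checkpoint was written, False otherwise.
    """
    save_to_disk = rank == 0
    if save_to_disk:
        save_checkpoint(state, path)
    return save_to_disk


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = os.path.join(tmp, "model.pkl")
        # Simulate four processes finishing validation at the same step.
        for rank in range(4):
            written = maybe_save_checkpoint({"epoch": 1}, ckpt, rank)
            print(f"rank {rank}: wrote checkpoint = {written}")
```

In a PyTorch job the rank would typically come from torch.distributed.get_rank() once the process group is initialized. How save_to_disk is actually computed in train.py is not shown in this thread.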