PS: I have changed the command-line option --restore's default value to True, so I believe it should now always try to restore from the latest checkpoint?
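For reference, a minimal argparse sketch of what such a default could look like; the exact flag wiring (and the added --no-restore switch) is an assumption, not the repo's actual parser:

```python
import argparse

parser = argparse.ArgumentParser()
# --restore defaults to True so training always tries to resume from the
# latest checkpoint; --no-restore (hypothetical) starts from scratch instead.
parser.add_argument("--restore", dest="restore", action="store_true", default=True)
parser.add_argument("--no-restore", dest="restore", action="store_false")
args = parser.parse_args()

if args.restore:
    # attempt to restore from the latest checkpoint before training begins
    pass
```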
OK, I finally found the annoying bug: in train.py, after you initialize the saver but before you pass it into the train_epoch function, restore it from the checkpoint file you already have. If you don't, the code will first save the blank (freshly initialized) model as a new checkpoint, overwriting the previous (latest) one before it ever gets restored. Also, training in the mid to late stages can freeze (CPU and GPU utilization drop to 0%). To prevent this, reset the graph with tf.reset_default_graph(); this clears the graph without affecting the actual checkpoint files.
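For anyone hitting the same issue, here is a minimal sketch of that order of operations, assuming TF1-style sessions as in the repo; checkpoint_dir and the dummy global_step variable are placeholders, and train_epoch stands in for the repo's actual training function:

```python
import tensorflow as tf

# Clearing any stale graph also helps avoid the mid/late-training freeze mentioned above.
tf.reset_default_graph()

# ... build the model graph here (dummy variable so the Saver has something to track) ...
global_step = tf.Variable(0, trainable=False, name="global_step")

saver = tf.train.Saver(max_to_keep=5)
checkpoint_dir = "./checkpoint"  # placeholder path

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Restore BEFORE any saver.save() call; otherwise a fresh (blank) checkpoint
    # overwrites the latest one and training silently starts from scratch.
    latest_ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    if latest_ckpt is not None:
        saver.restore(sess, latest_ckpt)
        print("Restored from", latest_ckpt)

    # Only now hand the session/saver to the training loop.
    # train_epoch(sess, saver, ...)
```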
I tried running the code for BigGAN, and the generated images looked good. I also checked the checkpoint folder and found that it stores a model checkpoint and a temp checkpoint. The problem is that although we have all these checkpoints, which should allow training to resume after restarting the program, in reality it always starts training from scratch, as if the checkpoints never existed. I do see code for saver.save(model) and saver.save(temp), and I also see something like saver.restore(), which in theory should restore from the last training epoch. It did print a bunch of "loading weights success ..." messages when I restarted training, but the training actually starts from scratch... Does anyone have any idea why this happens? Training itself works fine, and I can see in the code that both a tmp checkpoint and the model checkpoint are saved; the checkpoint folder really does contain the ckpt files. But every time I restart training, even though it shows that a checkpoint is being used/restored, the actual result is always training from scratch... Has anyone else run into the same problem?