zzaibi opened this issue 6 years ago
It trains forever. It saves periodically, and you can kill it whenever you think it's "done" based on the loss level.
The problem is, it only saves a very small number of checkpoints. I can't go back, right?
By default, it checkpoints every 2000 steps, which I think is pretty infrequent. You can change steps_per_checkpoint in models/config.json to a lower number to make it save more often. Since it saves a checkpoint and prints loss information at the same time, I'd wait until you see new loss information before killing it to minimize the amount of lost training time.
Understood. The problem is that it seems to delete old checkpoints and keep only the most recent few. I guess I've lost the optimal point.
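If the project uses TensorFlow's tf.train.Saver (an assumption on my part, I haven't checked this repo's code), old checkpoints get deleted because the Saver's max_to_keep option defaults to 5. A minimal sketch of how that limit could be raised:

```python
import tensorflow as tf  # TF 1.x-style API, assumed here

# Hypothetical: wherever the project constructs its Saver,
# raising max_to_keep stops it from pruning older checkpoints.
saver = tf.train.Saver(max_to_keep=50)  # keep the last 50 instead of the default 5

# Any retained step can then be restored later, e.g.:
# saver.restore(sess, "models/model.ckpt-42000")
```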
It's been training on a single-GPU machine for 3 days, and there is no sign of finishing. How long should it take?