zzaibi opened this issue 6 years ago
It trains forever. It saves periodically, and you can kill it whenever you think it's "done" based on the loss level.
The problem is, it only saves a very small number of checkpoints. I can't go back, right?
By default, it checkpoints every 2000 steps, which I think is pretty infrequent. You can change steps_per_checkpoint in models/config.json to a lower number to make it save more often. Since it saves a checkpoint and prints loss information at the same time, I'd wait until you see new loss information before killing it to minimize the amount of lost training time.
Understood. The problem is that it seems to delete old checkpoints and keep only the most recent few. I guess I've lost the optimal point.
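If the project uses TensorFlow's tf.train.Saver (an assumption on my part, I haven't checked this repo's code), old checkpoints get deleted because the Saver's max_to_keep option defaults to 5. A minimal sketch of how that limit could be raised:

```python
import tensorflow as tf  # TF 1.x-style API, assumed here

# Hypothetical: wherever the project constructs its Saver,
# raising max_to_keep stops it from pruning older checkpoints.
saver = tf.train.Saver(max_to_keep=50)  # keep the last 50 instead of the default 5

# Any retained step can then be restored later, e.g.:
# saver.restore(sess, "models/model.ckpt-42000")
```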
It's been training on a single-GPU machine for 3 days, and there is no sign of finishing. How long should it take?