Open kamal94 opened 8 years ago
Whoa, this is weird. I'm sure I ran the model several times on both 1080 and Titan X GPUs without getting NaNs. The problem is probably not in the data; otherwise people training the steering model would have complained as well.
May I ask what your GPU and TF versions are?
By any chance, do you have a multi-GPU setup and are asking TF to use only one GPU?
Also, are you able to continue training from the checkpoint? If you try to continue, does it crash at the same point again? I remember getting random crashes due to TF rounding problems, but I could continue training from the checkpoint.
Graphics card: GTX 1060
TF: tensorflow (0.10.0rc0)
CUDA compilation tools: release 7.5, V7.5.17
cuDNN: version 4
I only have 1 GPU, and am using it for training.
I am not sure how to continue training from a checkpoint. I wasn't aware that TF automatically creates checkpoints; I have simply been restarting the server and rerunning the training from scratch every time I get this error. (By the way, it seems to be almost finished now at epoch 195, so fingers crossed.) I just don't think it's safe to leave a bug like this (if it exists) lying around, since it could waste days of training.
For more info, I trained this on an NVIDIA Tesla K20, and although it was slower than my 1060, it worked the first time without any errors. Again, I'm worried that this may be a nondeterministic error, which could make it hard to hunt down.
TensorFlow does not do that automatically, but our code does. Add the flag --loadweights to continue from a checkpoint:
https://github.com/commaai/research/blob/master/train_generative_model.py#L137
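For anyone unfamiliar with the pattern, this is roughly what loading saved weights before resuming training looks like in Keras. A minimal sketch only: the model definition and file path below are placeholders, not the repo's actual code.

```python
import os
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    # Stand-in for the real autoencoder definition.
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=128))
    model.add(Dense(128))
    model.compile(optimizer='adam', loss='mse')
    return model

model = build_model()

# If an earlier run saved weights, restore them instead of starting from scratch.
weights_path = './outputs/autoencoder_weights.h5'  # placeholder path
if os.path.exists(weights_path):
    model.load_weights(weights_path)

# Training then continues from the restored parameters, e.g.:
# model.fit(X, X, batch_size=64)
```

So, assuming the earlier run got far enough to write a checkpoint, rerunning the usual training command with --loadweights appended should pick up from the last saved weights rather than starting over.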
Yeah, I guess it's some rounding error in TF that's beyond my reach for now... But let me know if the checkpoint thing works for you.
How do you train the train_generative_model.py autoencoder successfully? I'm running into some difficulty. Do I have to change something in the code? Thanks.
I had hoped I could solve this for myself, but I regrettably couldn't, so I'm hoping someone here knows how to fix this:
When training the autoencoder as prescribed by the DriveSimulator.md file,
I get a NaN error from TensorFlow. The error is completely unpredictable and happens in a different epoch every time I try to train again.
Here is my output:
Again, this happens randomly at different epochs (1, 3, 18, or 23). I can only get so far into training before hitting this error. Any ideas? I tried setting the learning rate to 0.0001, but the error persisted.
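In case it's useful for debugging, here is a minimal sketch of how one could guard against this with Keras callbacks so a NaN crash only costs one epoch rather than the whole run. The StopOnNaN callback and the file paths are my own additions, not part of this repo's code.

```python
import numpy as np
from keras.callbacks import Callback, ModelCheckpoint

class StopOnNaN(Callback):
    """Flag the training loop to stop as soon as the batch loss becomes NaN."""
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        loss = logs.get('loss')
        if loss is not None and np.isnan(loss):
            print('NaN loss at batch %d, stopping training.' % batch)
            self.model.stop_training = True

# Save weights at the end of every epoch so the run can be resumed
# from the last good checkpoint after a crash.
callbacks = [
    StopOnNaN(),
    ModelCheckpoint('./outputs/autoencoder.{epoch:02d}.h5',  # placeholder path
                    save_weights_only=True),
]

# model.fit(X, X, batch_size=64, callbacks=callbacks)
```

Combined with the --loadweights suggestion above, that should at least keep a random NaN from throwing away days of training.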