commaai / research

dataset and code for 2016 paper "Learning a Driving Simulator"
BSD 3-Clause "New" or "Revised" License
4.12k stars 1.17k forks

NaN error when training autoencoder #24

Open kamal94 opened 8 years ago

kamal94 commented 8 years ago

I had hoped I could solve this for myself, but I regrettably couldn't, so I'm hoping someone here knows how to fix this:

When training the autoencoder as described in the DriveSimulator.md file, I get a NaN error from TensorFlow. The error is completely unpredictable: it happens at a different epoch every time I restart training.

Here is my output:

Epoch 1/200
   64/10000 [..............................] - ETA: 1903s - g_loss: 4.4450 - d_loss: 5.4199 - d_loss_fake: 4.4598 - d_loss_legit: 0.9601 - time: 10.4118
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 2061 get requests, put_count=2041 evicted_count=1000 eviction_rate=0.489956 and unsatisfied allocation rate=0.543426
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
 1280/10000 [==>...........................] - ETA: 343s - g_loss: 5.1531 - d_loss: 2.9220 - d_loss_fake: 1.1781 - d_loss_legit: 1.7439 - time: 2.3984
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 5407 get requests, put_count=5279 evicted_count=1000 eviction_rate=0.18943 and unsatisfied allocation rate=0.212872
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
 3584/10000 [=========>....................] - ETA: 221s - g_loss: 6.1948 - d_loss: 2.4484 - d_loss_fake: 1.0048 - d_loss_legit: 1.4435 - time: 2.1499
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
E tensorflow/core/client/tensor_c_api.cc:485] Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
Traceback (most recent call last):
  File "./train_generative_model.py", line 168, in <module>
    nb_epoch=args.epoch, verbose=1, saver=saver
  File "./train_generative_model.py", line 85, in train_model
    g_loss, samples, xs = g_train(x, z, counter)
  File "/home/kamal/Desktop/research/models/autoencoder.py", line 241, in train_g
    outs = sess.run(outputs + updates, feed_dict={Img: images, Z: z, Z2: z2, K.learning_phase(): 1})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
         [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
Caused by op u'HistogramSummary', defined at:
  File "./train_generative_model.py", line 159, in <module>
    g_train, d_train, sampler, saver, loader, extras = get_model(sess=sess, name=args.name, batch_size=args.batch, gpu=args.gpu)
  File "/home/kamal/Desktop/research/models/autoencoder.py", line 204, in get_model
    sum_e_mean = tf.histogram_summary("e_mean", E_mean)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 125, in histogram_summary
    tag=tag, values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 100, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

Again, this happens randomly at different epochs (1, 3, 18, or 23); I can only get so far into training before hitting it. Any ideas? I tried lowering the learning rate to 0.0001, but the error persisted.
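Until the root cause is found, one workaround is to screen a tensor's values for NaN/Inf before they reach the histogram summary, so a bad batch skips logging instead of killing the run. Below is a minimal NumPy sketch of such a guard; the function name `finite_or_none` and the idea of calling it on fetched activations (e.g. `E_mean`) before summarizing are my own illustration, not part of the repo's code.

```python
import numpy as np

def finite_or_none(name, arr):
    """Return arr if every value is finite; otherwise report and return None.

    A hypothetical guard one could apply to an activation before handing it
    to tf.histogram_summary, so a stray NaN skips the summary step instead
    of crashing training with InvalidArgumentError.
    """
    finite_mask = np.isfinite(arr)
    if finite_mask.all():
        return arr
    bad = arr.size - np.count_nonzero(finite_mask)
    print("%s contains %d non-finite values; skipping summary" % (name, bad))
    return None
```

With a guard like this, training continues (and tells you which tensor went bad first), which is often the quickest way to localize where the NaNs originate.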

EderSantana commented 8 years ago

Whoa, this is weird. I'm sure I ran the model several times on both GTX 1080 and Titan X GPUs without getting NaN. The problem is probably not in the data; otherwise people training the steering model would have complained as well.

May I ask what your GPU and TF versions are?

EderSantana commented 8 years ago

By any chance, do you have a multi-GPU setup and are asking TF to use only one GPU?

Also, are you able to continue training from the checkpoint? If you do, does it crash at the same point again? I remember getting random crashes due to TF rounding problems, but I could always resume from the checkpoint.

kamal94 commented 8 years ago

Graphics card: GTX 1060
TF: tensorflow (0.10.0rc0)
CUDA compilation tools: release 7.5, V7.5.17
cuDNN version: 4

I only have 1 GPU, and am using it for training.

I am not sure how to continue training from a checkpoint; I wasn't aware TF automatically creates checkpoints. I have simply been restarting the server and running the training again from scratch every time I get this error. (By the way, it seems to be almost finished now at epoch 195, so fingers crossed.) I just don't think it's safe to leave a bug like this (if it exists) lying around, since it could waste days of training.

For more info: I trained this on an Nvidia Tesla K20, and although it was slower than my 1060, it worked the first time without any errors. Again, I'm worried this may be a nondeterministic error, which would make it hard to hunt down.

EderSantana commented 8 years ago

TensorFlow does not do that automatically, but our code does. Add the flag --loadweights to continue from a checkpoint: https://github.com/commaai/research/blob/master/train_generative_model.py#L137

Yeah, I guess it's some rounding error in TF that's beyond my reach for now... but let me know if the checkpoint approach works for you.
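The repo's actual resume path is TensorFlow's Saver driven by the --loadweights flag mentioned above. Purely as an illustration of the save-then-resume pattern it relies on, here is a minimal, self-contained Python sketch; the pickle storage and the `STATE_PATH` file name are hypothetical stand-ins for the real TF checkpoint files.

```python
import os
import pickle

STATE_PATH = "train_state.pkl"  # hypothetical; the repo's saver writes real TF checkpoint files

def save_state(epoch, weights):
    # Persist the last completed epoch and model weights after each epoch,
    # so a crash loses at most one epoch of work.
    with open(STATE_PATH, "wb") as f:
        pickle.dump({"epoch": epoch, "weights": weights}, f)

def load_state():
    # On startup, resume from the checkpoint if one exists; otherwise start fresh.
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": None}
```

The point of the pattern, as in the --loadweights flow, is that a random crash costs you one epoch instead of days of training.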

zhaohuaqing1993 commented 7 years ago

How do you train the train_generative_model.py autoencoder successfully? I've run into some difficulty. Do I have to change something in the code? Thanks.