Pascalson / Conditional-Seq-GANs

GANs for Conditional Sequence Generation. Tensorflow. Including the code of paper "Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation" IEEE/ACM TASLP, 2019.
MIT License

numerics error (counting task) #1

Open Jakkque opened 5 years ago

Jakkque commented 5 years ago

Hey,

I am trying to use your algorithm on my own data. First, though, I want to get familiar with your code by training the network on the counting task, i.e.

$ bash run.sh  0x1B81 None StepGAN Counting

Unfortunately, I get an error during training. When I restart the training after the error, it sometimes works for a few more steps, and then the error occurs again. For example, after step 41600 I got this message:

global step 41600; learning rate 0.27084273; D lr 0.30555877; step-time 0.13;
perp -0.0040
0.013233515731119664
D-loss 0.1305
reward(D_fake_value) [2.25295870e-02 1.73941438e-02 2.30499967e-03 2.22686738e-03  6.23991021e-10]

2018-12-06 14:49:17.527221: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x10209c3a100 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
     [[{{node VerifyFinite_4/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm_4/global_norm)]]
     [[{{node clip_by_global_norm_4/mul_8/_2349}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7646_clip_by_global_norm_4/mul_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 34, in <module>
    train_gan()
  File "/home/user/codes/step_gan/train_gan_n_rl.py", line 140, in train_gan
    bucket_id, seq_lens, GAN_mode='D')
  File "/home/user/codes/step_gan/seq2seq_model_comp.py", line 720, in train_step
    outputs = sess.run(output_feed, input_feed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
     [[node VerifyFinite_4/CheckNumerics (defined at /home/user/codes/step_gan/seq2seq_model_comp.py:502)  = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm_4/global_norm)]]
     [[{{node clip_by_global_norm_4/mul_8/_2349}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7646_clip_by_global_norm_4/mul_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'VerifyFinite_4/CheckNumerics', defined at:
  File "main.py", line 34, in <module>
    train_gan()
  File "/home/user/codes/step_gan/train_gan_n_rl.py", line 66, in train_gan
    dtype=tf.float32)
  File "/home/user/codes/step_gan/seq2seq_model_comp.py", line 502, in __init__
    clipped_D_grads, _ = tf.clip_by_global_norm(D_grads, max_gradient_norm)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/clip_ops.py", line 265, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/numerics.py", line 47, in verify_tensor_all_finite
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
     [[node VerifyFinite_4/CheckNumerics (defined at /home/user/codes/step_gan/seq2seq_model_comp.py:502)  = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm_4/global_norm)]]
     [[{{node clip_by_global_norm_4/mul_8/_2349}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7646_clip_by_global_norm_4/mul_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

The argmax results are still not convincing, so I think the model should train longer. Have you ever had this issue, and do you know how to solve it?

Cheers!

Pascalson commented 5 years ago

Hi,

Sorry for the late reply. I guess this is because the training has already failed. The perplexity has already become -0.004, which is an abnormal value. Maybe you can print the results and check whether the outputs are obviously fake.

During training, I have to monitor whether the values of "perp", "D-loss", etc. stay in a reasonable range. If they do not, the training should be stopped early, or I have to start training a new model.
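A minimal sketch of that kind of monitoring, in plain Python. This is not code from this repository: the function name `should_stop`, the `bounds` argument, and the example range for "perp" are all hypothetical and would need to be adapted to the values your run actually logs.

```python
import math


def should_stop(metrics, bounds=None):
    """Return a reason string if the logged values look broken, else None.

    metrics: dict of logged scalars, e.g. {"perp": ..., "D-loss": ...}.
    bounds:  optional dict mapping a metric name to an allowed (low, high)
             range. The ranges are assumptions chosen by the user; nothing
             here is taken from the repository.
    """
    bounds = bounds or {}
    for name, value in metrics.items():
        # NaN or Inf in any logged value means the run has already diverged.
        if not math.isfinite(value):
            return "{} is not finite: {}".format(name, value)
        if name in bounds:
            low, high = bounds[name]
            if not (low <= value <= high):
                return "{} = {} is outside [{}, {}]".format(name, value, low, high)
    return None


# Example with the values logged at step 41600 in the report above.
# The allowed range for "perp" is a hypothetical choice.
reason = should_stop({"perp": -0.0040, "D-loss": 0.1305},
                     bounds={"perp": (1.0, 1e4)})
if reason is not None:
    print("stopping early:", reason)
```

Calling something like this after every logging step lets the run stop, or fall back to an earlier checkpoint, before the gradient norm itself becomes NaN.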

If you want to train longer, the hyperparameters may need to be fine-tuned. The initial random seed also influences the results a lot.
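Since the seed matters this much, one option is to try several seeds and keep the best run. The sketch below only shows the bookkeeping: `run_with_seeds` and `train_fn` are hypothetical names, and `train_fn` is assumed to train one model for a given seed and return a score where higher is better. In a real TensorFlow 1.x run you would also seed NumPy and TensorFlow (for example with `np.random.seed` and `tf.set_random_seed`) inside `train_fn`.

```python
import random


def run_with_seeds(train_fn, seeds):
    """Run train_fn once per seed and return (best_seed, scores).

    train_fn is a hypothetical callable: it takes a seed, trains a model,
    and returns a score where higher is better.
    """
    scores = {}
    for seed in seeds:
        # Seed Python's RNG so each run is reproducible.
        # (NumPy / TensorFlow seeding would go inside train_fn.)
        random.seed(seed)
        scores[seed] = train_fn(seed)
    best_seed = max(scores, key=scores.get)
    return best_seed, scores


def dummy_train(seed):
    """Stand-in for a real training run; returns a pseudo-random score."""
    return random.random()


best_seed, scores = run_with_seeds(dummy_train, seeds=[0, 1, 2, 3])
print("best seed:", best_seed)
```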