train issues - Githubissues

When I train on my own data, the loss becomes "nan":

Epoch: 0 Step: 200 / 218 time: 1.800146 s init_v_loss: 0.11003795 mean_v_loss: 0.11003795
Epoch: 0 Step: 201 / 218 time: 1.800304 s init_v_loss: 0.00626686 mean_v_loss: 0.05815240
Epoch: 0 Step: 202 / 218 time: 1.804186 s init_v_loss: 0.10782523 mean_v_loss: 0.07471001
Epoch: 0 Step: 203 / 218 time: 1.807079 s init_v_loss: 0.02169361 mean_v_loss: 0.06145591
Epoch: 0 Step: 204 / 218 time: 1.794902 s init_v_loss: nan mean_v_loss: nan
Epoch: 0 Step: 205 / 218 time: 1.804291 s init_v_loss: 0.05617625 mean_v_loss: nan
Epoch: 0 Step: 206 / 218 time: 1.793242 s init_v_loss: nan mean_v_loss: nan
Epoch: 0 Step: 207 / 218 time: 1.798064 s init_v_loss: nan mean_v_loss: nan
Epoch: 0 Step: 208 / 218 time: 1.798309 s init_v_loss: 0.02277363 mean_v_loss: nan
Epoch: 0 Step: 209 / 218 time: 1.797415 s init_v_loss: 0.10808768 mean_v_loss: nan

Can anyone help me?
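A note on reading the log: init_v_loss is finite again at steps 205, 208 and 209, yet mean_v_loss stays nan from step 204 onward. That is what a plain running mean does once a single nan enters it. The sketch below assumes mean_v_loss is a cumulative mean of init_v_loss starting at step 200. This is an assumption about the training script, although it does reproduce the first four logged means. The function `finite_running_means` is a hypothetical helper that skips non-finite values and records the affected steps. It only helps locate the bad batches and does not fix whatever produces the nan.

```python
import math

# (step, init_v_loss) pairs copied from the log above
log = [
    (200, 0.11003795),
    (201, 0.00626686),
    (202, 0.10782523),
    (203, 0.02169361),
    (204, float("nan")),
    (205, 0.05617625),
    (206, float("nan")),
    (207, float("nan")),
    (208, 0.02277363),
    (209, 0.10808768),
]


def running_means(losses):
    """Cumulative mean after each step; one nan poisons every later mean."""
    total, means = 0.0, []
    for i, loss in enumerate(losses, start=1):
        total += loss
        means.append(total / i)
    return means


def finite_running_means(pairs):
    """Hypothetical variant: skip non-finite losses and record their steps."""
    total, count, means, bad_steps = 0.0, 0, [], []
    for step, loss in pairs:
        if math.isfinite(loss):
            total += loss
            count += 1
        else:
            bad_steps.append(step)
        means.append(total / count if count else float("nan"))
    return means, bad_steps


naive = running_means([loss for _, loss in log])
safe, bad_steps = finite_running_means(log)

print("naive mean at step 205:", naive[5])
print("steps with non-finite loss:", bad_steps)
print("mean over finite losses only:", safe[-1])
```

If the assumption holds, only the batches at the recorded steps produce a non-finite loss, so those batches (or the images in them) are the place to start looking.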

Also, training is interrupted once the first epoch finishes, and I get errors when I start training again, like:

WARNING:tensorflow:From /home/luda403/AnimeGAN-master/tools/data_loader.py:76: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_one_shot_iterator(dataset).
[] Reading checkpoints...
[] Success to read AnimeGAN.model-0
[] Load SUCCESS
2024-03-28 12:13:54.255522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2024-03-28 12:13:54.573781: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-03-28 12:16:58.252940: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Resource exhausted: /tmp/tempfile-dcxx02-8ffd700-200527-614b77fefbdf8; No space left on device
Relying on driver to perform ptx compilation. This message will be only logged once.
2024-03-28 12:17:10.050582: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2024-03-28 12:17:10.050867: I tensorflow/stream_executor/stream.cc:4976] [stream=0x563a23f10970,impl=0x563a23f10400] did not memset GPU location; source: 0x7f4909ffcb10; size: 8388608; pattern: ffffffff
2024-03-28 12:17:10.050880: I tensorflow/stream_executor/stream.cc:4976] [stream=0x563a23f10970,impl=0x563a23f10400] did not memset GPU location; source: 0x7f4909ffcb30; size: 8388608; pattern: ffffffff
2024-03-28 12:17:10.050927: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_ops.cc:1006 : Not found: No algorithm worked!
  Traceback (most recent call last):
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
      return fn(args)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
      target_list, run_metadata)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
      run_metadata)
  tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
    (0) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
           [[{{node Tensordot/MatMul}}]]
           [[generator/G_MODEL/Tanh/_1417]]
    (1) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
           [[{{node Tensordot/MatMul}}]]
  0 successful operations.
  0 derived errors ignored.

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "train.py", line 100, in <module>
      main()
    File "train.py", line 94, in main
      gan.train()
    File "/home/luda403/AnimeGAN-master/AnimeGAN.py", line 258, in train
      self.Generator_loss, self.G_loss_merge], feed_dict = train_feed_dict)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
      run_metadata_ptr)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
      feed_dict_tensor, options, run_metadata)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
      run_metadata)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
      raise type(e)(node_def, op, message)
  tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
    (0) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
           [[node Tensordot/MatMul (defined at /home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
           [[generator/G_MODEL/Tanh/_1417]]
    (1) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
           [[node Tensordot/MatMul (defined at /home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  0 successful operations.
  0 derived errors ignored.

  Original stack trace for 'Tensordot/MatMul':
    File "train.py", line 100, in <module>
      main()
    File "train.py", line 89, in main
      gan.build_model()
    File "/home/luda403/AnimeGAN-master/AnimeGAN.py", line 160, in build_model
      t_loss = self.con_weight * c_loss + self.sty_weight * s_loss + color_loss(self.real,self.generated) * self.color_weight
    File "/home/luda403/AnimeGAN-master/tools/ops.py", line 278, in color_loss
      con = rgb2yuv(con)
    File "/home/luda403/AnimeGAN-master/tools/ops.py", line 295, in rgb2yuv
      return tf.image.rgb_to_yuv(rgb)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/image_ops_impl.py", line 2930, in rgb_to_yuv
      return math_ops.tensordot(images, kernel, axes=[[ndims - 1], [0]])
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 4071, in tensordot
      ab_matmul = matmul(a_reshape, b_reshape)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
      return target(*args, **kwargs)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
      a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
      name=name)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
      op_def=op_def)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
      return func(*args, **kwargs)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
      attrs, op_def, compute_device)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
      op_def=op_def)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
      self._traceback = tf_stack.extract_stack()
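
One line in the log above is worth checking before anything else: the redzone_allocator warning reports "No space left on device" for a temporary file under /tmp, shortly before the cuBLAS failure. Whether that is the actual cause of the "Blas GEMM launch failed" error is not established here. As a minimal sketch, the free space in the temporary directory could be checked before restarting training. The threshold `MIN_FREE_GIB` and the helper names are hypothetical and not part of AnimeGAN or TensorFlow.

```python
import shutil
import tempfile


def free_space_gib(path):
    """Return the free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_enough_space(path, min_free_gib):
    """True if at least min_free_gib GiB are free at path."""
    return free_space_gib(path) >= min_free_gib


# Hypothetical threshold; pick a value that suits the machine.
MIN_FREE_GIB = 1.0

tmp_dir = tempfile.gettempdir()  # usually /tmp on Linux
if has_enough_space(tmp_dir, MIN_FREE_GIB):
    print("%s has %.2f GiB free" % (tmp_dir, free_space_gib(tmp_dir)))
else:
    print("%s is nearly full (%.2f GiB free); free some space before training"
          % (tmp_dir, free_space_gib(tmp_dir)))
```

If the temporary directory turns out to be full, clearing it or pointing the TMPDIR environment variable at a larger disk before launching train.py are two things to try.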

TachibanaYoshino / AnimeGANv2

train issues #70