YeTianJHU closed this issue 6 years ago
I ran into the same problem when trying to train a savp model. Have you solved this issue?
No. I think it may be related to a cuDNN version issue. Still working on it.
The first two errors about topological order don't seem to be the issue, and you can probably ignore them.
As you said, the problem is likely cuDNN. According to here, you should use cuDNN SDK >= 7.2. Can you upgrade cuDNN?
Problem solved. I needed to reinstall TensorFlow after upgrading cuDNN. Many thanks!!
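For anyone checking their own setup against the "cuDNN SDK >= 7.2" requirement mentioned above, the comparison can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_version` and `cudnn_is_new_enough` are made up for this example, and the version string is assumed to be something you have already looked up yourself (the script does not detect the installed cuDNN).

```python
import re

# Minimum cuDNN version mentioned in this thread (cuDNN SDK >= 7.2).
MIN_CUDNN = (7, 2)


def parse_version(text):
    """Turn a dotted version string such as '7.3.1' into a tuple of ints."""
    parts = re.findall(r"\d+", text)
    if not parts:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(p) for p in parts)


def cudnn_is_new_enough(installed, minimum=MIN_CUDNN):
    """Return True if the given cuDNN version string meets the minimum."""
    return parse_version(installed) >= tuple(minimum)


if __name__ == "__main__":
    # Versions reported by different posters in this thread.
    for version in ("7.0.5", "7.1", "7.3.1"):
        status = "ok" if cudnn_is_new_enough(version) else "too old"
        print("cuDNN %s: %s" % (version, status))
```

Pass only the bare version number (for example "7.3.1"); a string containing other digits, such as a CUDA version, would be parsed incorrectly. As noted above, TensorFlow also had to be reinstalled after the cuDNN upgrade before the fix took effect.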
Hi Alex,
I use TensorFlow 1.11.0, CUDA 9.0.176 and cuDNN 7.3.1 on Ubuntu 16.04. My GPUs are NVIDIA Titan Xp. When I tried to train a savp model with the command
CUDA_VISIBLE_DEVICES=0,1 python scripts/train.py --input_dir data/bair --dataset bair \
--model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
--output_dir logs/bair_action_free/ours_savp \
--gpu_mem_frac 0.7
the program seems to stop running and never moves forward, as if stuck in an endless loop, after outputting
2018-11-08 08:14:24.668709: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:14:25.028003: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
What could be the reason for this? More information is below:
2018-11-08 08:13:09.286148: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:13:09.515356: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
session.run took 22.4s
recording summary
done
recording image summary
done
progress global step 0 epoch 0 step 2560
discrim_video_sn_gan_loss (1.0238764, 1.0)
discrim_video_sn_vae_gan_loss (0.895689, 1.0)
gen_l1_loss (0.0804427, 100.0)
gen_video_sn_gan_loss (1.0158763, 1.0)
gen_video_sn_vae_gan_loss (0.8958128, 1.0)
gen_kl_loss (0.045274347, 0.0)
learning_rate 0.0002
saving model to logs/bair_action_free/ours_savp
done
2018-11-08 08:14:24.668709: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:14:25.028003: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
Same problem here, please help
2019-05-14 15:09:28.745521: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-05-14 15:09:30.421876: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-05-14 15:13:00.040767: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.9.2 locally
INFO:tensorflow:loss = 0.7136042, step = 0
2019-05-14 15:13:45.954118: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-05-14 15:13:47.336729: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
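Since the topological-sort messages were described earlier in the thread as probably ignorable, it can help to hide them and look only at what remains, to see whether training is advancing at all. A minimal sketch, assuming the log has been saved as text with one message per line; the function name `split_log` is hypothetical, and the sample lines are copied from the log above.

```python
# Substring that identifies the grappler messages seen in this thread.
GRAPPLER_MARKER = "topological sort failed with message"


def split_log(lines):
    """Separate grappler topological-sort messages from all other log lines."""
    grappler, other = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if GRAPPLER_MARKER in line:
            grappler.append(line)
        else:
            other.append(line)
    return grappler, other


SAMPLE_LOG = """\
2019-05-14 15:09:28.745521: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
INFO:tensorflow:loss = 0.7136042, step = 0
2019-05-14 15:13:45.954118: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
"""

if __name__ == "__main__":
    grappler, other = split_log(SAMPLE_LOG.splitlines())
    print("%d grappler messages hidden" % len(grappler))
    for line in other:
        print(line)
```

If the remaining lines stop changing (no new steps or losses), the hang is more likely somewhere else, such as the cuDNN version mismatch discussed in this thread.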
When I try to generate GIFs from a pre-trained model on the BAIR dataset, this error happens:
I'm using TensorFlow 1.10.1, CUDA 9.0 and cuDNN 7.1 on Ubuntu 16.04. CUDA and cuDNN are installed properly. My GPUs are dual NVIDIA Titan V. I also tried TensorFlow 1.6.0, but got the same result. Do you have any ideas about this error? Thanks in advance!
Update: I also tried cuDNN 7.0.5 but still have this problem. Thanks!