alexlee-gk / video_prediction

Stochastic Adversarial Video Prediction
https://alexlee-gk.github.io/video_prediction/
MIT License

running train_op took too long ?? #24

Open auzyze opened 5 years ago

auzyze commented 5 years ago

Thanks for sharing this great work!

I ran into this issue when training ours_savp on the KTH dataset: training appears to proceed correctly, but it is very slow.

running train_op took too long (7.2s)
running train_op took too long (7.2s)
.....
progress  global step 100  epoch 0.5
          image/sec 1.1  remaining 37520m (625.3h) (26.1d)
d_loss 0.10482973
   discrim_video_sn_gan_loss (0.5204395, 0.1)
   discrim_video_sn_vae_gan_loss (0.5278577, 0.1)
g_loss 2.0725453
   gen_l1_loss (0.016228592, 100.0)
   gen_video_sn_gan_loss (0.32749984, 0.1)
   gen_video_sn_vae_gan_loss (0.35494953, 0.1)
   gen_video_sn_vae_gan_feature_cdist_loss (0.038144115, 10.0)
   gen_kl_loss (0.6190775, 0.0)
learning_rate 0.0002
running train_op took too long (7.2s)
running train_op took too long (7.2s)
running train_op took too long (7.3s)
......
......

My configuration: TensorFlow 1.10.0, CUDA 9.0, cuDNN 7.3.0.29

I'm running the KTH dataset with the ours_savp model. With the default hparams I got an out-of-memory error, so I changed batch_size to 8.
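
For reference, a hyperparameter like batch_size can be overridden when launching training instead of editing the json file; the command I use looks roughly like the one below (flag names and paths are from memory, so please check scripts/train.py and the hparams directory for the exact ones):

    python scripts/train.py --input_dir data/kth --dataset kth \
        --model savp --model_hparams_dict hparams/kth/ours_savp/model_hparams.json \
        --model_hparams batch_size=8 \
        --output_dir logs/kth/ours_savp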

My GPU appears to be working properly:

+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:02:00.0 Off |                    0 |
| 37%   73C    P0   124W / 235W |  10963MiB / 11441MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+

TensorBoard refreshes when summary_freq is reached.

I'd appreciate any suggestions. Regards,

nishokkumars commented 4 years ago

@alexlee-gk, could you please help? I am facing the same issue.

Berndinio commented 3 years ago

I am also facing the same issue. It seems to be just a print statement in train.py, around line 267. Since the message is only printed, nothing else happens in that if branch, and the checked variables aren't used later on, I assume training is proceeding correctly; the model was probably originally trained on faster GPUs or TPUs. As you can see, the sess.run() call (which the timing is measured around) is always executed. You can simply wrap the for-loop it is in with tqdm to see the progress, as in the sketch below.
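
A minimal sketch of what the timed part of the loop looks like and how tqdm could be wrapped around it (variable names such as fetches, start_step, total_steps and the 1.5 s threshold are illustrative assumptions, not verbatim repo code; sess and fetches come from earlier in train.py):

    import time
    from tqdm import tqdm

    # Wrapping the step range in tqdm adds a progress bar with an ETA.
    for step in tqdm(range(start_step, total_steps)):
        run_start_time = time.time()
        results = sess.run(fetches)  # the actual training step
        run_elapsed_time = time.time() - run_start_time
        # The warning below is purely informational: nothing downstream
        # depends on it, so training continues normally even if it is
        # printed on every step.
        if run_elapsed_time > 1.5:
            print('running train_op took too long (%0.1fs)' % run_elapsed_time)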