EricGuo5513 / text-to-motion

Official implementation for "Generating Diverse and Natural 3D Human Motions from Texts (CVPR2022)."

Cannot train text2motion model #7

Closed hoyeYang closed 1 year ago

hoyeYang commented 2 years ago

I used the following command to train the text2motion model.

python train_comp_v6.py --name Comp_v6_KLD01 --gpu_id 0 --lambda_kld 0.01 --dataset_name t2m

But an error occurred at about the 500th iteration.

epoch:   0 niter:     50 sub_epoch:  0 inner_iter:   49 1m 3s val_loss: 0.0000  loss_gen: 0.9746  loss_mot_rec: 0.6054  loss_mov_rec: 0.3078  loss_kld: 6.1342  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    100 sub_epoch:  0 inner_iter:   99 1m 52s val_loss: 0.0000  loss_gen: 0.8278  loss_mot_rec: 0.5583  loss_mov_rec: 0.2669  loss_kld: 0.2562  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    150 sub_epoch:  0 inner_iter:  149 2m 39s val_loss: 0.0000  loss_gen: 0.7858  loss_mot_rec: 0.5226  loss_mov_rec: 0.2612  loss_kld: 0.1987  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    200 sub_epoch:  0 inner_iter:  199 3m 26s val_loss: 0.0000  loss_gen: 0.7743  loss_mot_rec: 0.5013  loss_mov_rec: 0.2703  loss_kld: 0.2686  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    250 sub_epoch:  0 inner_iter:  249 4m 15s val_loss: 0.0000  loss_gen: 0.7172  loss_mot_rec: 0.4526  loss_mov_rec: 0.2606  loss_kld: 0.4083  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    300 sub_epoch:  0 inner_iter:  299 5m 5s val_loss: 0.0000  loss_gen: 0.6978  loss_mot_rec: 0.4346  loss_mov_rec: 0.2569  loss_kld: 0.6277  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    350 sub_epoch:  0 inner_iter:  349 5m 56s val_loss: 0.0000  loss_gen: 0.6697  loss_mot_rec: 0.3971  loss_mov_rec: 0.2579  loss_kld: 1.4755  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    400 sub_epoch:  0 inner_iter:  399 6m 44s val_loss: 0.0000  loss_gen: 0.6330  loss_mot_rec: 0.3584  loss_mov_rec: 0.2516  loss_kld: 2.2961  sl_length:10 tf_ratio:0.40
epoch:   0 niter:    450 sub_epoch:  0 inner_iter:  449 7m 13s val_loss: 0.0000  loss_gen: 0.5758  loss_mot_rec: 0.3212  loss_mov_rec: 0.2324  loss_kld: 2.2157  sl_length:10 tf_ratio:0.40
[W python_anomaly_mode.cpp:104] Warning: Error detected in DivBackward0. Traceback of forward call that caused the error:
  File "train_comp_v6.py", line 149, in <module>
    trainer.train(train_dataset, val_dataset, plot_t2m)
  File "/SSD_DISK/users/projects/3Dpose/t2m_test/text-to-motion/networks/trainers.py", line 658, in train
    log_dict = self.update()
  File "/SSD_DISK/users/projects/3Dpose/t2m_test/text-to-motion/networks/trainers.py", line 480, in update
    loss_logs = self.backward_G()
  File "/SSD_DISK/users/projects/3Dpose/t2m_test/text-to-motion/networks/trainers.py", line 456, in backward_G
    self.loss_kld = self.kl_criterion(self.mus_post, self.logvars_post, self.mus_pri, self.logvars_pri)
  File "/SSD_DISK/users/projects/3Dpose/t2m_test/text-to-motion/networks/trainers.py", line 267, in kl_criterion
    2 * torch.exp(logvar2)) - 1 / 2
 (function _print_stack)
Traceback (most recent call last):
  File "train_comp_v6.py", line 149, in <module>
    trainer.train(train_dataset, val_dataset, plot_t2m)
  File "/SSD_DISK/users/projects/3Dpose/t2m_test/text-to-motion/networks/trainers.py", line 658, in train
    log_dict = self.update()
  File "/SSD_DISK/users/projects/3Dpose/t2m_test/text-to-motion/networks/trainers.py", line 484, in update
    self.loss_gen.backward()
  File "/SSD_DISK/users/software/anaconda3/envs/yhy_hm/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/SSD_DISK/users/software/anaconda3/envs/yhy_hm/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
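
For context, kl_criterion at networks/trainers.py line 267 computes the KL divergence between two diagonal Gaussians. A minimal sketch of that term, reconstructed from the fragment in the traceback rather than copied from the repository, shows where the DivBackward0 node comes from:

import torch

def kl_criterion(mu1, logvar1, mu2, logvar2):
    # KL( N(mu1, exp(logvar1)) || N(mu2, exp(logvar2)) ) for diagonal Gaussians.
    # The division by 2 * torch.exp(logvar2) is the DivBackward0 node named in
    # the error; it yields NaN gradients as soon as any input contains NaN.
    kld = (logvar2 - logvar1) / 2 \
        + (torch.exp(logvar1) + (mu1 - mu2) ** 2) / (2 * torch.exp(logvar2)) \
        - 1 / 2
    return kld.sum() / mu1.shape[0]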
EricGuo5513 commented 2 years ago

Hi, have you figured out the problem? I haven't encountered this before. It is possibly due to NaN values in the loss calculation, which is also quite weird. I hope this link helps: https://github.com/pytorch/pytorch/issues/22820.
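
For anyone hitting the same error, the linked PyTorch thread mostly boils down to enabling anomaly detection and checking the loss inputs for non-finite values before calling backward(). A minimal sketch of that pattern; the helper below is illustrative and not part of the repository:

import torch

# Report the forward op that produced a NaN gradient; judging by the
# python_anomaly_mode warning in the log above, this was already enabled.
torch.autograd.set_detect_anomaly(True)

def assert_finite(name, tensor):
    # Call on mus_post / logvars_post / mus_pri / logvars_pri (and the loss itself)
    # before backward() to see whether the NaN enters via the data or the KL math.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values detected in {name}")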

hoyeYang commented 2 years ago

Hi, sorry for my late reply. I tried the method mentioned in the link, but it did not work. I eventually found that this error was triggered by NaN values in the ground-truth data. I went back to the notebook used for preprocessing the dataset and found that the errors between my processed data and the provided data were very large. (screenshot) So I reprocessed the dataset from scratch (including re-downloading the AMASS dataset), but the errors were still very large. It might be caused by some differences between my Python environment and yours. I will check it.
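
If others run into the same issue, one quick way to confirm whether the processed ground truth contains NaNs is to scan the motion feature files. The directory path below is an assumption about the HumanML3D layout, so point it at wherever your processed .npy files actually live:

import glob
import numpy as np

# Hypothetical path; replace with your processed motion-feature folder.
for path in glob.glob("./dataset/HumanML3D/new_joint_vecs/*.npy"):
    data = np.load(path)
    if not np.isfinite(data).all():
        print(f"non-finite values in {path}")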