exitudio / MMM

Official repository for "MMM: Generative Masked Motion Model"
https://exitudio.github.io/MMM-page/

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 #9

Open biscuit1103 opened 4 months ago

biscuit1103 commented 4 months ago

Thanks for your great work! During replication I hit a stumbling block in the Eval step and ran into the error below. How should I solve it?

    55%|█████▌ | 11/20 [4:04:49<3:20:18, 1335.42s/it]
    Traceback (most recent call last):
      File "GPT_eval_multi.py", line 115, in <module>
        best_fid, best_iter, best_div, best_top1, best_top2, best_top3, best_matching, best_multi, writer, logger = eval_trans.evaluation_transformer(args.out_dir, val_loader, net, trans_encoder, logger, writer, 0, best_fid=1000, best_iter=0, best_div=100, best_top1=0, best_top2=0, best_top3=0, best_matching=100, clip_model=clip_model, eval_wrapper=eval_wrapper, dataname=args.dataname, save=False, num_repeat=11, rand_pos=True)
      File "/data1/anaconda3/envs/T2M/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/data2/MMM-main/utils/eval_trans.py", line 217, in evaluation_transformer
        index_motion = trans(feat_clip_text, word_emb, type="sample", m_length=pred_len, rand_pos=rand_pos, CFG=CFG)
      File "/data1/anaconda3/envs/T2M/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/data2/MMM-main/models/t2m_trans.py", line 199, in forward
        return self.sample(*args, **kwargs)
      File "/data2/MMM-main/models/t2m_trans.py", line 263, in sample
        sorted_score_indices = scores.multinomial(scores.shape[-1], replacement=False)  # stocastic
    RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
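For anyone debugging this before the fix: below is a minimal diagnostic sketch that checks the score tensor right before the failing multinomial call. The variable name `scores` and the multinomial line mirror the traceback; the helper itself is illustrative and not part of the repository.

    import torch

    def check_scores(scores: torch.Tensor) -> None:
        """Report the conditions that make torch.multinomial raise this RuntimeError."""
        if torch.isnan(scores).any():
            print("scores contain NaN values")
        if torch.isinf(scores).any():
            print("scores contain inf values")
        if (scores < 0).any():
            print("scores contain negative entries")

    # Usage right before the failing line in models/t2m_trans.py:
    # check_scores(scores)
    # sorted_score_indices = scores.multinomial(scores.shape[-1], replacement=False)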

exitudio commented 4 months ago

Hi, thank you for your interest. Are you evaluating with our pretrained model? Does the problem occur every time?

biscuit1103 commented 4 months ago

> Hi, thank you for your interest. Are you evaluating with our pretrained model? Does the problem occur every time?

Thanks for your reply. Yes, it occurs after 6 iterations when I run the evaluation command on the given pretrained model. I used torch 1.8.1+cu111.

exitudio commented 4 months ago

I found a potential problem in this line and have already committed a fix, along with an updated pretrained model.

Can you try to:

  1. Download the new pretrained model (Section 2.3 in the README):
    bash dataset/prepare/download_model.sh
  2. Run the evaluation (Eval section in the README):
    python GPT_eval_multi.py --exp-name eval_name --resume-pth output/vq/2023-07-19-04-17-17_12_VQVAE_20batchResetNRandom_8192_32/net_last.pth --resume-trans output/t2m/2023-10-10-03-17-01_HML3D_44_crsAtt2lyr_mask0.5-1/net_last.pth --num-local-layer 2

    --num-local-layer 2: I use 2 local layers instead of 1 in the previous model.

If you still see the problem, please let me know. Thank you.

biscuit1103 commented 3 months ago

> I found a potential problem in this line and have already committed a fix, along with an updated pretrained model.
>
> Can you try to:
>
>   1. Download the new pretrained model (Section 2.3 in the README):
>     bash dataset/prepare/download_model.sh
>   2. Run the evaluation (Eval section in the README):
>     python GPT_eval_multi.py --exp-name eval_name --resume-pth output/vq/2023-07-19-04-17-17_12_VQVAE_20batchResetNRandom_8192_32/net_last.pth --resume-trans output/t2m/2023-10-10-03-17-01_HML3D_44_crsAtt2lyr_mask0.5-1/net_last.pth --num-local-layer 2
>
>     --num-local-layer 2: I use 2 local layers instead of 1 in the previous model.
>
> If you still see the problem, please let me know. Thank you.

Thank you very much! I have solved that problem, but the evaluation results do not seem to reach the reported ones, and I am not sure which part caused the discrepancy. [screenshot of evaluation results]

exitudio commented 3 months ago

This issue likely comes from the randomness introduced in this line. I just commented it out in the previous commit. Can you double-check that your copy is updated?
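For context, here is an illustrative sketch of the difference between the stochastic ordering in the traceback and a deterministic one. The `scores` tensor stands in for the confidence scores from the error; the shapes and everything else are assumptions for illustration, not the repository's actual code.

    import torch

    # Hypothetical (batch, sequence) confidence scores; shape is illustrative only.
    scores = torch.rand(2, 49) + 1e-6  # keep entries strictly positive

    # Stochastic ordering (the line from the traceback): draw all indices without
    # replacement, so low-confidence positions can still land early and results
    # vary from run to run.
    stochastic_order = scores.multinomial(scores.shape[-1], replacement=False)

    # Deterministic ordering: sort positions by confidence, which removes the
    # run-to-run variation that can shift metrics such as FID.
    deterministic_order = scores.argsort(dim=-1, descending=True)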

biscuit1103 commented 3 months ago

> This issue likely comes from the randomness introduced in this line. I just commented it out in the previous commit. Can you double-check that your copy is updated?

Thank you for your answer. That part of the code has been updated on my side. After modifying it, I get an FID of 0.088. To reach the reported 0.080 result, which uses the predicted length, which parameters or pretrained models should I replace?

exitudio commented 3 months ago

We use the pretrained length estimator from Text-to-motion (the model proposed along with the HumanML3D paper) here.

This can be plugged into our existing model. I will let you know after adding this code to the repo.
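As a rough sketch of how such a length estimator could be plugged in before sampling: the estimator interface below is a hypothetical stand-in, not the Text-to-motion API or the code that will be added to this repo.

    import torch

    def estimate_length(length_estimator: torch.nn.Module,
                        text_feat: torch.Tensor) -> torch.Tensor:
        """Map text features to a predicted motion length per sample (hypothetical interface)."""
        logits = length_estimator(text_feat)   # (batch, max_length) logits over lengths
        return logits.argmax(dim=-1) + 1       # most likely length, 1-indexed

    # Usage sketch: feed the predicted length instead of the ground-truth length
    # when calling the transformer in eval_trans.py.
    # pred_len = estimate_length(length_estimator, feat_clip_text)
    # index_motion = trans(feat_clip_text, word_emb, type="sample",
    #                      m_length=pred_len, rand_pos=rand_pos, CFG=CFG)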