Open YoungSeng opened 6 months ago
I have the following suggestions:
- Provide more details: 1.1 For example, does the NaN only appear after epoch 200, or is it there from the very start? If it is there from the start, try retraining with a lower learning rate. 1.2 Check whether the data produced by the preceding preprocessing step contains NaN.
- Alternatively, try reproducing that reference code and compare it against this project's code to see where the problem is.
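Suggestion 1.2 above can be automated with a quick scan of the preprocessed `.npy` files for NaN/Inf. A minimal sketch (the commented `np.load` path is the one that appears later in this thread; substitute your own files):

```python
import numpy as np

def count_bad_values(arr):
    """Count NaN and Inf entries in an array."""
    return int(np.isnan(arr).sum()), int(np.isinf(arr).sum())

# Replace this synthetic array with e.g.
#   data = np.load("./datasets/Trinity_ZEGGS/Trinity.npy")
data = np.array([[0.1, np.nan], [np.inf, 2.0]])
n_nan, n_inf = count_bad_values(data)
print(n_nan, n_inf)  # 1 1
```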
Glad to receive your reply.
The losses are NaN right from the start:
[0/20001] [0/46] [('D_loss_gan', nan), ('G_loss_gan', nan), ('cycle_loss', 0.016952034085989), ('ee_loss', nan), ('rec_loss_0', nan), ('rec_loss_1', nan)]
[0/20001] [1/46] [('D_loss_gan', nan), ('G_loss_gan', nan), ('cycle_loss', nan), ('ee_loss', nan), ('rec_loss_0', nan), ('rec_loss_1', nan)]
Since I'm reproducing this on Windows, could that be the cause? [Things went smoothly at the beginning], but when training reached around epoch 1200 a system-permission error suddenly appeared. Because I want to reproduce this on Windows as far as possible, I hope you can give me some more suggestions for training under Windows.
Thanks again for your reply.
I tried lowering the learning rate: only the very first cycle_loss had a value, everything after that didn't, and it was always NaN. I also checked the data processed in the previous step and found no NaN. ![A33}O3W~`}$X3ISG@RBEZ6](https://github.com/YoungSeng/UnifiedGesture/assets/120563008/3bbac772-9884-42f3-a34c-7d594b7c59a6) These are the dataset and the processed files I used; I reprocessed them, but it still doesn't work.
That is indeed a bit odd. Use pdb to check whether a batch from the dataloader contains NaN; you'll need to debug to see where the problem is. It might also be a Windows path issue or similar. Could you try Linux?
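For reference, the batch check suggested above could look like this. This is a sketch using NumPy arrays as a stand-in for the project's tensors; in the real code you would run `torch.isnan(batch).any()` on each tensor inside a `pdb.set_trace()` session:

```python
import numpy as np

def batch_has_nan(batch):
    """Return True if any array in the batch contains NaN."""
    return any(np.isnan(np.asarray(x)).any() for x in batch)

clean = [np.zeros((2, 3)), np.ones(4)]
dirty = [np.zeros(2), np.array([1.0, np.nan])]
print(batch_has_nan(clean), batch_has_nan(dirty))  # False True
```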
No NaN showed up under pdb either. Looks like I'll have to try Linux.
That really is strange. I'll also try and see whether I get the same behavior.
Did you solve it? I tried it and it runs normally:
(UnifiedGesture) [yangsc21@mjrc-server10 retargeting]$ python train.py --save_dir=./my_model/ --cuda_device 'cuda:0'
load from file ./datasets/Trinity_ZEGGS/Trinity.npy
Window count: 756, total frame (without downsampling): 25557
full_fill [1, 0]
load from file ./datasets/Trinity_ZEGGS/ZEGGS.npy
Window count: 119, total frame (without downsampling): 4130
full_fill [0, 1]
full_fill [1, 0]
full_fill [1, 0]
full_fill [0, 1]
full_fill [0, 1]
[0/20001] [0/1] [('D_loss_gan', 0.5085693001747131), ('G_loss_gan', 0.5059844851493835), ('cycle_loss', 0.27006134390830994), ('ee_loss', 1.3071155548095703), ('rec_loss_0', 2.8865870307124117), ('rec_loss_1', 1.5507260672379002)]
Save at ./my_model/models/topology0/0 succeed!
Save at ./my_model/models/topology1/0 succeed!
[1/20001] [0/1] [('D_loss_gan', 0.5062617063522339), ('G_loss_gan', 0.5016738176345825), ('cycle_loss', 0.24978837370872498), ('ee_loss', 1.289689540863037), ('rec_loss_0', 2.7880526189114314), ('rec_loss_1', 1.5224555518705944)]
[2/20001] [0/1] [('D_loss_gan', 0.5047091245651245), ('G_loss_gan', 0.4973878264427185), ('cycle_loss', 0.24400931596755981), ('ee_loss', 1.248523235321045), ('rec_loss_0', 2.722173921802817), ('rec_loss_1', 1.4981491062017906)]
[3/20001] [0/1] [('D_loss_gan', 0.5023590922355652), ('G_loss_gan', 0.4945446252822876), ('cycle_loss', 0.24701882898807526), ('ee_loss', 1.1849168539047241), ('rec_loss_0', 2.657897939609255), ('rec_loss_1', 1.4709262551140196)]
[4/20001] [0/1] [('D_loss_gan', 0.5002940893173218), ('G_loss_gan', 0.49264442920684814), ('cycle_loss', 0.2600647807121277), ('ee_loss', 1.0854196548461914), ('rec_loss_0', 2.579435413626445), ('rec_loss_1', 1.4372475559332663)]
...
Thanks again for your reply. Due to limited hardware I can't run this on Linux for now. However, after I adjusted the norm and height values (adding 0.00000000001 to each), the losses of the very first iteration ([0/20001] [0/46]) are no longer NaN, but from the second iteration onward ([0/20001] [1/46]) they become NaN again, which puzzles me. The problem is still unsolved (running on Windows).
I also tried the other project's code you gave me earlier (this time on Linux, since that project can run on CPU only, without an NVIDIA GPU). It is still training now (judging by the generated log files, the loss is no longer NaN, so it runs successfully). That makes me even more puzzled: could it really just be the operating system 🤔🤔🤔?
That really is strange then. Could there be NaN in the data-processing stage?
So far the only thing I've found is that norm and height can be zero; after adding a tiny value to them I got this change, but subsequent training still fails. I'm wondering whether it could be gradient explosion, or I could try a different optimizer 🤔🤔🤔
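If gradient explosion is the suspicion, clipping the global gradient norm is the usual first thing to try; in PyTorch this is a single call to `torch.nn.utils.clip_grad_norm_` after `backward()`. The underlying operation, sketched here in NumPy for illustration:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0])]                 # global norm is 5
clipped, norm = clip_by_global_norm(grads, 1.0)
print(norm, clipped[0])                        # 5.0 [0.6 0.8]
```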
It works! I added a value after norm and height (I probably missed a spot last time, which is why it failed after one iteration).
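The fix described above amounts to guarding a division by zero with a small epsilon. Schematically (the names `norm` and the 1e-11 constant are from this thread; the rest is a sketch, not the repo's exact code):

```python
import numpy as np

EPS = 1e-11  # the small constant added in this thread

def safe_normalize(x, norm):
    """Divide by norm without producing NaN/Inf when norm is exactly zero."""
    return x / (norm + EPS)

x = np.array([1.0, 2.0])
result = safe_normalize(x, 0.0)      # huge but finite, no NaN
print(np.isnan(result).any())        # False
```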
Congratulations! Glad it's solved!
Thank you very much!!!
Hello, I have a question: if training was interrupted after roughly 3,800 epochs and I restart it from epoch 3,800, will that affect the training in any way?
As long as you continue training from the saved model and optimizer states it should be fine; just make sure the iteration counter is set correctly.
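Resuming correctly means restoring the model weights, the optimizer state, and the iteration counter together. A generic sketch of that pattern (this is not this repo's exact checkpoint format, which should be checked in train.py; with PyTorch you would store `model.state_dict()` and `optimizer.state_dict()` via `torch.save`):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, epoch, model_state, optim_state):
    """Persist everything needed to resume: weights, optimizer state, epoch."""
    with open(path, "wb") as f:
        pickle.dump({"epoch": epoch, "model": model_state, "optim": optim_state}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), "ckpt.pkl")
save_checkpoint(path, 3800, {"w": [0.1]}, {"lr": 1e-4})

ckpt = load_checkpoint(path)
start_epoch = ckpt["epoch"] + 1  # continue from the next epoch, not from 0
print(start_epoch)  # 3801
```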
The data.mdb file in this folder is only 8 KB, so the LMDB dataset samples weren't generated correctly and num_samples=0, while the other data.mdb files are 200+ MB or 900+ MB. So I'd like to ask: how can I fix this 8 KB data.mdb problem?
There shouldn't be an _cache_WavLM_36_aux lmdb file, should there? According to the README, step 3.3 Training VQVAE model produces an lmdb in the ./retargeting/datasets/Trinity_ZEGGS/bvh2upper_lower_root/lmdb_latent_vel/ folder, and step 3.4 Training diffusion model produces one in ./dataset/all_lmdb_aux/. Which step exactly raised the error?
Every folder contains data.mdb and lock.mdb. In the init method of the TrinityDataset class I added a print statement, which shows: Number of samples in LMDB: 0
The folder named _cache_WavLM_36_aux was created here.
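A tiny sanity check for the 8 KB symptom is to compare data.mdb sizes before training. This is a heuristic sketch; the 1 MB threshold is an arbitrary assumption (the healthy files in this thread are hundreds of MB), and `lmdb.open(...).stat()["entries"]` would give the exact sample count:

```python
import os

def mdb_looks_empty(lmdb_dir, min_bytes=1_000_000):
    """Heuristic: a data.mdb far below min_bytes likely holds no samples."""
    data_file = os.path.join(lmdb_dir, "data.mdb")
    return (not os.path.exists(data_file)) or os.path.getsize(data_file) < min_bytes

# hypothetical usage:
# print(mdb_looks_empty("../dataset/all_lmdb_aux/lmdb_train/_cache_WavLM_36_aux"))
```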
Then the lmdb for step 3.4 Training diffusion model probably wasn't generated correctly. I debugged using two files per dataset; running step 3.4 gives:
(UnifiedGesture) [yangsc21@mjrc-server12 UnifiedGesture]$ python process_code.py
Recording_002
Recording_006
067_Speech_2_mirror_x_1_0
067_Speech_2_x_1_0
(UnifiedGesture) [yangsc21@mjrc-server12 UnifiedGesture]$ python ./make_lmdb.py --base_path ./dataset/
Recording_002
(1612,) (3225, 363) (1, 6880000)
sys:1: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
Recording_006
(1582,) (3164, 363) (1, 6749867)
sys:1: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
The resulting lmdb files are shown below.
It runs normally. Check whether the lmdb files from step 3.4 were generated correctly, i.e., these two lines from the README:
python process_code.py
python ./make_lmdb.py --base_path ./dataset/
Step 3.4 Training diffusion model is correct, but after running python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion", the lmdb files under the _cache_WavLM_36_aux folder are not generated correctly.
Then the problem is in that later step.
Here is what I get after running python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion":
The lmdb_test folder looks similar.
INFO:root:Reading data '../dataset/all_lmdb_aux/lmdb_train/'...
INFO:WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True}
end2end.py:40: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
pose_resampling_fps=args.motion_resampling_framerate, model='WavLM_36_aux') # , model='Long_1200'
/ceph/hdd/yangsc21/Python/UnifiedGesture/diffusion_latent/data_loader/data_preprocessor.py:33: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
wav_input_16khz = torch.from_numpy(wav_input_16khz).to(device)
/ceph/hdd/yangsc21/Python/UnifiedGesture/diffusion_latent/data_loader/lmdb_data_loader.py:44: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
data_sampler.run()
no. of samples: 1042
INFO:root:Reading data '../dataset/all_lmdb_aux/lmdb_test/'...
INFO:WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True}
end2end.py:47: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
pose_resampling_fps=args.motion_resampling_framerate, model='WavLM_36_aux') # , model='Long_1200'
no. of samples: 1063
INFO:root:len of train loader:4, len of test loader:4
USE WAVLM
TRANS_ENC init
EMBED STYLE BEGIN TOKEN
Cross Local Attention3
Starting epoch 0
0%| | 0/4 [00:00<?, ?it/s]
/ceph/hdd/yangsc21/miniconda3/envs/UnifiedGesture/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py:44: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
data = [self.dataset[idx] for idx in possibly_batched_index]
Logging to /tmp/openai-2024-01-15-19-38-15-574846
step[0]: loss[0.16063]
saving model...
75%|███████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:02<00:00, 1.66it/s]
75%|███████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:02<00:00, 1.17it/s]
Everything after that is fine.
My suggestion is this: delete the _cache_WavLM_36_aux folders under lmdb_train and lmdb_test, then rerun python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion". This step generates cache files for all the inputs; the first run is slow (roughly 20-30 minutes), but subsequent runs are fast. You currently have an incorrect 8 KB file, and my guess is that a previously mis-generated cache file is preventing regeneration.

By deleting the cache and changing (cuda:2 → cuda:0) it was generated correctly! But the epoch in --config=./configs/all_data.yml is 500, while my training looks like this: do I need to reach step=1000000?
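The cache-deletion step above can be scripted. A sketch (the folder names are the ones from this thread; `shutil.rmtree` is destructive, hence the existence check before removing anything):

```python
import os
import shutil

def drop_cache(base="../dataset/all_lmdb_aux"):
    """Remove stale _cache_WavLM_36_aux folders so end2end.py regenerates them."""
    removed = []
    for split in ("lmdb_train", "lmdb_test"):
        cache = os.path.join(base, split, "_cache_WavLM_36_aux")
        if os.path.isdir(cache):
            shutil.rmtree(cache)
            removed.append(cache)
    return removed
```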
That epoch setting doesn't seem to matter; I didn't set it. Yes, try the 1,000,000-step model. Congratulations on solving it; the code is admittedly a bit messy.
Here I trained to step=1419100, paused, and then ran the next step, but at # train reward model → python reward_model_policy.py an index-out-of-bounds error occurred. I added debug prints:
print(f"len(pre_expert_data): {len(pre_expert_data)}")
print(f"level_i: {level_i}, index_i: {index_i}")
The output is:
len(pre_expert_data): 10
level_i: 5, index_i: 14258
My understanding is this: level_i is 5 and index_i is 14258, which means the code accessed the 14259th trajectory (index_i 14258) of the 6th noise level (level_i 5, counting from 0), but pre_expert_data has only 10 noise levels with roughly 2,000 trajectories each. Does this mean the out-of-bounds error is caused by my step count being too small?
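To confirm exactly which bound is violated (the level or the trajectory index), a defensive accessor helps. A sketch using the thread's reported shapes as made-up test data; `safe_trajectory` is a name I've introduced, not a function in the repo:

```python
def safe_trajectory(pre_expert_data, level_i, index_i):
    """Fetch one trajectory, reporting exactly which bound was violated."""
    if level_i >= len(pre_expert_data):
        raise IndexError(f"level_i={level_i} but only {len(pre_expert_data)} noise levels")
    level = pre_expert_data[level_i]
    if index_i >= len(level):
        raise IndexError(f"index_i={index_i} but level {level_i} has {len(level)} trajectories")
    return level[index_i]

# the shapes reported in this thread: 10 noise levels x ~2000 trajectories
data = [list(range(2000)) for _ in range(10)]
try:
    safe_trajectory(data, 5, 14258)
except IndexError as e:
    print(e)  # index_i=14258 but level 5 has 2000 trajectories
```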
At this point you can try the sample from Quick start with your own trained model. As for this RL question about the fine-tuned results, let me ask @zerlinwang for help.
Requesting help: I swapped in my trained model at this step — cd ../retargeting/ python demo.py --target ZEGGS --input_file "../diffusion_latent/result_quick_start/Trinity/005_Neutral_4_x_1_0_minibatch1080[0, 0, 0, 0, 0, 3, 0]_123456_recon.npy" --ref_path './datasets/bvh2latent/ZEGGS/065_Speech_0_x_1_0.npy' --output_path '../result/inference/Trinity/' --cuda_device cuda:0 — and modified some code so that it iterates over all the .npy files under --input_file "../diffusion_latent/result_quick_start/Trinity/" and --ref_path './datasets/bvh2latent/ZEGGS/', but I got an error. Did I misunderstand? Isn't this how the sample in Quick start is supposed to be run?
Try to keep the naming consistent with Quick start; the retargeting code is admittedly a bit messy. The error shown here seems to be a dimension problem at line 124. Try leaving the code unmodified and driving it with subprocess.run instead.
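Driving demo.py over every .npy file in the folder without touching its code could look like this. A sketch: the paths and flags are copied from the command in this thread, while `build_cmd` and the constants are names I've introduced:

```python
import glob
import subprocess
import sys

INPUT_DIR = "../diffusion_latent/result_quick_start/Trinity/"
REF_PATH = "./datasets/bvh2latent/ZEGGS/065_Speech_0_x_1_0.npy"

def build_cmd(npy_file):
    """Assemble one demo.py invocation for a single input file."""
    return [sys.executable, "demo.py",
            "--target", "ZEGGS",
            "--input_file", npy_file,
            "--ref_path", REF_PATH,
            "--output_path", "../result/inference/Trinity/",
            "--cuda_device", "cuda:0"]

for npy in sorted(glob.glob(INPUT_DIR + "*.npy")):
    subprocess.run(build_cmd(npy), check=True)  # one demo.py run per file
```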
But that way I can only run one specified .npy file at a time. Is it enough to take these results and compare them against the results from before I swapped in my own trained model? I have one more question: after importing the resulting .bvh file into Blender, can I combine it with any .wav audio file and see the speech-driven effect?
To run inference on every file in the folder you may need to write a bit of code yourself; I don't remember whether a script for that already exists. Yes, the point is to see how your retrained model performs. And yes, Blender can do that; see here, around 1:14 there is the step of inserting the audio.
Hello, sorry for the wait; I just got back from the Spring Festival holiday. Could you tell me which parameters/models/steps you changed relative to the steps given in the README? That would help me reproduce your error.
I ran into some difficulties with train.py when training on the Trinity and ZEGGS datasets:
the loss value is always "nan", and I don't understand why.