YoungSeng / UnifiedGesture

UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons (ACM MM 2023 Oral)
BSD 2-Clause "Simplified" License
49 stars 2 forks

Problems reproducing the code #6

Open YoungSeng opened 6 months ago

YoungSeng commented 6 months ago

I ran into some difficulties with [train.py when training on the Trinity and ZEGGS datasets]:

(screenshot)

The loss value is always "nan", and I don't understand why.

YoungSeng commented 6 months ago

My suggestions:

  1. Provide more details:
     1.1 Does the nan only appear after 200 epochs, or right from the start? If the former, try retraining with a smaller learning rate.
     1.2 Check whether the data produced by the previous processing step contains nan.
  2. Or try reproducing the code and compare it against this project's code to see where the problem is.
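A minimal sketch of check 1.2, assuming the processed data is stored as numeric `.npy` files (the directory name is illustrative):

```python
# Scan a directory of processed .npy files for NaN/Inf values.
# `data_dir` is a placeholder; point it at your processed dataset,
# e.g. ./datasets/Trinity_ZEGGS/.
import glob
import os

import numpy as np

def find_bad_files(data_dir):
    """Return the .npy files whose arrays contain NaN or Inf."""
    bad = []
    for path in sorted(glob.glob(os.path.join(data_dir, "*.npy"))):
        arr = np.asarray(np.load(path), dtype=np.float64)
        if not np.isfinite(arr).all():
            bad.append(path)
    return bad
```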
fairy1999-good commented 6 months ago

> My suggestions:
>
>   1. Provide more details:
>      1.1 Does the nan only appear after 200 epochs, or right from the start? If the former, try retraining with a smaller learning rate.
>      1.2 Check whether the data produced by the previous processing step contains nan.
>   2. Or try reproducing the code and compare it against this project's code to see where the problem is.

Glad to receive your reply. It is nan right from the start:

[0/20001] [0/46] [('D_loss_gan', nan), ('G_loss_gan', nan), ('cycle_loss', 0.016952034085989), ('ee_loss', nan), ('rec_loss_0', nan), ('rec_loss_1', nan)]
[0/20001] [1/46] [('D_loss_gan', nan), ('G_loss_gan', nan), ('cycle_loss', nan), ('ee_loss', nan), ('rec_loss_0', nan), ('rec_loss_1', nan)]

Could this be because I am reproducing the code on Windows? [Everything went smoothly early on], but when training reached around epoch 1200 a system-permission error suddenly appeared. Since I want to reproduce this on Windows as far as possible, I hope you can offer some further suggestions for training on Windows.

(screenshot)

Thanks again for your reply.

fairy1999-good commented 6 months ago

I tried lowering the learning rate; only the very first cycle_loss had a value, everything afterwards was nan. (screenshots) I also checked the data processed in the previous step, and it contains no nan. (screenshot) ![A33}O3W~`}$X3ISG@RBEZ6](https://github.com/YoungSeng/UnifiedGesture/assets/120563008/3bbac772-9884-42f3-a34c-7d594b7c59a6) These are the dataset and the processed files I used. I reprocessed the data, but it still does not work.

fairy1999-good commented 6 months ago

(screenshots)

YoungSeng commented 6 months ago

That is indeed odd. Use pdb to check whether a batch from the dataloader contains any NaN; you will need to debug to see where the problem is. It could also be a Windows-specific issue, e.g. with paths. Could you try Linux?
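The pdb check can also be scripted: the sketch below pulls a few batches from a DataLoader and fails loudly on the first non-finite tensor. The `TensorDataset` is only a stand-in for the project's real dataset.

```python
# Check that batches coming out of a DataLoader are NaN/Inf-free.
import torch
from torch.utils.data import DataLoader, TensorDataset

def assert_finite_batches(loader, max_batches=5):
    """Raise if any tensor in the first few batches contains NaN/Inf."""
    for i, batch in enumerate(loader):
        if i >= max_batches:
            break
        tensors = batch if isinstance(batch, (list, tuple)) else [batch]
        for t in tensors:
            if torch.is_tensor(t) and not torch.isfinite(t).all():
                raise ValueError(f"non-finite values in batch {i}")

# If the data is clean, torch.autograd.set_detect_anomaly(True) (slow)
# can name the op that produces the first NaN during backward.
```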

fairy1999-good commented 6 months ago

(screenshot) No NaN showed up under pdb either; looks like I'll have to try Linux.

YoungSeng commented 6 months ago

That is indeed strange. I'll try it again on my side and see whether I hit the same problem.

YoungSeng commented 6 months ago

Did you solve it? I tried it and it runs normally:

(UnifiedGesture) [yangsc21@mjrc-server10 retargeting]$ python train.py --save_dir=./my_model/ --cuda_device 'cuda:0'

load from file ./datasets/Trinity_ZEGGS/Trinity.npy
Window count: 756, total frame (without downsampling): 25557
full_fill [1, 0]
load from file ./datasets/Trinity_ZEGGS/ZEGGS.npy
Window count: 119, total frame (without downsampling): 4130
full_fill [0, 1]
full_fill [1, 0]
full_fill [1, 0]
full_fill [0, 1]
full_fill [0, 1]
[0/20001]       [0/1]    [('D_loss_gan', 0.5085693001747131), ('G_loss_gan', 0.5059844851493835), ('cycle_loss', 0.27006134390830994), ('ee_loss', 1.3071155548095703), ('rec_loss_0', 2.8865870307124117), ('rec_loss_1', 1.5507260672379002)]
Save at ./my_model/models/topology0/0 succeed!
Save at ./my_model/models/topology1/0 succeed!
[1/20001]       [0/1]    [('D_loss_gan', 0.5062617063522339), ('G_loss_gan', 0.5016738176345825), ('cycle_loss', 0.24978837370872498), ('ee_loss', 1.289689540863037), ('rec_loss_0', 2.7880526189114314), ('rec_loss_1', 1.5224555518705944)]
[2/20001]       [0/1]    [('D_loss_gan', 0.5047091245651245), ('G_loss_gan', 0.4973878264427185), ('cycle_loss', 0.24400931596755981), ('ee_loss', 1.248523235321045), ('rec_loss_0', 2.722173921802817), ('rec_loss_1', 1.4981491062017906)]
[3/20001]       [0/1]    [('D_loss_gan', 0.5023590922355652), ('G_loss_gan', 0.4945446252822876), ('cycle_loss', 0.24701882898807526), ('ee_loss', 1.1849168539047241), ('rec_loss_0', 2.657897939609255), ('rec_loss_1', 1.4709262551140196)]
[4/20001]       [0/1]    [('D_loss_gan', 0.5002940893173218), ('G_loss_gan', 0.49264442920684814), ('cycle_loss', 0.2600647807121277), ('ee_loss', 1.0854196548461914), ('rec_loss_0', 2.579435413626445), ('rec_loss_1', 1.4372475559332663)]
...
fairy1999-good commented 6 months ago

Thanks again for your reply. Due to limited hardware I cannot run this on Linux for now. However, after adjusting the norm and height values (appending 0.00000000001 to each), all the losses in the first training step are no longer nan ([0/20001] [0/46]), but from the second step onwards ([0/20001] [1/46]) they turn back into nan, which puzzles me again. I still have not solved this (running on Windows).


fairy1999-good commented 6 months ago

I also tried the code of the other project you pointed me to earlier (this time on Linux, since that project can run on CPU without an NVIDIA GPU). It is training right now (judging from the generated log files, the loss is no longer nan, so it runs successfully). So I am even more puzzled: can it really just be the operating system? 🤔


YoungSeng commented 6 months ago

That is indeed odd. Could there be NaN somewhere in the data-processing stage?

fairy1999-good commented 6 months ago

So far I have only found that norm and height can be zero, so I added a tiny value to them, which led to this change, but later training still fails. I am wondering whether it could be exploding gradients, or I might try a different optimizer. 🤔
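If exploding gradients are the suspect, clipping the global gradient norm before each optimizer step is a cheap experiment. This is a generic sketch, not the project's actual training loop; the model and optimizer are placeholders.

```python
# Clip the global gradient norm before each optimizer step so a single
# bad batch cannot blow the weights up to NaN.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Cap the total gradient norm; returns the pre-clip norm for logging.
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item(), grad_norm.item()
```

Logging the returned pre-clip norm also tells you whether gradients were actually exploding.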


fairy1999-good commented 6 months ago

> That is indeed odd. Could there be NaN somewhere in the data-processing stage?

It works now! I added a small value after norm and height (I probably missed a spot last time, which is why it failed again after one step). (screenshots)
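The fix described here amounts to guarding the denominators. A minimal sketch, with illustrative variable names; the constant matches the 0.00000000001 mentioned above:

```python
# Guard normalization denominators against zeros so a zero-length
# vector or zero skeleton height cannot produce NaN.
import numpy as np

EPS = 1e-11  # the small constant from the thread (0.00000000001)

def safe_normalize(vectors):
    """Unit-normalize rows without producing NaN for zero-length rows."""
    norm = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / (norm + EPS)

def scale_by_height(positions, height):
    """Height-normalize positions, guarding a zero height value."""
    return positions / (height + EPS)
```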

YoungSeng commented 6 months ago

Congratulations! Glad it's solved!

fairy1999-good commented 6 months ago

Thank you very much!!!


fairy1999-good commented 6 months ago

Hello, I have a question for you: if training is interrupted after about 3800 epochs and I then resume from epoch 3800, will doing so affect the training in any way?


YoungSeng commented 6 months ago

As long as you continue training with the saved model and optimizer it should be fine; just make sure the iteration count is set correctly.
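A generic PyTorch resume sketch along these lines; the checkpoint keys and file name are assumptions, not the project's exact format:

```python
# Save and restore model + optimizer + epoch counter so training can
# resume with correct iteration numbering.
import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, epoch):
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # epoch to resume from
```

Restoring the optimizer state matters for Adam-style optimizers, whose moment estimates would otherwise restart from zero.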

fairy1999-good commented 5 months ago

(screenshot) The data.mdb file in this folder is only 8 KB; the LMDB dataset samples were not generated correctly, so num_samples=0. (screenshot) The other data.mdb files are 200-odd MB or 900-odd MB. So my question is: how do I fix the 8 KB data.mdb problem?

YoungSeng commented 5 months ago

There shouldn't be a _cache_WavLM_36_aux lmdb at that point, right? According to the README, step 3.3 (Training VQVAE model) produces an lmdb in the ./retargeting/datasets/Trinity_ZEGGS/bvh2upper_lower_root/lmdb_latent_vel/ folder, and step 3.4 (Training diffusion model) produces one in ./dataset/all_lmdb_aux/. At which step exactly does the error occur?
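One quick way to spot a broken LMDB before training is a file-size sanity check; the 1 MB threshold below is an arbitrary heuristic based on healthy data.mdb files being hundreds of MB:

```python
# Heuristic check: an 8 KB data.mdb almost always means the dataset
# was written empty. Folder layout follows the README; adjust paths.
import os

def lmdb_looks_empty(lmdb_dir, min_bytes=1_000_000):
    """True if data.mdb is missing or suspiciously small."""
    data_file = os.path.join(lmdb_dir, "data.mdb")
    return (not os.path.exists(data_file)
            or os.path.getsize(data_file) < min_bytes)
```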

fairy1999-good commented 5 months ago

(screenshots) Every folder contains data.mdb and lock.mdb. In the __init__ method of the TrinityDataset class I added a print statement, and it shows Number of samples in LMDB: 0.

fairy1999-good commented 5 months ago

(screenshot) A folder named _cache_WavLM_36_aux is created here.

YoungSeng commented 5 months ago

Then the lmdb from step 3.4 (Training diffusion model) was probably not generated correctly. I debug each dataset with two files; running step 3.4 gives the following:

(UnifiedGesture) [yangsc21@mjrc-server12 UnifiedGesture]$ python process_code.py
Recording_002
Recording_006
067_Speech_2_mirror_x_1_0
067_Speech_2_x_1_0
(UnifiedGesture) [yangsc21@mjrc-server12 UnifiedGesture]$ python ./make_lmdb.py --base_path ./dataset/
Recording_002
(1612,) (3225, 363) (1, 6880000)
sys:1: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
Recording_006
(1582,) (3164, 363) (1, 6749867)
sys:1: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.

The resulting lmdb files look like this:

(screenshots)

It runs normally. Can you check whether the lmdb files from step 3.4 were generated correctly? That is, these two lines from the README:

python process_code.py
python ./make_lmdb.py --base_path ./dataset/
fairy1999-good commented 5 months ago

Step 3.4 (Training diffusion model) is correct. (screenshots) But after running python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion", the lmdb files under the _cache_WavLM_36_aux folder are not generated correctly. (screenshots)

YoungSeng commented 5 months ago

Then the problem is in that later step.

Running python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion" here looks like this:

(screenshots)

The lmdb_test folder looks similar:

INFO:root:Reading data '../dataset/all_lmdb_aux/lmdb_train/'...
INFO:WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True}
end2end.py:40: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
  pose_resampling_fps=args.motion_resampling_framerate, model='WavLM_36_aux')        # , model='Long_1200'
/ceph/hdd/yangsc21/Python/UnifiedGesture/diffusion_latent/data_loader/data_preprocessor.py:33: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
  wav_input_16khz = torch.from_numpy(wav_input_16khz).to(device)
/ceph/hdd/yangsc21/Python/UnifiedGesture/diffusion_latent/data_loader/lmdb_data_loader.py:44: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
  data_sampler.run()
no. of samples:  1042
INFO:root:Reading data '../dataset/all_lmdb_aux/lmdb_test/'...
INFO:WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True}
end2end.py:47: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
  pose_resampling_fps=args.motion_resampling_framerate, model='WavLM_36_aux')         # , model='Long_1200'
no. of samples:  1063
INFO:root:len of train loader:4, len of test loader:4
USE WAVLM
TRANS_ENC init
EMBED STYLE BEGIN TOKEN
Cross Local Attention3
Starting epoch 0
  0%|                                                                                                                  | 0/4 [00:00<?, ?it/s]
/ceph/hdd/yangsc21/miniconda3/envs/UnifiedGesture/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py:44: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
  data = [self.dataset[idx] for idx in possibly_batched_index]
Logging to /tmp/openai-2024-01-15-19-38-15-574846
step[0]: loss[0.16063]
saving model...
 75%|███████████████████████████████████████████████████████████████████████████████▌                          | 3/4 [00:02<00:00,  1.66it/s]
 75%|███████████████████████████████████████████████████████████████████████████████▌                          | 3/4 [00:02<00:00,  1.17it/s]

Everything after that runs fine.

My suggestions:

  1. First delete the _cache_WavLM_36_aux folders under lmdb_train and lmdb_test, then rerun python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion". This step builds a cache of all files; the first run is slow (I'd estimate 20 to 30 minutes), but later runs are fast. You currently have a bad 8 KB file; my guess is that an earlier, incorrectly generated cache file is preventing it from being regenerated.
  2. If the problem persists, check the code that generates the cache at this step:

https://github.com/YoungSeng/UnifiedGesture/blob/8a405419c8f109248fcc8c9fe791368f647f904f/data_loader/lmdb_data_loader.py#L50
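The cache cleanup in suggestion 1 can be scripted; a sketch that removes the stale cache folders so the next end2end.py run rebuilds them (folder names follow the README layout, adjust `base_dir` to your setup):

```python
# Delete stale _cache_WavLM_36_aux folders under lmdb_train/ and
# lmdb_test/ so the data loader regenerates the feature cache.
import os
import shutil

def clear_wavlm_cache(base_dir):
    """Remove cached-feature LMDB folders; return the paths removed."""
    removed = []
    for split in ("lmdb_train", "lmdb_test"):
        cache = os.path.join(base_dir, split, "_cache_WavLM_36_aux")
        if os.path.isdir(cache):
            shutil.rmtree(cache)
            removed.append(cache)
    return removed
```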

fairy1999-good commented 5 months ago

After deleting the cache and changing cuda:2 to cuda:0, it is generated correctly! But --config=./configs/all_data.yml sets epoch to 500, while my training looks like this: (screenshot) Do I need to reach step=1000000?

YoungSeng commented 5 months ago

I didn't set that epoch value; I don't think it matters. Yes, try the 1,000,000-step model. Congratulations on solving it; the code is admittedly a bit messy.

fairy1999-good commented 5 months ago

(screenshot) I trained to step=1419100, paused, and then ran: (screenshots) But at the step

# train reward model
python reward_model_policy.py

an index-out-of-bounds error occurred. (screenshot) I then added debug output:

print(f"len(pre_expert_data): {len(pre_expert_data)}")
print(f"level_i: {level_i}, index_i: {index_i}")

which prints:

len(pre_expert_data): 10
level_i: 5, index_i: 14258

My reading: level_i is 5 and index_i is 14258, so the code is accessing the 14259th trajectory (index_i=14258) of the 6th noise level (counting from 0) in pre_expert_data, but pre_expert_data has only 10 noise levels with roughly 2000 trajectories each. Does this mean my step count was too small, causing the out-of-bounds access?
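As a debugging aid, the suspect access can be wrapped in a bounds check that reports the mismatch instead of crashing. The names `pre_expert_data`, `level_i`, and `index_i` mirror the debug prints, but the real structure in reward_model_policy.py may differ:

```python
# Defensive accessor around the nested trajectory lookup, so an
# out-of-range index produces a readable error message.
def get_trajectory(pre_expert_data, level_i, index_i):
    """Fetch trajectory index_i from noise level level_i."""
    if level_i >= len(pre_expert_data):
        raise IndexError(
            f"level_i={level_i} but only {len(pre_expert_data)} noise levels")
    level = pre_expert_data[level_i]
    if index_i >= len(level):
        raise IndexError(
            f"level {level_i} holds {len(level)} trajectories, "
            f"but index_i={index_i} was requested")
    return level[index_i]
```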

YoungSeng commented 5 months ago

At this point you can try the sample from the Quick start with your own trained model; as for this RL problem with the fine-tuned results, let me ask @zerlinwang to take a look.

fairy1999-good commented 5 months ago

Asking for help: I replaced the trained model. In this step,

cd ../retargeting/
python demo.py --target ZEGGS --input_file "../diffusion_latent/result_quick_start/Trinity/005_Neutral_4_x_1_0_minibatch1080[0, 0, 0, 0, 0, 3, 0]_123456_recon.npy" --ref_path './datasets/bvh2latent/ZEGGS/065_Speech_0_x_1_0.npy' --output_path '../result/inference/Trinity/' --cuda_device cuda:0

I changed some code so that it iterates over all .npy files under --input_file "../diffusion_latent/result_quick_start/Trinity/" and --ref_path './datasets/bvh2latent/ZEGGS/', but I got an error. (screenshot) Did I misunderstand? Is this not how the sample in Quick start is supposed to be run?

YoungSeng commented 5 months ago

Try to keep the naming consistent with the Quick start; the retargeting code is admittedly a bit messy. The error seems to say the dimension at line 124 is wrong. Try running it via subprocess.run first, without modifying the code.

fairy1999-good commented 5 months ago

But that way I can only run a single specified .npy file. (screenshot) Is it enough to compare these results against the results from before I swapped in my own trained model? (screenshot) One more question: if I import the resulting .bvh file into Blender and combine it with an arbitrary .wav audio file, can I then see the speech-driven result?

YoungSeng commented 5 months ago

To run inference on every file in a folder you will probably need to write some code yourself; I don't remember whether that exists. Yes, that lets you see how your retrained model performs. And yes, Blender can do it; you can refer to this, around 1:14 there is the step of inserting the audio.
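A sketch of the folder-wide inference idea: shell out to demo.py once per file via subprocess.run, keeping the original script untouched. The CLI flags mirror the Quick start command above; the paths are illustrative.

```python
# Run demo.py once per latent .npy file in a folder.
import glob
import os
import subprocess

def build_commands(input_dir, ref_path, output_path):
    """Construct one demo.py command line per .npy file, in sorted order."""
    cmds = []
    for npy in sorted(glob.glob(os.path.join(input_dir, "*.npy"))):
        cmds.append(["python", "demo.py",
                     "--target", "ZEGGS",
                     "--input_file", npy,
                     "--ref_path", ref_path,
                     "--output_path", output_path,
                     "--cuda_device", "cuda:0"])
    return cmds

def retarget_all(input_dir, ref_path, output_path):
    for cmd in build_commands(input_dir, ref_path, output_path):
        subprocess.run(cmd, check=True)  # stop on the first failing file
```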

zerlinwang commented 4 months ago

> (screenshot) I trained to step=1419100, paused, and then ran: (screenshots) But at the step # train reward model / python reward_model_policy.py an index-out-of-bounds error occurred. (screenshot) I then added debug output: print(f"len(pre_expert_data): {len(pre_expert_data)}") print(f"level_i: {level_i}, index_i: {index_i}") which prints: len(pre_expert_data): 10 / level_i: 5, index_i: 14258. My reading: level_i is 5 and index_i is 14258, so the code is accessing the 14259th trajectory (index_i=14258) of the 6th noise level (counting from 0) in pre_expert_data, but pre_expert_data has only 10 noise levels with roughly 2000 trajectories each. Does this mean my step count was too small, causing the out-of-bounds access?

Hello, sorry for the wait; I just got back from the Spring Festival holiday. Could you tell me which parameters/models/steps you mainly changed relative to the steps given in the README? That will help me reproduce your error.