Open YoungSeng opened 6 months ago
I have the following suggestions:
- Provide more details: 1.1 For example, does the NaN only appear after epoch 200, or is it there from the very start? If it is there from the start, try retraining with a lower learning rate. 1.2 Check whether the data produced by the preceding preprocessing step contains NaN.
- Alternatively, try reproducing that reference code and compare it against this project's code to see where the problem is.
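Suggestion 1.2 above can be automated with a quick scan of the preprocessed `.npy` files for NaN/Inf. A minimal sketch (the commented `np.load` path is the one that appears later in this thread; substitute your own files):

```python
import numpy as np

def count_bad_values(arr):
    """Count NaN and Inf entries in an array."""
    return int(np.isnan(arr).sum()), int(np.isinf(arr).sum())

# Replace this synthetic array with e.g.
#   data = np.load("./datasets/Trinity_ZEGGS/Trinity.npy")
data = np.array([[0.1, np.nan], [np.inf, 2.0]])
n_nan, n_inf = count_bad_values(data)
print(n_nan, n_inf)  # 1 1
```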
Glad to receive your reply.
The losses are NaN right from the start:
[0/20001] [0/46] [('D_loss_gan', nan), ('G_loss_gan', nan), ('cycle_loss', 0.016952034085989), ('ee_loss', nan), ('rec_loss_0', nan), ('rec_loss_1', nan)]
[0/20001] [1/46] [('D_loss_gan', nan), ('G_loss_gan', nan), ('cycle_loss', nan), ('ee_loss', nan), ('rec_loss_0', nan), ('rec_loss_1', nan)]
Since I'm reproducing this on Windows, could that be the cause? [Things went smoothly at the beginning], but when training reached around epoch 1200 a system-permission error suddenly appeared. Because I want to reproduce this on Windows as far as possible, I hope you can give me some more suggestions for training under Windows.
Thanks again for your reply.
I tried lowering the learning rate: only the very first cycle_loss had a value, everything after that didn't, and it was always NaN. I also checked the data processed in the previous step and found no NaN. ![A33}O3W~`}$X3ISG@RBEZ6](https://github.com/YoungSeng/UnifiedGesture/assets/120563008/3bbac772-9884-42f3-a34c-7d594b7c59a6) These are the dataset and the processed files I used; I reprocessed them, but it still doesn't work.
That is indeed a bit odd. Use pdb to check whether a batch from the dataloader contains NaN; you'll need to debug to see where the problem is. It might also be a Windows path issue or similar. Could you try Linux?
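For reference, the batch check suggested above could look like this. This is a sketch using NumPy arrays as a stand-in for the project's tensors; in the real code you would run `torch.isnan(batch).any()` on each tensor inside a `pdb.set_trace()` session:

```python
import numpy as np

def batch_has_nan(batch):
    """Return True if any array in the batch contains NaN."""
    return any(np.isnan(np.asarray(x)).any() for x in batch)

clean = [np.zeros((2, 3)), np.ones(4)]
dirty = [np.zeros(2), np.array([1.0, np.nan])]
print(batch_has_nan(clean), batch_has_nan(dirty))  # False True
```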
No NaN showed up under pdb either. Looks like I'll have to try Linux.
That really is strange. I'll also try and see whether I get the same behavior.
Did you solve it? I tried it and it runs normally:
(UnifiedGesture) [yangsc21@mjrc-server10 retargeting]$ python train.py --save_dir=./my_model/ --cuda_device 'cuda:0'
load from file ./datasets/Trinity_ZEGGS/Trinity.npy
Window count: 756, total frame (without downsampling): 25557
full_fill [1, 0]
load from file ./datasets/Trinity_ZEGGS/ZEGGS.npy
Window count: 119, total frame (without downsampling): 4130
full_fill [0, 1]
full_fill [1, 0]
full_fill [1, 0]
full_fill [0, 1]
full_fill [0, 1]
[0/20001] [0/1] [('D_loss_gan', 0.5085693001747131), ('G_loss_gan', 0.5059844851493835), ('cycle_loss', 0.27006134390830994), ('ee_loss', 1.3071155548095703), ('rec_loss_0', 2.8865870307124117), ('rec_loss_1', 1.5507260672379002)]
Save at ./my_model/models/topology0/0 succeed!
Save at ./my_model/models/topology1/0 succeed!
[1/20001] [0/1] [('D_loss_gan', 0.5062617063522339), ('G_loss_gan', 0.5016738176345825), ('cycle_loss', 0.24978837370872498), ('ee_loss', 1.289689540863037), ('rec_loss_0', 2.7880526189114314), ('rec_loss_1', 1.5224555518705944)]
[2/20001] [0/1] [('D_loss_gan', 0.5047091245651245), ('G_loss_gan', 0.4973878264427185), ('cycle_loss', 0.24400931596755981), ('ee_loss', 1.248523235321045), ('rec_loss_0', 2.722173921802817), ('rec_loss_1', 1.4981491062017906)]
[3/20001] [0/1] [('D_loss_gan', 0.5023590922355652), ('G_loss_gan', 0.4945446252822876), ('cycle_loss', 0.24701882898807526), ('ee_loss', 1.1849168539047241), ('rec_loss_0', 2.657897939609255), ('rec_loss_1', 1.4709262551140196)]
[4/20001] [0/1] [('D_loss_gan', 0.5002940893173218), ('G_loss_gan', 0.49264442920684814), ('cycle_loss', 0.2600647807121277), ('ee_loss', 1.0854196548461914), ('rec_loss_0', 2.579435413626445), ('rec_loss_1', 1.4372475559332663)]
...
Thanks again for your reply. Due to limited hardware I can't run this on Linux for now. However, after I adjusted the norm and height values (adding 0.00000000001 to each), the losses of the very first iteration ([0/20001] [0/46]) are no longer NaN, but from the second iteration onward ([0/20001] [1/46]) they become NaN again, which puzzles me. The problem is still unsolved (running on Windows).
I also tried the other project's code you gave me earlier (this time on Linux, since that project can run on CPU only, without an NVIDIA GPU). It is still training now (judging by the generated log files, the loss is no longer NaN, so it runs successfully). That makes me even more puzzled: could it really just be the operating system 🤔🤔🤔?
That really is strange then. Could there be NaN in the data-processing stage?
So far the only thing I've found is that norm and height can be zero; after adding a tiny value to them I got this change, but subsequent training still fails. I'm wondering whether it could be gradient explosion, or I could try a different optimizer 🤔🤔🤔
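If gradient explosion is the suspicion, clipping the global gradient norm is the usual first thing to try; in PyTorch this is a single call to `torch.nn.utils.clip_grad_norm_` after `backward()`. The underlying operation, sketched here in NumPy for illustration:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0])]                 # global norm is 5
clipped, norm = clip_by_global_norm(grads, 1.0)
print(norm, clipped[0])                        # 5.0 [0.6 0.8]
```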
It works! I added a value after norm and height (I probably missed a spot last time, which is why it failed after one iteration).
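The fix described above amounts to guarding a division by zero with a small epsilon. Schematically (the names `norm` and the 1e-11 constant are from this thread; the rest is a sketch, not the repo's exact code):

```python
import numpy as np

EPS = 1e-11  # the small constant added in this thread

def safe_normalize(x, norm):
    """Divide by norm without producing NaN/Inf when norm is exactly zero."""
    return x / (norm + EPS)

x = np.array([1.0, 2.0])
result = safe_normalize(x, 0.0)      # huge but finite, no NaN
print(np.isnan(result).any())        # False
```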
Congratulations! Glad it's solved!
Thank you very much!!!
Hello, I have a question: if training was interrupted after roughly 3,800 epochs and I restart it from epoch 3,800, will that affect the training in any way?
As long as you continue training from the saved model and optimizer states it should be fine; just make sure the iteration counter is set correctly.
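Resuming correctly means restoring the model weights, the optimizer state, and the iteration counter together. A generic sketch of that pattern (this is not this repo's exact checkpoint format, which should be checked in train.py; with PyTorch you would store `model.state_dict()` and `optimizer.state_dict()` via `torch.save`):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, epoch, model_state, optim_state):
    """Persist everything needed to resume: weights, optimizer state, epoch."""
    with open(path, "wb") as f:
        pickle.dump({"epoch": epoch, "model": model_state, "optim": optim_state}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), "ckpt.pkl")
save_checkpoint(path, 3800, {"w": [0.1]}, {"lr": 1e-4})

ckpt = load_checkpoint(path)
start_epoch = ckpt["epoch"] + 1  # continue from the next epoch, not from 0
print(start_epoch)  # 3801
```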
The data.mdb file in this folder is only 8 KB, so the LMDB dataset samples weren't generated correctly and num_samples=0, while the other data.mdb files are 200+ MB or 900+ MB. So I'd like to ask: how can I fix this 8 KB data.mdb problem?
There shouldn't be an _cache_WavLM_36_aux lmdb file, should there? According to the README, step 3.3 Training VQVAE model produces an lmdb in the ./retargeting/datasets/Trinity_ZEGGS/bvh2upper_lower_root/lmdb_latent_vel/ folder, and step 3.4 Training diffusion model produces one in ./dataset/all_lmdb_aux/. Which step exactly raised the error?
Every folder contains data.mdb and lock.mdb. In the init method of the TrinityDataset class I added a print statement, which shows: Number of samples in LMDB: 0
The folder named _cache_WavLM_36_aux was created here.
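A tiny sanity check for the 8 KB symptom is to compare data.mdb sizes before training. This is a heuristic sketch; the 1 MB threshold is an arbitrary assumption (the healthy files in this thread are hundreds of MB), and `lmdb.open(...).stat()["entries"]` would give the exact sample count:

```python
import os

def mdb_looks_empty(lmdb_dir, min_bytes=1_000_000):
    """Heuristic: a data.mdb far below min_bytes likely holds no samples."""
    data_file = os.path.join(lmdb_dir, "data.mdb")
    return (not os.path.exists(data_file)) or os.path.getsize(data_file) < min_bytes

# hypothetical usage:
# print(mdb_looks_empty("../dataset/all_lmdb_aux/lmdb_train/_cache_WavLM_36_aux"))
```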
Then the lmdb for step 3.4 Training diffusion model probably wasn't generated correctly. I debugged using two files per dataset; running step 3.4 gives:
(UnifiedGesture) [yangsc21@mjrc-server12 UnifiedGesture]$ python process_code.py
Recording_002
Recording_006
067_Speech_2_mirror_x_1_0
067_Speech_2_x_1_0
(UnifiedGesture) [yangsc21@mjrc-server12 UnifiedGesture]$ python ./make_lmdb.py --base_path ./dataset/
Recording_002
(1612,) (3225, 363) (1, 6880000)
sys:1: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
Recording_006
(1582,) (3164, 363) (1, 6749867)
sys:1: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
The resulting lmdb files are shown below.
It runs normally. Check whether the lmdb files from step 3.4 were generated correctly, i.e., these two lines from the README:
python process_code.py
python ./make_lmdb.py --base_path ./dataset/
Step 3.4 Training diffusion model is correct, but after running python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion", the lmdb files under the _cache_WavLM_36_aux folder are not generated correctly.
Then the problem is in that later step.
Here is what I get after running python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion":
The lmdb_test folder looks similar.
INFO:root:Reading data '../dataset/all_lmdb_aux/lmdb_train/'...
INFO:WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True}
end2end.py:40: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
pose_resampling_fps=args.motion_resampling_framerate, model='WavLM_36_aux') # , model='Long_1200'
/ceph/hdd/yangsc21/Python/UnifiedGesture/diffusion_latent/data_loader/data_preprocessor.py:33: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
wav_input_16khz = torch.from_numpy(wav_input_16khz).to(device)
/ceph/hdd/yangsc21/Python/UnifiedGesture/diffusion_latent/data_loader/lmdb_data_loader.py:44: FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
data_sampler.run()
no. of samples: 1042
INFO:root:Reading data '../dataset/all_lmdb_aux/lmdb_test/'...
INFO:WavLM:WavLM Config: {'extractor_mode': 'layer_norm', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 1.0, 'normalize': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True}
end2end.py:47: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
pose_resampling_fps=args.motion_resampling_framerate, model='WavLM_36_aux') # , model='Long_1200'
no. of samples: 1063
INFO:root:len of train loader:4, len of test loader:4
USE WAVLM
TRANS_ENC init
EMBED STYLE BEGIN TOKEN
Cross Local Attention3
Starting epoch 0
0%| | 0/4 [00:00<?, ?it/s]
/ceph/hdd/yangsc21/miniconda3/envs/UnifiedGesture/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py:44: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
data = [self.dataset[idx] for idx in possibly_batched_index]
Logging to /tmp/openai-2024-01-15-19-38-15-574846
step[0]: loss[0.16063]
saving model...
75%|███████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:02<00:00, 1.66it/s]
75%|███████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:02<00:00, 1.17it/s]
Everything after that is fine.
My suggestion is this: delete the _cache_WavLM_36_aux folders under lmdb_train and lmdb_test, then rerun python end2end.py --config=./configs/all_data.yml --gpu 1 --save_dir "./result/my_diffusion". This step generates cache files for all the inputs; the first run is slow (roughly 20-30 minutes), but subsequent runs are fast. You currently have an incorrect 8 KB file, and my guess is that a previously mis-generated cache file is preventing regeneration.

By deleting the cache and changing (cuda:2 → cuda:0) it was generated correctly! But the epoch in --config=./configs/all_data.yml is 500, while my training looks like this: do I need to reach step=1000000?
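The cache-deletion step above can be scripted. A sketch (the folder names are the ones from this thread; `shutil.rmtree` is destructive, hence the existence check before removing anything):

```python
import os
import shutil

def drop_cache(base="../dataset/all_lmdb_aux"):
    """Remove stale _cache_WavLM_36_aux folders so end2end.py regenerates them."""
    removed = []
    for split in ("lmdb_train", "lmdb_test"):
        cache = os.path.join(base, split, "_cache_WavLM_36_aux")
        if os.path.isdir(cache):
            shutil.rmtree(cache)
            removed.append(cache)
    return removed
```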
That epoch setting doesn't seem to matter; I didn't set it. Yes, try the 1,000,000-step model. Congratulations on solving it; the code is admittedly a bit messy.
Here I trained to step=1419100, paused, and then ran the next step, but at # train reward model → python reward_model_policy.py an index-out-of-bounds error occurred. I added debug prints:
print(f"len(pre_expert_data): {len(pre_expert_data)}")
print(f"level_i: {level_i}, index_i: {index_i}")
The output is:
len(pre_expert_data): 10
level_i: 5, index_i: 14258
My understanding is this: level_i is 5 and index_i is 14258, which means the code accessed the 14259th trajectory (index_i 14258) of the 6th noise level (level_i 5, counting from 0), but pre_expert_data has only 10 noise levels with roughly 2,000 trajectories each. Does this mean the out-of-bounds error is caused by my step count being too small?
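To confirm exactly which bound is violated (the level or the trajectory index), a defensive accessor helps. A sketch using the thread's reported shapes as made-up test data; `safe_trajectory` is a name I've introduced, not a function in the repo:

```python
def safe_trajectory(pre_expert_data, level_i, index_i):
    """Fetch one trajectory, reporting exactly which bound was violated."""
    if level_i >= len(pre_expert_data):
        raise IndexError(f"level_i={level_i} but only {len(pre_expert_data)} noise levels")
    level = pre_expert_data[level_i]
    if index_i >= len(level):
        raise IndexError(f"index_i={index_i} but level {level_i} has {len(level)} trajectories")
    return level[index_i]

# the shapes reported in this thread: 10 noise levels x ~2000 trajectories
data = [list(range(2000)) for _ in range(10)]
try:
    safe_trajectory(data, 5, 14258)
except IndexError as e:
    print(e)  # index_i=14258 but level 5 has 2000 trajectories
```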
At this point you can try the sample from Quick start with your own trained model. As for this RL question about the fine-tuned results, let me ask @zerlinwang for help.
Requesting help: I swapped in my trained model at this step — cd ../retargeting/ python demo.py --target ZEGGS --input_file "../diffusion_latent/result_quick_start/Trinity/005_Neutral_4_x_1_0_minibatch1080[0, 0, 0, 0, 0, 3, 0]_123456_recon.npy" --ref_path './datasets/bvh2latent/ZEGGS/065_Speech_0_x_1_0.npy' --output_path '../result/inference/Trinity/' --cuda_device cuda:0 — and modified some code so that it iterates over all the .npy files under --input_file "../diffusion_latent/result_quick_start/Trinity/" and --ref_path './datasets/bvh2latent/ZEGGS/', but I got an error. Did I misunderstand? Isn't this how the sample in Quick start is supposed to be run?
Try to keep the naming consistent with Quick start; the retargeting code is admittedly a bit messy. The error shown here seems to be a dimension problem at line 124. Try leaving the code unmodified and driving it with subprocess.run instead.
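Driving demo.py over every .npy file in the folder without touching its code could look like this. A sketch: the paths and flags are copied from the command in this thread, while `build_cmd` and the constants are names I've introduced:

```python
import glob
import subprocess
import sys

INPUT_DIR = "../diffusion_latent/result_quick_start/Trinity/"
REF_PATH = "./datasets/bvh2latent/ZEGGS/065_Speech_0_x_1_0.npy"

def build_cmd(npy_file):
    """Assemble one demo.py invocation for a single input file."""
    return [sys.executable, "demo.py",
            "--target", "ZEGGS",
            "--input_file", npy_file,
            "--ref_path", REF_PATH,
            "--output_path", "../result/inference/Trinity/",
            "--cuda_device", "cuda:0"]

for npy in sorted(glob.glob(INPUT_DIR + "*.npy")):
    subprocess.run(build_cmd(npy), check=True)  # one demo.py run per file
```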
But that way I can only run one specified .npy file at a time. Is it enough to take these results and compare them against the results from before I swapped in my own trained model? I have one more question: after importing the resulting .bvh file into Blender, can I combine it with any .wav audio file and see the speech-driven effect?
To run inference on every file in the folder you may need to write a bit of code yourself; I don't remember whether a script for that already exists. Yes, the point is to see how your retrained model performs. And yes, Blender can do that; see here, around 1:14 there is the step of inserting the audio.
Hello, sorry for the wait; I just got back from the Spring Festival holiday. Could you tell me which parameters/models/steps you changed relative to the steps given in the README? That would help me reproduce your error.
I ran into some difficulties with train.py when training on the Trinity and ZEGGS datasets:
the loss value is always "nan", and I don't understand why.