Training problem with PhoneticXvector

CornellYu commented 3 years ago

Hi Zhao Miao, I am using your runPhoneticXvector.sh training script, with the network part modified by me. Training has 89 iterations in total, but it failed at iteration 28. The failure looks similar to the case discussed on the forum at https://groups.google.com/g/kaldi-help/c/F7cud3lbDMo/m/VuNDG-qRBgAJ. My error messages are:

WARNING (nnet3-train[5.5]:ConstrainOrthonormalInternal():nnet-utils.cc:1055) Ratio is nan (should be >= 1.0); component is tdnnf10.linear

ASSERTION_FAILED (nnet3-train[5.5]:ConstrainOrthonormalInternal():nnet-utils.cc:1057) Assertion failed: (ratio > 0.9)

Snowdar commented 3 years ago

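Hi, judging from your log, you have hit a NaN problem. Training a tdnnf network includes orthonormal-constraint steps, so the training process can be somewhat unstable. You should check your features and hyperparameters. For example, Daniel Povey suggested on the forum that reducing the learning rate helps stabilize training. If that still does not work, I would also suggest finding out what the ratio in the log means, which may help you track down the cause.

Best wishes!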
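To make the meaning of the ratio in the log more concrete, here is a minimal sketch in plain Python. It assumes the ratio is computed as n · tr(P Pᵀ) / tr(P)², where P = M Mᵀ, M is the weight matrix of the constrained component, and n is its number of rows. That formula is a reading of Kaldi's nnet-utils.cc and should be checked against your own Kaldi version; the function name `orthonormal_ratio` is made up for illustration.

```python
import math


def orthonormal_ratio(m):
    """Return n * tr(P P^T) / tr(P)^2, with P = M M^T and n = number of rows of M.

    Assumed to mirror the quantity Kaldi reports as "Ratio"; this is an
    illustrative sketch, not Kaldi's code.
    """
    n = len(m)
    # P = M M^T: the n x n Gram matrix of the rows of M.
    p = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in m] for row_i in m]
    trace_p = sum(p[i][i] for i in range(n))
    # P is symmetric, so tr(P P^T) is the sum of its squared entries.
    trace_pp = sum(v * v for row in p for v in row)
    return n * trace_pp / (trace_p * trace_p)


# Rows that are orthogonal with equal norm give exactly 1.0, at any scale.
semi_orthogonal = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
# Identical rows are as far from orthogonal as possible.
degenerate = [[1.0, 0.0], [1.0, 0.0]]
# A single NaN weight makes the whole ratio NaN.
corrupted = [[float("nan"), 0.0], [0.0, 1.0]]

print(orthonormal_ratio(semi_orthogonal))
print(orthonormal_ratio(degenerate))
print(math.isnan(orthonormal_ratio(corrupted)))
```

Under this assumption the ratio is 1.0 when M is a scaled semi-orthogonal matrix and grows as the rows become correlated. A NaN ratio means the parameters already contained NaN or infinite values before the constraint step ran, and since any comparison with NaN is false, the check `ratio > 0.9` fails. That would point to a divergence earlier in training (features, learning rate) rather than to the constraint itself.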

On Dec 2, 2020, at 1:36 PM, CornellYu wrote the original post quoted above.

CornellYu commented 3 years ago

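Hi Zhao Miao, following Dan's suggestion I have changed my learning rate back to 0.001 to 0.0001. That should not be too large, yet the problem still occurs. The tdnnf layers belong to my ASR model, that is, the auxiliary network; my xvector network uses only tdnn layers. One idea: could I set the LearningRate of each layer to 0 so that it no longer learns its parameters? Might that solve the problem? Thanks.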
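The idea of freezing layers by giving them a learning rate of 0 can be sketched with a plain SGD update. This is a generic illustration in Python, not Kaldi code: the function `sgd_step`, the per-layer factor table, and the layer name `xvector.tdnn1` are hypothetical, and only `tdnnf10.linear` comes from the log above.

```python
def sgd_step(params, grads, base_lr, lr_factors):
    """One plain SGD step with a per-layer learning-rate factor.

    Layers missing from lr_factors use a factor of 1.0; a factor of 0.0
    leaves that layer's parameters unchanged.
    """
    updated = {}
    for name, weights in params.items():
        lr = base_lr * lr_factors.get(name, 1.0)
        updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated


# Hypothetical parameters and gradients for one auxiliary (ASR) layer
# and one x-vector layer.
params = {"tdnnf10.linear": [0.5, -0.2], "xvector.tdnn1": [0.1, 0.3]}
grads = {"tdnnf10.linear": [1.0, 1.0], "xvector.tdnn1": [1.0, -1.0]}

# Freeze the auxiliary layer only; the x-vector layer keeps learning.
lr_factors = {"tdnnf10.linear": 0.0}

new_params = sgd_step(params, grads, base_lr=0.001, lr_factors=lr_factors)
print(new_params["tdnnf10.linear"])
print(new_params["xvector.tdnn1"])
```

With a factor of 0.0 the frozen layer keeps its weights exactly, while the other layer still moves against its gradient. Whether a frozen component is also skipped by the orthonormal-constraint step in Kaldi is not shown here and would need to be checked in the Kaldi source.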

CornellYu commented 3 years ago

Hi Zhao Miao, following the idea above, the network is now training and no problems have appeared so far. Thanks, and best wishes.

Snowdar / asv-subtools

Training problem with PhoneticXvector #15