Open LayBrick opened 3 months ago
In spk2info.pt, the embedding is the average of the utterance embeddings; see tools/extract_embedding.py for how to extract a speaker embedding. The speech_token is extracted from a relatively good-quality audio clip of the speaker, but it is not used in sft inference mode, so you can ignore it.
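The averaging described above can be sketched as follows. This is a minimal illustration, assuming the per-utterance embeddings are equal-length 1-D torch tensors; the embedding dimension, speaker key, and dict layout of spk2info.pt used here are placeholders, not the repo's exact schema — inspect the checkpoint you are editing first.

```python
import torch

# Hypothetical per-utterance speaker embeddings for one speaker,
# e.g. as extracted by tools/extract_embedding.py (192-dim assumed).
utt_embeddings = [torch.randn(192) for _ in range(5)]

# The speaker-level embedding is the mean over the utterance embeddings.
spk_embedding = torch.stack(utt_embeddings).mean(dim=0)

# Rewrite spk2info.pt with the averaged embedding. The key names below
# are assumptions for illustration only.
spk2info = {"my_speaker": {"embedding": spk_embedding}}
torch.save(spk2info, "spk2info.pt")
```

In sft inference mode only this averaged embedding is looked up, which is why the stored speech_token can be ignored there.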
Is rewriting spk2info.pt useful for solving the problem of overly short pauses in the speech? How does this work?
With the pretrained sft model I tried using spk_embedding + speech_token & speech_feat, but in testing the llm's stability dropped: omitted and garbled output became more likely. Without speech_token & speech_feat it is actually more stable. I still don't quite understand the reason behind this; could you explain? Thanks!
After sft finetuning you are inferencing with this speaker, so use his spk embedding during inference. Check inference_sft; this inference mode is more compatible with sft training.
I heard that extracting voice features and rewriting spk2info.pt can solve the problem of overly short pauses in the speech. What is the principle behind this? When saving spk2info.pt, do I only need to average the embeddings of multiple audio clips? Should the speech_tokens of the multiple clips be concatenated? And how should speech_feat be handled?