FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing full-stack support for inference, training, and deployment.
https://funaudiollm.github.io/
Apache License 2.0

Rewrite the spk2info.pt file to fix short mid-sentence pauses #169

Open LayBrick opened 3 months ago

LayBrick commented 3 months ago

I heard that the problem of overly short pauses in speech can be fixed by extracting voice features and rewriting spk2info.pt. What is the principle behind this? When saving spk2info.pt, is it enough to just average the embeddings of multiple audio clips? Are the speech_tokens of multiple clips concatenated? And how should speech_feat be handled?

aluminumbox commented 3 months ago

In spk2info.pt, the embedding is the average of the utterance embeddings; see tools/extract_embedding.py for how to extract the speaker embedding. The speech_token is extracted from relatively good-quality audio of the speaker, but it is not used in sft inference mode, so you can ignore it.
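
To make the averaging concrete, here is a minimal sketch of building a speaker entry from several per-utterance embeddings and saving it to spk2info.pt. The function name, dictionary keys ('embedding'), and tensor shapes are assumptions for illustration only; the actual extraction pipeline is in tools/extract_embedding.py.

```python
# Hypothetical sketch: average per-utterance embeddings into one speaker
# embedding and store it in spk2info.pt. Keys and shapes are assumptions;
# see tools/extract_embedding.py for the real pipeline.
import torch

def build_spk2info(spk_id: str, utt_embeddings: list[torch.Tensor],
                   out_path: str = "spk2info.pt") -> dict:
    # Stack the per-utterance embeddings (each assumed shape [1, dim])
    # and average them into a single speaker-level embedding of shape [1, dim].
    spk_embedding = torch.stack([e.squeeze(0) for e in utt_embeddings]).mean(dim=0, keepdim=True)

    # In sft inference mode only the averaged embedding matters; speech_token
    # and speech_feat are ignored there (per the reply above), so they are
    # omitted in this sketch.
    spk2info = {spk_id: {"embedding": spk_embedding}}
    torch.save(spk2info, out_path)
    return spk2info
```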

LayBrick commented 3 months ago

Is rewriting spk2info.pt useful for solving the short-pause problem? How does it work?

jupinter commented 2 months ago

With the pretrained sft model I tried using spk_embedding + speech_token & speech_feat, but in testing the LLM's stability dropped and it tended to skip or jumble words; without speech_token & speech_feat it was actually more stable. I still don't quite understand the reason behind this. Could you help explain? Thanks!

aluminumbox commented 2 months ago

After sft finetuning, you are inferencing with this speaker and its spk embedding, so use that speaker's spk embedding during inference. Check inference_sft; this inference mode is more compatible with sft training.
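
For reference, a minimal usage sketch of the sft inference mode with a speaker id registered in spk2info.pt, following the pattern in the repository README. The model path, text, and speaker id are placeholders, and the exact return format of inference_sft may differ between versions.

```python
# Sketch of sft-mode inference: synthesize with a speaker id from spk2info.pt.
# Model path, text, and speaker id are placeholders; adjust to your setup.
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
print(cosyvoice.list_avaliable_spks())  # speakers available in spk2info.pt

# inference_sft takes the text to synthesize and a registered speaker id;
# here it is assumed to return a dict containing the waveform under 'tts_speech'.
output = cosyvoice.inference_sft('Hello, this is a test sentence.', '中文女')
torchaudio.save('sft.wav', output['tts_speech'], 22050)
```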