FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing full-stack inference, training, and deployment capability.
https://funaudiollm.github.io/
Apache License 2.0

Using an Instruct model for inference without embeddings, how can the speaker be specified? #671

Open 0xCAFEBABE0 opened 2 days ago

0xCAFEBABE0 commented 2 days ago

When running inference with the Instruct model, I want the output audio to use a fixed voice (timbre), but in practice it drifts, and the audio sometimes mixes male and female voices.

  1. When using the Instruct model for inference there is no embedding, so how can the speaker timbre be specified?
  2. If the timbre is described via prompt_text, how should a custom timbre be described?
zzchust commented 1 day ago

+1

kingzcheung commented 1 day ago

Save the speaker timbre using the approach in https://github.com/FunAudioLLM/CosyVoice/issues/604

Key code:


    data = load_spk_from_wav(prompt_wav_upload, cosyvoice)
    torch.save(data, f'speakers/{spk_name}.pt')
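The loading side is not shown above. A minimal sketch of what `load_spk_from_pt` could look like, assuming it simply reads back the dict written by `torch.save` (the helper name follows issue #604; the exact implementation there may differ):

```python
import os
import torch

def load_spk_from_pt(spk_id, spk_dir='speakers'):
    # Hypothetical loader: read back the speaker dict (embedding, etc.)
    # that was previously written with torch.save(data, 'speakers/<spk_id>.pt').
    return torch.load(os.path.join(spk_dir, f'{spk_id}.pt'), map_location='cpu')
```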

In frontend.py, modify the frontend_sft method:

    def frontend_sft(self, tts_text, spk_id):
        tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
        # embedding = self.spk2info[spk_id]['embedding']
        # load the speaker timbre embedding from the saved .pt file instead
        embedding = load_spk_from_pt(spk_id)['embedding']
        model_input = {'text': tts_text_token, 'text_len': tts_text_token_len, 'llm_embedding': embedding, 'flow_embedding': embedding}
        return model_input

Note that this method is also used by other inference modes, so adapt the change to your setup.
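Because other modes call the same method, unconditionally loading from a .pt file would break the built-in SFT speakers. One defensive sketch (not CosyVoice's actual code; the helper name and directory layout are assumptions) is to resolve the embedding with a fallback, and call this from frontend_sft:

```python
import os
import torch

def resolve_embedding(spk_id, spk2info, spk_dir='speakers'):
    # Hypothetical helper: prefer a custom timbre saved as speakers/<spk_id>.pt,
    # otherwise fall back to the original spk2info table so the
    # pretrained built-in speakers keep working in every mode.
    pt_path = os.path.join(spk_dir, f'{spk_id}.pt')
    if os.path.exists(pt_path):
        return torch.load(pt_path, map_location='cpu')['embedding']
    return spk2info[spk_id]['embedding']
```

Inside frontend_sft this replaces the single `embedding = ...` line, e.g. `embedding = resolve_embedding(spk_id, self.spk2info)`.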