FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
https://funaudiollm.github.io/
Apache License 2.0
6.47k stars, 698 forks

How can I improve the similarity of a cloned voice? #649

Closed: linweijiang closed this issue 1 week ago

linweijiang commented 1 week ago

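Scenario: a feature similar to 3s rapid cloning (3s极速复刻). The user provides a custom audio clip, the speaker embedding for that clip is extracted and saved as a .pt file, and that file is then used to offer a custom voice option.

Problem: when I generate audio with the speaker embedding obtained from the _extract_spk_embedding method in frontend.py, the generated voice differs noticeably from the original audio. It does not sound like the same speaker: the voice becomes much sharper and leans female.

Is there any way to improve this? Thanks.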

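import torch
import torchaudio.compliance.kaldi as kaldi

# Method of the frontend class in frontend.py (class body omitted)
def _extract_spk_embedding(self, speech):
    # 80-bin filterbank features; speech is treated as 16 kHz audio
    feat = kaldi.fbank(speech,
                       num_mel_bins=80,
                       dither=0,
                       sample_frequency=16000)
    # Subtract the per-bin mean over time (mean normalization)
    feat = feat - feat.mean(dim=0, keepdim=True)
    # Run the CAM++ ONNX session on a batch of one utterance
    input_name = self.campplus_session.get_inputs()[0].name
    embedding = self.campplus_session.run(
        None, {input_name: feat.unsqueeze(dim=0).cpu().numpy()})[0].flatten().tolist()
    # Wrap the flat list as a (1, embedding_dim) tensor on the target device
    embedding = torch.tensor([embedding]).to(self.device)
    return embedding
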
linweijiang commented 1 week ago

Follow the 3s rapid cloning interface instead: audio plus text is all it needs.
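
When comparing the embedding-only approach with the audio-plus-text approach, it may help to put a number on "similarity" rather than judging by ear alone. One common choice is the cosine similarity between the speaker embedding of the reference clip and that of the generated clip. The sketch below is only an illustration: cosine_similarity is a hypothetical helper, not part of CosyVoice, and it assumes the embeddings are plain Python lists of floats, such as the output of .flatten().tolist() in _extract_spk_embedding. The toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors.

    Hypothetical helper for illustration; not part of CosyVoice.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
reference = [1.0, 2.0, 3.0]
generated = [2.0, 4.0, 6.0]   # same direction as the reference
unrelated = [3.0, 0.0, -1.0]  # orthogonal to the reference

print(cosine_similarity(reference, generated))  # close to 1: same direction
print(cosine_similarity(reference, unrelated))  # 0: orthogonal vectors
```

In practice one would extract an embedding from the original clip and another from the synthesized clip with the same extractor, then compare the scores across the two approaches. A higher score suggests a closer match, though it is no substitute for listening.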