FunAudioLLM / CosyVoice

Multilingual large voice generation model, providing full-stack capabilities for inference, training, and deployment.
https://funaudiollm.github.io/
Apache License 2.0

Can I avoid re-cloning the voice on every request and instead call inference_sft with a spk_id? How do I map my own cloned voice onto that spk_id? #502

Open · hjj-lmx opened this issue 3 weeks ago

aluminumbox commented 3 weeks ago

Take a look at the logic of frontend_zero_shot; just save the embedding.
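Not part of the maintainer's reply, but a minimal sketch of that idea: run the zero-shot frontend once and persist what it extracted. The frontend_zero_shot signature and the keys of the returned dict vary between CosyVoice versions, so the names below (and the model path and filenames) are assumptions to check against your checkout.

```python
import torch
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')   # example model path
prompt_speech_16k = load_wav('my_voice.wav', 16000)         # your prompt audio

# Run the zero-shot frontend once; it returns a dict of tensors (speaker
# embeddings, prompt speech tokens/features, ...) alongside text fields.
model_input = cosyvoice.frontend.frontend_zero_shot(
    'any text', 'transcript of the prompt audio', prompt_speech_16k)

# Keep only the prompt/speaker-dependent fields and persist them, so later
# runs can skip the expensive extraction step.
spk_info = {k: v for k, v in model_input.items() if not k.startswith('text')}
torch.save(spk_info, 'my_spk_id.pt')
```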

hjj-lmx commented 3 weeks ago

> Take a look at the logic of frontend_zero_shot; just save the embedding.

How should the streaming return be written? This is what I have now:

```python
import numpy as np
import torch
import torchaudio

prompt_audio = (prompt_speech.numpy() * (2 ** 15)).astype(np.int16).tobytes()
prompt_speech_16k = torch.from_numpy(np.frombuffer(prompt_audio, dtype=np.int16).copy()).unsqueeze(dim=0)
prompt_speech_16k = prompt_speech_16k.float() / (2 ** 15)
for i, j in enumerate(clone_model.inference_cross_lingual(f'<|{lang}|>{doubao_content}', prompt_speech_16k, False, 1.0)):
    torchaudio.save('vctest_{}.wav'.format(i), j['tts_speech'], 22050)
```

If the False (the stream argument) is changed to True, how should the loop body be written? One more question: is inference_sft a bit faster than inference_cross_lingual?
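Not the maintainer's answer, but one common way to consume the stream=True case, as a sketch reusing the names from the snippet above (clone_model, lang, doubao_content, prompt_speech_16k). It assumes each yielded item carries a [1, T] audio chunk under the same 'tts_speech' key as in the non-streaming case, which may differ between versions:

```python
import torch
import torchaudio

chunks = []
for j in clone_model.inference_cross_lingual(
        f'<|{lang}|>{doubao_content}', prompt_speech_16k, True, 1.0):
    chunk = j['tts_speech']   # a short [1, T] audio chunk, available immediately
    # For a real streaming endpoint, convert and send each chunk here instead
    # of collecting, e.g.:
    #   yield (chunk.numpy() * (2 ** 15)).astype(np.int16).tobytes()
    chunks.append(chunk)

# Or stitch the chunks back together and save a single file:
torchaudio.save('vctest_stream.wav', torch.cat(chunks, dim=1), 22050)
```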

hjj-lmx commented 3 weeks ago

> Take a look at the logic of frontend_zero_shot; just save the embedding.

How should the embedding be saved, and where should it go?
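Not an official answer, but since the issue title asks about inference_sft with a spk_id: in the checkouts I've seen, the frontend loads speaker entries from spk2info.pt in the model directory, and frontend_sft looks up spk2info[spk_id]['embedding']. So one option is to register your own entry in that table. A sketch, assuming the private helper _extract_spk_embedding and the 'embedding' key exist in your version:

```python
import torch
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
prompt_speech_16k = load_wav('my_voice.wav', 16000)

# Compute the speaker embedding once and register it under a new spk_id.
embedding = cosyvoice.frontend._extract_spk_embedding(prompt_speech_16k)
cosyvoice.frontend.spk2info['my_voice'] = {'embedding': embedding}

# Persist the updated table so later runs can load it instead of re-cloning.
torch.save(cosyvoice.frontend.spk2info, 'spk2info_custom.pt')

# Later runs: restore the table and call inference_sft with the custom spk_id.
cosyvoice.frontend.spk2info = torch.load('spk2info_custom.pt')
for i, j in enumerate(cosyvoice.inference_sft('Hello world', 'my_voice')):
    pass  # j['tts_speech'] as usual
```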

smalldaidai commented 2 weeks ago

https://www.bilibili.com/video/BV1GS421R7f9/

maxusheng123 commented 1 week ago

> Take a look at the logic of frontend_zero_shot; just save the embedding.

I took a look: _extract_speech_token is where most of the time goes. Could the values it returns be saved locally, so each inference doesn't pay that cost again? (screenshot attached)
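A sketch of that caching idea, keyed off the observation above that _extract_speech_token depends only on the prompt audio. The helper name and its (token, token_len) return shape are taken from the frontend source and may differ in your version:

```python
import hashlib
import os
import torch

def cached_speech_token(frontend, prompt_speech_16k, cache_dir='token_cache'):
    """Reuse the expensive speech-tokenizer output across runs for the same prompt."""
    os.makedirs(cache_dir, exist_ok=True)
    # Key the cache on the raw prompt samples so different prompts don't collide.
    key = hashlib.md5(prompt_speech_16k.numpy().tobytes()).hexdigest()
    path = os.path.join(cache_dir, f'{key}.pt')
    if os.path.exists(path):
        return torch.load(path)
    token, token_len = frontend._extract_speech_token(prompt_speech_16k)
    torch.save((token, token_len), path)
    return token, token_len
```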