2noise / ChatTTS

A generative speech model for daily dialogue.
https://2noise.com
GNU Affero General Public License v3.0

New version: model loading and custom voice .pt file not working #704

Open LivinLuo1993 opened 3 weeks ago

LivinLuo1993 commented 3 weeks ago

When using the new version, I ran into two problems:

  1. Loading the model fails

chat.load(source="custom", custom_path=MODEL_PATH, device='cpu', compile=False)

fails with the assertion:

assert self.has_loaded(use_decoder=use_decoder)

  2. A custom voice .pt file has no effect

When I load the model with the following code:

chat._load(
    vocos_config_path=f'{MODEL_PATH}/config/vocos.yaml',
    vocos_ckpt_path=f'{MODEL_PATH}/asset/Vocos.pt',
    dvae_config_path=f'{MODEL_PATH}/config/dvae.yaml',
    dvae_ckpt_path=f'{MODEL_PATH}/asset/DVAE.pt',
    gpt_config_path=f'{MODEL_PATH}/config/gpt.yaml',
    gpt_ckpt_path=f'{MODEL_PATH}/asset/GPT.pt',
    decoder_config_path=f'{MODEL_PATH}/config/decoder.yaml',
    decoder_ckpt_path=f'{MODEL_PATH}/asset/Decoder.pt',
    tokenizer_path=f'{MODEL_PATH}/asset/tokenizer.pt',
    compile=False,
)

spk_stat = torch.load(f'{MODEL_PATH}/speaker/seed_1397_restored_emb.pt', map_location=torch.device("cpu"))

rand_spk = chat._encode_spk_emb(spk_stat)

wav = chat.infer(
    input_text,
    skip_refine_text=True,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
    use_decoder=True,
    do_text_normalization=False,
)

the output audio is identical to what rand_spk = chat.sample_random_speaker() produces, so the custom embedding seems to be ignored.
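As a sanity check, the embedding file itself can be verified independently of ChatTTS. A minimal sketch (pure torch, with a random tensor and a hypothetical embedding size of 768 standing in for seed_1397_restored_emb.pt) confirms that a saved speaker embedding round-trips through torch.save/torch.load unchanged:

```python
import os
import tempfile

import torch

# Stand-in for a saved speaker embedding such as seed_1397_restored_emb.pt;
# 768 is a hypothetical embedding size used only for this check.
emb = torch.randn(768)

path = os.path.join(tempfile.mkdtemp(), "spk.pt")
torch.save(emb, path)

# weights_only=True is the safer loading mode in recent torch releases
loaded = torch.load(path, map_location="cpu", weights_only=True)
assert torch.equal(emb, loaded)
```

If this passes for your file, the embedding loads fine and the problem lies in how it is handed to infer, not in the .pt itself.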

fumiama commented 3 weeks ago

This is now addressed in the dev branch.

You can refer to the HuggingFace repo for more details.

goshut commented 2 weeks ago

@LivinLuo1993 You never passed the speaker embedding as a parameter to infer()..

import torch

import ChatTTS
from ChatTTS.model import Tokenizer

'''some code omitted'''

spk = torch.load('声模.pt',  # the saved speaker-embedding file
                 map_location='cpu',
                 weights_only=True,
                 )
spk = Tokenizer._encode_spk_emb(spk)  # encode the embedding as a string
params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=spk,
)
wavs = chat.infer(texts, params_infer_code=params_infer_code)

developer-yl commented 2 weeks ago

I am having the same issue when I try to capture an audio sample and use it as a fixed speaker. The voice stays the same for the first three generations but then suddenly changes from a female to a male voice.
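If the goal is a voice that stays fixed across runs, one general torch technique (not a ChatTTS-specific API; the seed value here is arbitrary) is to fix the RNG seed before the speaker is sampled, which makes any RNG-based draw such as sample_random_speaker() reproducible:

```python
import torch


def sample_embedding(seed: int, dim: int = 768) -> torch.Tensor:
    # Stand-in for chat.sample_random_speaker(): any torch RNG draw
    # becomes reproducible once the global seed is fixed first.
    torch.manual_seed(seed)
    return torch.randn(dim)


same_a = sample_embedding(1397)
same_b = sample_embedding(1397)
assert torch.equal(same_a, same_b)  # same seed -> identical embedding
```

This does not explain the mid-session voice drift by itself, but it rules out RNG state as the variable when reproducing the problem.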

from tools.audio import load_audio

spk_smp = chat.sample_audio_speaker(torch.tensor(load_audio("sample.mp3", 24000)).to('cuda'))
params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,
    txt_smp="hands off the keyboard",
    prompt="[speed_5]",
    temperature=0.01,
    top_P=0.01,  # top-p (nucleus) decoding
    top_K=1,
)
text = "You really think that's real [laugh]"
refined_text = chat.infer(text, refine_text_only=True)
print(refined_text)
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_1][laugh_1][break_4]',
)

wav = chat.infer(
    refined_text,
    # text,
    params_refine_text=params_refine_text,
    skip_refine_text=True,
    params_infer_code=params_infer_code,
)
from IPython.display import Audio  # needed when running outside a notebook cell that already imports it
Audio(wav[0], rate=24_000, autoplay=True)
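If you want the result on disk rather than played inline (assuming wav[0] from chat.infer is a 1-D sequence of float samples in [-1, 1] at 24 kHz, which is what the Audio call above implies), a stdlib-only sketch for writing it as a 16-bit PCM WAV:

```python
import struct
import wave


def write_wav(path, samples, rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
        f.setframerate(rate)
        f.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))


# Tiny illustrative input; in practice pass wav[0] here.
write_wav("out.wav", [0.0, 0.5, -0.5])
```

Libraries like torchaudio or soundfile do the same with less code, but the wave module avoids any extra dependency.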