RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License
35.67k stars 4.07k forks

During inference, when text is relatively short, idx is often very small or 0, so the generated speech is either silent or just the reference audio. #701

Open sunbin728 opened 8 months ago

sunbin728 commented 8 months ago

curl --location 'http://127.0.0.1:9880' \
  --header 'Content-Type: application/json' \
  --data '{
    "refer_wav_path": "output/xiaojun/1_0.wav",
    "prompt_text": "那为什么还有那么多商业领袖?对如此重要的用户体验问题视而不见呢?",
    "prompt_language": "zh",
    "text": "今天天气怎么样?",
    "text_language": "zh"
  }'

Code from the get_tts_wav function in api.py:

        pred_semantic, idx = t2s_model.model.infer_panel(
            all_phoneme_ids,
            all_phoneme_len,
            prompt,
            bert,
            prompt_phone_len=ph_offset,
            top_k=config['inference']['top_k'],
            early_stop_num=hz * max_sec)

Debug print I added:

logger.debug(f"pred_semantic.shape={pred_semantic.shape}, idx={idx}")

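Description: when running inference with my own fine-tuned model and the text is relatively short, idx is often very small or 0, so the generated speech is either silent or identical to the reference audio.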

Case 1 log:
2024-03-07 16:20:53.299 | DEBUG | main:get_tts_wav:466 - pred_semantic.shape=torch.Size([1, 205]), idx=9

Case 2 log:
2024-03-07 16:26:32.582 | DEBUG | main:get_tts_wav:466 - pred_semantic.shape=torch.Size([1, 196]), idx=0

When idx=0, the reference audio is returned.
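One possible explanation for the idx=0 case, sketched under an assumption: if the caller keeps the newly generated tokens with a negative slice such as `pred_semantic[:, -idx:]` (assumed here; the slicing code is not shown in this thread), then idx=0 turns the slice into `[:, 0:]` and keeps the whole sequence, prompt tokens included. The shapes in the two logs are consistent with that reading, since 205 - 9 = 196. The sketch uses plain Python lists in place of tensors, and `take_generated` and `take_generated_safe` are hypothetical names.

```python
def take_generated(tokens, idx):
    """Keep the last `idx` tokens with a negative slice (assumed caller behavior)."""
    return tokens[-idx:]


def take_generated_safe(tokens, idx):
    """Keep the last `idx` tokens, returning an empty list when idx is 0."""
    return tokens[len(tokens) - idx:]


# 196 prompt tokens followed by 9 generated ones, mirroring the shapes in the logs.
prompt = ["p"] * 196
generated = ["g"] * 9

ok = take_generated(prompt + generated, 9)   # only the 9 generated tokens
bad = take_generated(prompt, 0)              # -0 == 0, so this is tokens[0:]
fixed = take_generated_safe(prompt, 0)       # nothing was generated

print(len(ok), len(bad), len(fixed))  # prints: 9 196 0
```

If the real code slices this way, an explicit check for idx == 0 before slicing would at least turn "returns the reference audio" into a detectable failure.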

sunbin728 commented 8 months ago

I found that in the infer_panel method of GPT_SoVITS/AR/models/t2s_model.py, the condition samples[0, 0] == self.EOS becomes true too early. I don't know why.

        if torch.argmax(logits, dim=-1)[0] == self.EOS or samples[0, 0] == self.EOS:
            # print(torch.argmax(logits, dim=-1)[0] == self.EOS, samples[0, 0] == self.EOS)
            stop = True
        if stop:
            # if prompts.shape[1] == y.shape[1]:
            #     y = torch.concat([y, torch.zeros_like(samples)], dim=1)
            #     print("bad zero prediction")
            if y.shape[1]==0:
                y = torch.concat([y, torch.zeros_like(samples)], dim=1)
                print("bad zero prediction")
            print(f"T2S Decoding EOS [{prefix_len} -> {y.shape[1]}]")
            break

Here is the output of the prints I added:

2024-03-07 17:08:45.431 | DEBUG | AR.models.t2s_model:infer_panel:408 - idx=0, logits=tensor([[ -5.4335, -9.7142, -7.5205, ..., -9.5499, -12.1951, -8.4418]], device='cuda:0'), samples=tensor([[536]], device='cuda:0', dtype=torch.int32)
2024-03-07 17:08:45.443 | DEBUG | AR.models.t2s_model:infer_panel:408 - idx=1, logits=tensor([[-2.6589, -3.9608, -5.4140, ..., -1.9371, -6.5594, 18.1556]], device='cuda:0'), samples=tensor([[1024]], device='cuda:0', dtype=torch.int32)
T2S Decoding EOS [90 -> 92]
0%|▏ | 1/1500 [00:00<00:40, 36.70it/s]
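In the idx=1 log line, the last logit (18.1556) dominates and token 1024 is sampled, which triggers the EOS branch. A common mitigation in autoregressive decoders is to block EOS for the first few steps. The following is a minimal sketch of that idea, not the project's code: `suppress_early_eos`, `greedy_decode`, `min_steps`, and the toy logits are all hypothetical, and it uses greedy selection on plain lists where the real method samples from tensors.

```python
import math


def suppress_early_eos(logits, step, eos_id, min_steps):
    """Return a copy of `logits` with the EOS entry set to -inf while step < min_steps."""
    out = list(logits)
    if step < min_steps:
        out[eos_id] = -math.inf
    return out


def greedy_decode(step_logits, eos_id, min_steps):
    """Greedy decoding over precomputed per-step logits; stops when EOS is chosen."""
    tokens = []
    for step, logits in enumerate(step_logits):
        masked = suppress_early_eos(logits, step, eos_id, min_steps)
        token = max(range(len(masked)), key=masked.__getitem__)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens


# Toy vocabulary of 4 ids where id 3 plays the role of EOS.
EOS = 3
toy_logits = [
    [0.1, 2.0, 0.3, 9.0],  # EOS would win at step 0
    [1.5, 0.2, 0.1, 8.0],  # and again at step 1
    [0.2, 0.1, 3.0, 1.0],  # an ordinary token wins
    [0.0, 0.0, 0.0, 5.0],  # EOS wins and is now allowed
]

print(greedy_decode(toy_logits, EOS, min_steps=0))  # prints: []
print(greedy_decode(toy_logits, EOS, min_steps=2))  # prints: [1, 0, 2]
```

Blocking EOS only hides the symptom. It does not explain why the model assigns EOS such a high score so early for short text.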

KamioRinn commented 8 months ago

Can this be reproduced with the latest webui?

sunbin728 commented 8 months ago

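Yes, it reproduces. I pulled the latest docker image a few days ago. To summarize what I see: when the returned idx is very small, there is no sound; when the returned idx is 0, the returned audio is the reference audio.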

Example 1: normal inference

[Screenshot: 微信图片_20240308093347]

T2S Decoding EOS [195 -> 242]
3%|███████ | 46/1500 [00:00<00:23, 62.07it/s]
2024-03-08 01:32:55.244 | DEBUG | AR.models.t2s_model:infer_panel:455 - ref_free=False, idx=46
Returned idx = idx - 1 = 45

Example 2: inference produces no sound

[Screenshot: 微信截图_20240308093412]

T2S Decoding EOS [195 -> 198]
0%|▎ | 2/1500 [00:00<00:36, 40.59it/s]
2024-03-08 01:31:43.438 | DEBUG | AR.models.t2s_model:infer_panel:455 - ref_free=False, idx=2
Returned idx = idx - 1 = 1

Example 3: inference returns the reference audio

[Screenshot: 微信截图_20240308093643]

T2S Decoding EOS [195 -> 197]
0%|▏ | 1/1500 [00:00<00:49, 30.48it/s]
2024-03-08 01:36:19.400 | DEBUG | AR.models.t2s_model:infer_panel:455 - ref_free=False, idx=1
Returned idx = idx - 1 = 0

k18coldwl commented 8 months ago

The key reason for this problem is that the training data is not good enough.

sunbin728 commented 8 months ago

The audio I trained on was recorded separately. It was not recorded in a professional studio, but it sounds quite good and has no noise.

sunbin728 commented 8 months ago

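Another observation: if I make the text to synthesize longer, the problem usually does not occur. For example, "今天天气怎么样?" ("How is the weather today?") very likely triggers the problem. Repeating it several times, as in "今天天气怎么样?今天天气怎么样?今天天气怎么样?", or adding a prefix, as in "下面是我需要合成的内容:今天天气怎么样?" ("Here is the content I need synthesized: How is the weather today?"), usually avoids it.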
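The lengthening workaround described here could be sketched as a small helper that prepends a carrier phrase when the text is below a length threshold. `pad_short_text`, `min_chars`, and the threshold of 12 are hypothetical; the carrier phrase is the one quoted in the comment. The carrier phrase would also be spoken, so the resulting audio would need trimming afterwards.

```python
def pad_short_text(text, min_chars=12, carrier="下面是我需要合成的内容:"):
    """Prepend `carrier` when `text` has fewer than `min_chars` characters."""
    if len(text) < min_chars:
        return carrier + text
    return text


short = "今天天气怎么样?"
long_text = "今天天气怎么样?今天天气怎么样?今天天气怎么样?"

print(pad_short_text(short))      # carrier phrase is prepended
print(pad_short_text(long_text))  # returned unchanged
```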

linrb685 commented 8 months ago

Have you run into dropped characters? That is, a few characters in a sentence get skipped.

BankNatchapol commented 8 months ago

It might be related to training issues. I've encountered a similar issue and tweaking the training configuration seemed to help, though it introduced a new problem in the process. 😅