Open sunbin728 opened 8 months ago
I found that in the `infer_panel` method of `GPT_SoVITS/AR/models/t2s_model.py`, `samples[0, 0] == self.EOS` becomes true too early; I don't know why.
```python
if torch.argmax(logits, dim=-1)[0] == self.EOS or samples[0, 0] == self.EOS:
    # print(torch.argmax(logits, dim=-1)[0] == self.EOS, samples[0, 0] == self.EOS)
    stop = True
if stop:
    # if prompts.shape[1] == y.shape[1]:
    #     y = torch.concat([y, torch.zeros_like(samples)], dim=1)
    #     print("bad zero prediction")
    if y.shape[1] == 0:
        y = torch.concat([y, torch.zeros_like(samples)], dim=1)
        print("bad zero prediction")
    print(f"T2S Decoding EOS [{prefix_len} -> {y.shape[1]}]")
    break
```
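For context, here is a minimal toy sketch of the autoregressive loop around this stop check (my own simplification, not the repo's actual API: `toy_decode` and `sample_next` are placeholders, and the EOS id of 1024 is assumed from the `samples=tensor([[1024]])` log below). It shows why sampling EOS on the very first steps leaves `y` with essentially no newly generated tokens:

```python
EOS = 1024  # assumed EOS token id, matching the samples=[[1024]] seen in the logs

def toy_decode(sample_next, prompt, max_steps=1500):
    """Toy autoregressive loop mirroring infer_panel's stop logic (pure-Python sketch)."""
    y = list(prompt)
    prefix_len = len(prompt)
    for idx in range(max_steps):
        token = sample_next(y)
        if token == EOS:  # early EOS means idx stays tiny and y gains almost nothing
            print(f"T2S Decoding EOS [{prefix_len} -> {len(y)}]")
            return y, idx
        y.append(token)
    return y, max_steps

# A sampler that emits EOS immediately reproduces the reported near-zero idx:
y, idx = toy_decode(lambda y: EOS, prompt=[0] * 90)
```

With a well-trained model, EOS should only become likely after enough semantic tokens for the target text have been emitted; the logs in this issue show it winning the sampling step almost immediately.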
Here is the debug print I added:

```
2024-03-07 17:08:45.431 | DEBUG | AR.models.t2s_model:infer_panel:408 - idx=0, logits=tensor([[ -5.4335,  -9.7142,  -7.5205,  ...,  -9.5499, -12.1951,  -8.4418]], device='cuda:0'), samples=tensor([[536]], device='cuda:0', dtype=torch.int32)
2024-03-07 17:08:45.443 | DEBUG | AR.models.t2s_model:infer_panel:408 - idx=1, logits=tensor([[-2.6589, -3.9608, -5.4140,  ..., -1.9371, -6.5594, 18.1556]], device='cuda:0'), samples=tensor([[1024]], device='cuda:0', dtype=torch.int32)
T2S Decoding EOS [90 -> 92]
  0%|          | 1/1500 [00:00<00:40, 36.70it/s]
```
Can this be reproduced with the latest webui?
Yes, it is reproducible; I pulled the latest Docker image just a few days ago. To summarize the symptoms: when the returned idx is very small, the output has no sound; when the returned idx is 0, the returned audio is the reference audio itself.
```
T2S Decoding EOS [195 -> 242]
  3%|▎         | 46/1500 [00:00<00:23, 62.07it/s]
2024-03-08 01:32:55.244 | DEBUG | AR.models.t2s_model:infer_panel:455 - ref_free=False, idx=46   (returned idx = idx - 1 = 45)

T2S Decoding EOS [195 -> 198]
  0%|          | 2/1500 [00:00<00:36, 40.59it/s]
2024-03-08 01:31:43.438 | DEBUG | AR.models.t2s_model:infer_panel:455 - ref_free=False, idx=2    (returned idx = idx - 1 = 1)

T2S Decoding EOS [195 -> 197]
  0%|          | 1/1500 [00:00<00:49, 30.48it/s]
2024-03-08 01:36:19.400 | DEBUG | AR.models.t2s_model:infer_panel:455 - ref_free=False, idx=1    (returned idx = idx - 1 = 0)
```
The key reason for this problem is that the training material is not good enough.
The audio I trained on was recorded separately. It wasn't recorded in a professional studio, but the quality sounds good and there is no background noise.
Another observation: if I make the synthesis text longer, the problem usually goes away. For example, "今天天气怎么样?" ("How's the weather today?") very often triggers the problem, but repeating it a few times ("今天天气怎么样?今天天气怎么样?今天天气怎么样?") or prepending something ("下面是我需要合成的内容:今天天气怎么样?", i.e. "Here is the content I need synthesized: how's the weather today?") makes the problem mostly disappear.
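A practical workaround consistent with this observation (my own sketch, not code from the repo: `infer_once` and `min_idx` are hypothetical names standing in for a single `infer_panel` call and a threshold) is to retry inference when the returned idx is suspiciously small, keeping the longest attempt as a fallback:

```python
def infer_with_retry(infer_once, text, min_idx=10, max_tries=3):
    """Retry T2S inference when the returned idx is suspiciously small.

    infer_once(text) is assumed to return (pred_semantic, idx); both names
    are hypothetical stand-ins for the infer_panel call discussed in this issue.
    """
    best = None
    for _ in range(max_tries):
        pred_semantic, idx = infer_once(text)
        if idx >= min_idx:
            return pred_semantic, idx  # long enough: accept immediately
        if best is None or idx > best[1]:
            best = (pred_semantic, idx)
    return best  # all attempts were short: fall back to the longest one

# Toy demonstration: a flaky "model" that fails twice, then succeeds.
results = iter([("sem_a", 0), ("sem_b", 1), ("sem_c", 46)])
pred, idx = infer_with_retry(lambda text: next(results), "今天天气怎么样?")
```

Because decoding is sampled, a retry often escapes the early-EOS failure without padding the input text; the threshold would need tuning to the expected token count for the given text length.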
Do you also see dropped characters? That is, within a sentence, a few characters get skipped.
It might be related to training issues. I've encountered a similar issue and tweaking the training configuration seemed to help, though it introduced a new problem in the process. 😅
```shell
curl --location 'http://127.0.0.1:9880' \
  --header 'Content-Type: application/json' \
  --data '{
    "refer_wav_path": "output/xiaojun/1_0.wav",
    "prompt_text": "那为什么还有那么多商业领袖?对如此重要的用户体验问题视而不见呢?",
    "prompt_language": "zh",
    "text": "今天天气怎么样?",
    "text_language": "zh"
  }'
```
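The same request can be built from Python (a sketch assuming the api.py server above is listening on 127.0.0.1:9880; the commented-out POST requires the third-party `requests` package):

```python
import json

# Same request body as the curl call above.
payload = {
    "refer_wav_path": "output/xiaojun/1_0.wav",
    "prompt_text": "那为什么还有那么多商业领袖?对如此重要的用户体验问题视而不见呢?",
    "prompt_language": "zh",
    "text": "今天天气怎么样?",
    "text_language": "zh",
}
body = json.dumps(payload, ensure_ascii=False)

# To actually send it (requires `requests` and a running api.py server):
# import requests
# resp = requests.post("http://127.0.0.1:9880", json=payload)
# open("out.wav", "wb").write(resp.content)
```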
The relevant code in api.py's `get_tts_wav` method:

```python
pred_semantic, idx = t2s_model.model.infer_panel(
    all_phoneme_ids,
    all_phoneme_len,
    prompt,
    bert,
    prompt_phone_len=ph_offset,
)
# debug print I added for testing:
logger.debug(f"pred_semantic.shape={pred_semantic.shape}, idx={idx}")
```
Description: when running inference with a model I fine-tuned myself, if the input text is relatively short, I frequently get a very small idx or idx=0, so the generated speech either contains no sound or turns into the reference audio.
Case 1 print:

```
2024-03-07 16:20:53.299 | DEBUG | __main__:get_tts_wav:466 - pred_semantic.shape=torch.Size([1, 205]), idx=9
```

Case 2 print:

```
2024-03-07 16:26:32.582 | DEBUG | __main__:get_tts_wav:466 - pred_semantic.shape=torch.Size([1, 196]), idx=0
```

When idx=0, the reference audio was returned.
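One plausible mechanism for the idx=0 case (my reading of the inference code, worth verifying against the repo): `get_tts_wav` trims the prompt off the result with a negative slice along the lines of `pred_semantic[:, -idx:]`, and in Python `-0` is just `0`, so with idx=0 the slice keeps the entire sequence, prompt tokens included, which then decodes back to the reference audio. A plain-list sketch of that slicing behavior:

```python
# Hypothetical trim mirroring how idx is used after infer_panel (not verbatim repo code).
pred_semantic = list(range(196))  # 195 prompt tokens + 1 token, flattened for the sketch

idx = 46
tail = pred_semantic[-idx:]       # last 46 tokens: the newly generated part only
assert len(tail) == 46

idx = 0
tail = pred_semantic[-idx:]       # -0 == 0, so this keeps ALL 196 tokens,
assert len(tail) == 196           # prompt included -> decoded audio = reference audio
```

This would explain why a tiny idx gives near-silence (almost no new tokens survive the trim) while idx=0 reproduces the reference audio exactly.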