RVC-Boss / GPT-SoVITS

1 minute of voice data can also be used to train a good TTS model! (few-shot voice cloning)
MIT License
32.45k stars 3.74k forks

size mismatch in 3-get-semantic.py #774

Open GeleiaComPepino opened 6 months ago

GeleiaComPepino commented 6 months ago

I added a new IPA dictionary for Portuguese in symbols.py and added portuguese.py to process the text into IPA, but I'm receiving this error (the log below was printed twice, once per worker process):

```
"/usr/bin/python3" GPT_SoVITS/prepare_datasets/3-get-semantic.py
['!', ',', '-', '.', '55', '?', 'AA', 'AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0', 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH', 'D', 'DH', 'E1', 'E2', 'E3', 'E4', 'E5', 'EE', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1', 'EY2', 'En1', 'En2', 'En3', 'En4', 'En5', 'F', 'G', 'HH', 'I', 'IH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OO', 'OW0', 'OW1', 'OW2', 'OY0', 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'SP', 'SP2', 'SP3', 'T', 'TH', 'U', 'UH0', 'UH1', 'UH2', 'UNK', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH', '_', 'a', 'a1', 'a2', 'a3', 'a4', 'a5', 'ai1', 'ai2', 'ai3', 'ai4', 'ai5', 'an1', 'an2', 'an3', 'an4', 'an5', 'ang1', 'ang2', 'ang3', 'ang4', 'ang5', 'ao1', 'ao2', 'ao3', 'ao4', 'ao5', 'b', 'by', 'c', 'ch', 'cl', 'd', 'dy', 'e', 'e1', 'e2', 'e3', 'e4', 'e5', 'ei1', 'ei2', 'ei3', 'ei4', 'ei5', 'en1', 'en2', 'en3', 'en4', 'en5', 'eng1', 'eng2', 'eng3', 'eng4', 'eng5', 'er1', 'er2', 'er3', 'er4', 'er5', 'f', 'g', 'gy', 'h', 'hy', 'i', 'i01', 'i02', 'i03', 'i04', 'i05', 'i1', 'i2', 'i3', 'i4', 'i5', 'ia1', 'ia2', 'ia3', 'ia4', 'ia5', 'ian1', 'ian2', 'ian3', 'ian4', 'ian5', 'iang1', 'iang2', 'iang3', 'iang4', 'iang5', 'iao1', 'iao2', 'iao3', 'iao4', 'iao5', 'ie1', 'ie2', 'ie3', 'ie4', 'ie5', 'in1', 'in2', 'in3', 'in4', 'in5', 'ing1', 'ing2', 'ing3', 'ing4', 'ing5', 'iong1', 'iong2', 'iong3', 'iong4', 'iong5', 'ir1', 'ir2', 'ir3', 'ir4', 'ir5', 'iu1', 'iu2', 'iu3', 'iu4', 'iu5', 'j', 'k', 'ky', 'l', 'm', 'my', 'n', 'ny', 'o', 'o1', 'o2', 'o3', 'o4', 'o5', 'ong1', 'ong2', 'ong3', 'ong4', 'ong5', 'ou1', 'ou2', 'ou3', 'ou4', 'ou5', 'p', 'py', 'q', 'r', 'ry', 's', 'sh', 't', 'ts', 'u', 'u1', 'u2', 'u3', 'u4', 'u5', 'ua1', 'ua2', 'ua3', 'ua4', 'ua5', 'uai1', 'uai2', 'uai3', 'uai4', 'uai5', 'uan1', 'uan2', 'uan3', 'uan4', 'uan5', 'uang1', 'uang2', 'uang3', 'uang4', 'uang5', 'ui1', 'ui2', 'ui3', 'ui4', 'ui5', 'un1', 'un2', 'un3', 'un4', 'un5', 'uo1', 'uo2', 'uo3', 'uo4', 'uo5', 'v', 'v1', 'v2', 'v3', 'v4', 'v5', 'van1', 'van2', 'van3', 'van4', 'van5', 've1', 've2', 've3', 've4', 've5', 'vn1', 'vn2', 'vn3', 'vn4', 'vn5', 'w', 'x', 'y', 'z', 'zh', 'õ', 'ü', 'ɐ', 'ɔ', 'ɛ', 'ɡ', 'ɾ', 'ʒ', '̃', '…']
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
  File "/content/GPT-SoVITS/GPT_SoVITS/prepare_datasets/3-get-semantic.py", line 62, in <module>
    vq_model.load_state_dict(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SynthesizerTrn:
	size mismatch for enc_p.text_embedding.weight: copying a param with shape torch.Size([322, 192]) from checkpoint, the shape in current model is torch.Size([331, 192]).
```

I'm using the Chinese HuBERT model and the Chinese BERT model. I believe this is a model mismatch; can anyone help?

SapphireLab commented 6 months ago

You can't add symbols yourself, because step 3 uses SynthesizerTrn (the generator of SoVITS), whose weights are fixed by the pretrained model. The shape of those weights depends on the number of symbols.
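
A minimal PyTorch sketch (not the repo's actual code) reproduces the failure mode: the checkpoint's text embedding was saved with 322 rows, while a model built from an enlarged symbol table allocates 331, and `load_state_dict` is strict by default.

```python
import torch.nn as nn

# Pretrained checkpoint: embedding sized for the original 322 symbols.
pretrained = nn.Embedding(num_embeddings=322, embedding_dim=192)
state = pretrained.state_dict()

# Current model: symbols.py was extended, so the table now has 331 rows.
model = nn.Embedding(num_embeddings=331, embedding_dim=192)

try:
    model.load_state_dict(state)  # strict=True by default
except RuntimeError as e:
    print("size mismatch" in str(e))  # True
```

This is exactly the `enc_p.text_embedding.weight` mismatch in the traceback, just isolated to a single layer.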

GeleiaComPepino commented 6 months ago

> You can't add symbols yourself, because step 3 uses SoVITS, which is fixed by the pretrained model.

So how do I put the IPA in the code?

SapphireLab commented 6 months ago

I think you should replace it with a pretrained model for Portuguese, or use the existing symbols to express Portuguese.

GeleiaComPepino commented 6 months ago

> I think you should replace it with a pretrained model for Portuguese, or use the existing symbols to express Portuguese.

So why does English (CMUdict) work with the Chinese pretrained model, but Portuguese wouldn't?

SapphireLab commented 6 months ago

The symbols used for training contain ARPAbet (for American English), consonants and vowels (for Chinese), and others (for Japanese). So currently you can train and infer in these three languages, and they will be pronounced correctly.

SapphireLab commented 6 months ago

I am not sure whether the phonemes of Portuguese can be transferred to ARPAbet. If they can, you can write some code to do the transfer.
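
As an illustration of that idea, such a transfer could be a simple substitution table. The pairs below are my own rough approximations for a few of the Portuguese IPA symbols from the log above, not a vetted phonological mapping:

```python
# Hypothetical, approximate IPA -> ARPAbet substitutions for Portuguese.
# Only nearest-sounding English phonemes; the accent will drift (see below).
IPA_TO_ARPABET = {
    "ɐ": "AH0",   # unstressed central vowel -> schwa-like AH
    "ɛ": "EH1",   # open-mid front vowel
    "ɔ": "AO1",   # open-mid back vowel
    "ʒ": "ZH",    # voiced postalveolar fricative, as in "measure"
    "ɾ": "R",     # the alveolar tap has no ARPAbet symbol; R is closest
}

def to_arpabet(ipa_phones):
    """Map IPA phones to ARPAbet, keeping unknown symbols unchanged."""
    return [IPA_TO_ARPABET.get(p, p) for p in ipa_phones]

print(to_arpabet(["ʒ", "ɐ", "ɾ"]))  # ['ZH', 'AH0', 'R']
```

Because every output symbol already exists in the pretrained table, this avoids the size mismatch entirely, at the cost of accent accuracy.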

GeleiaComPepino commented 6 months ago

I tried to use English ARPAbet symbols with similar phoneme sounds, but the accent comes out very different.

GeleiaComPepino commented 6 months ago

I was thinking about using a wav2vec model, because there is no Portuguese HuBERT model, only BERT.

SapphireLab commented 6 months ago

> I was thinking about using a wav2vec model, because there is no Portuguese HuBERT model, only BERT.

Do you want to replace the HuBERT in step 2 to extract features that are then input to the SynthesizerTrn? I think it may be helpful for getting Portuguese text, but I am not sure whether it works for pronunciation.

RVC-Boss commented 6 months ago

The language of the training dataset is not significant for HuBERT. In general, HuBERT training needs at least several thousand hours of speech data. You can use an open-source pretrained HuBERT trained on English or Chinese.

GeleiaComPepino commented 5 months ago

> The language of the training dataset is not significant for HuBERT. In general, HuBERT training needs at least several thousand hours of speech data. You can use an open-source pretrained HuBERT trained on English or Chinese.

But I need to use Portuguese IPA and pronounce it with a Brazilian accent, so I need a HuBERT model that supports Portuguese, since there is no way to use Portuguese with these HuBERTs, only English or Chinese.

GeleiaComPepino commented 2 months ago

> I think you should replace it with a pretrained model for Portuguese, or use the existing symbols to express Portuguese.

Do you know which pretrained model I would need to train to change the weight shapes?

chunping-xt commented 2 months ago

> The symbols used for training contain ARPAbet (for American English), consonants and vowels (for Chinese), and others (for Japanese). So currently you can train and infer in these three languages, and they will be pronounced correctly.

I am following your advice, but I accidentally used the 'v_without_tone' symbols, so my symbol set size is 355 > 322. The error is as follows:

```
RuntimeError: Error(s) in loading state_dict for SynthesizerTrn:
	size mismatch for enc_p.text_embedding.weight: copying a param with shape torch.Size([322, 192]) from checkpoint, the shape in current model is torch.Size([355, 192]).
```

Now what is the fix? Do I need to redo my phoneme set with 322 symbols, or do you have a pretrained model trained with 355 symbols?

jiangyiqiao commented 2 months ago

I tried changing the model pre-loading, and it ran successfully; someone may want to give it a try~~

*(screenshots)*
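
The screenshots aren't reproduced here, so the exact change is unknown. One plausible version of this kind of pre-load fix (my assumption, not the repo's code) is to pad the checkpoint's embedding before loading: copy the pretrained rows and append small randomly initialized rows for the newly added symbols, so `load_state_dict` no longer size-mismatches. The sketch uses a bare `nn.Embedding` as a stand-in for SynthesizerTrn, and `expand_text_embedding` is a hypothetical helper:

```python
import torch
import torch.nn as nn

def expand_text_embedding(state_dict, key, new_rows, dim=192):
    """Pad a checkpoint embedding (hypothetical helper, not in GPT-SoVITS).

    Keeps the pretrained rows and appends randomly initialized rows for
    the newly added symbols."""
    old = state_dict[key]                         # e.g. shape (322, 192)
    extra = torch.randn(new_rows - old.shape[0], dim) * 0.02
    state_dict[key] = torch.cat([old, extra], dim=0)
    return state_dict

# Usage sketch with a stand-in module instead of SynthesizerTrn:
ckpt = nn.Embedding(322, 192).state_dict()
ckpt = expand_text_embedding(ckpt, "weight", new_rows=331)
model = nn.Embedding(331, 192)
model.load_state_dict(ckpt)  # now succeeds
```

Note that the new rows are untrained, so the added symbols still need fine-tuning before they are pronounced sensibly.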
chunping-xt commented 2 months ago

@jiangyiqiao Awesome, I will try it right away. Thanks!

SapphireLab commented 2 months ago

> I think you should replace it with a pretrained model for Portuguese, or use the existing symbols to express Portuguese.
>
> Do you know which pretrained model I would need to train to change the weight shapes?

I just found a multilingual HuBERT, mHuBERT-147 (InterSpeech 2024); maybe you can give it a try.