PlayVoice / vits_chinese

Best-practice TTS based on BERT and VITS, with some NaturalSpeech features from Microsoft; supports ONNX streaming output!
https://huggingface.co/spaces/maxmax20160403/vits_chinese
MIT License
1.17k stars, 167 forks

How can I train on English? #68

Open nivibilla opened 1 year ago

nivibilla commented 1 year ago

Hey,

Is it possible to adapt this model to train on English dataset?

Or should I just use normal VITS?

MaxMax2016 commented 1 year ago

This model is used to solve Chinese prosodic problems. So, what is your goal?

nivibilla commented 1 year ago

I tried fine-tuning VITS on a dataset. The voice itself was good, but the tone and prosody were not. Then I came across this model, which implements some features from NaturalSpeech to improve that, so I wanted to know whether it is possible to fine-tune this model on an English dataset.

nivibilla commented 1 year ago

@MaxMax2016 I understand this repo was made to solve Chinese prosodic problems, but can it be used with English data?

MaxMax2016 commented 1 year ago

natrual_loss https://github.com/PlayVoice/vits_chinese/blob/master/train.py#L267

Maybe this is useful for you: https://github.com/heatz123/naturalspeech

MaxMax2016 commented 1 year ago

@nivibilla I think an English BERT may be useful.

Each character that BERT processes corresponds to multiple pronunciation units: two in Chinese, and n in English. The BERT vector of each character is copied and expanded according to the number of pronunciation units that correspond to that character.
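The copy-and-expand step described above can be sketched in plain Python. This is only an illustration, not the repository's code: the function name `expand_for_phone_sketch`, the toy vectors, and the counts are all made up for the example, and plain lists stand in for tensors.

```python
from typing import List, Sequence


def expand_for_phone_sketch(
    char_embeds: Sequence[Sequence[float]],
    counts: Sequence[int],
) -> List[List[float]]:
    """Repeat each character vector once per pronunciation unit.

    Hypothetical helper for illustration; it mirrors the idea of copying
    a character's BERT vector for each of its pronunciation units.
    """
    if len(char_embeds) != len(counts):
        # One count is required for every character vector.
        raise ValueError(
            f"{len(char_embeds)} vectors but {len(counts)} counts"
        )
    expanded: List[List[float]] = []
    for vec, n in zip(char_embeds, counts):
        for _ in range(n):
            expanded.append(list(vec))  # copy so rows stay independent
    return expanded


# Toy input: three characters with 2-dimensional vectors (assumed values).
char_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Assumed counts: a 1-unit pause followed by two 2-unit Chinese characters.
counts = [1, 2, 2]

phone_embeds = expand_for_phone_sketch(char_embeds, counts)
print(len(phone_embeds))
```

For English, the counts would vary per token (the "n" above), which is why the number of vectors and the number of counts have to agree before expansion.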

yihuitang commented 1 year ago

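@MaxMax2016, I have a question. While trying to add English support, I ran into a case where char_embeds.size(0) and len(length) are not equal: the tokens and the phone items do not line up. What is a good way to handle this?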

text: [PAD]unfriendly[PAD]
phone_items: ['sil', 'AH0', 'N', 'F', 'R', 'EH1', 'N', 'D', 'L', 'IY0', 'sil']
tokens: ['[PAD]', 'u', '##n', '##fr', '##ien', '##d', '##ly', '[PAD]']

char_embeds.size(0): 8
len(length): 11
Traceback (most recent call last):
  File "vits_en_prepare2.py", line 125, in <module>
    char_embeds = prosody.expand_for_phone(char_embeds, count_phone)
  File "/root/vits_chinese/bert/ProsodyModel.py", line 67, in expand_for_phone
    assert char_embeds.size(0) == len(length)

If I tokenize first and then convert to phonemes, the pronunciation changes. For example, the first token of unfriendly is u, and the phonemes for u on its own become 'Y UW1'.
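One possible workaround, sketched here under assumptions and not confirmed by the maintainer, is to align at the word level instead of the wordpiece level: merge the `##` pieces back into words, pool their vectors into one vector per word, phonemize each whole word (so the pronunciation is unaffected by tokenization), and expand by the per-word phone count. The helper names, the mean pooling, and the dummy vectors below are all hypothetical; the per-word phone lists are written by hand from the example above and would come from a G2P step in practice.

```python
from typing import List, Sequence


def group_wordpieces(tokens: Sequence[str]) -> List[List[int]]:
    """Group token indices into words; '##' pieces join the previous word."""
    groups: List[List[int]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def pool_by_word(
    token_embeds: Sequence[Sequence[float]],
    groups: Sequence[Sequence[int]],
) -> List[List[float]]:
    """Mean-pool the wordpiece vectors of each word (an assumed choice)."""
    pooled: List[List[float]] = []
    for idxs in groups:
        dim = len(token_embeds[idxs[0]])
        pooled.append(
            [sum(token_embeds[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        )
    return pooled


def expand(
    word_embeds: Sequence[Sequence[float]],
    counts: Sequence[int],
) -> List[List[float]]:
    """Repeat each word vector once per phone of that word."""
    if len(word_embeds) != len(counts):
        raise ValueError("need exactly one phone count per word")
    return [list(vec) for vec, n in zip(word_embeds, counts) for _ in range(n)]


tokens = ["[PAD]", "u", "##n", "##fr", "##ien", "##d", "##ly", "[PAD]"]
# Phones per whole word, taken from the phone_items shown in this thread.
phones_per_word = [
    ["sil"],
    ["AH0", "N", "F", "R", "EH1", "N", "D", "L", "IY0"],
    ["sil"],
]
# Dummy 2-dimensional vectors standing in for BERT outputs.
token_embeds = [[float(i), float(i) * 2.0] for i in range(len(tokens))]

groups = group_wordpieces(tokens)
word_embeds = pool_by_word(token_embeds, groups)
counts = [len(p) for p in phones_per_word]
phone_embeds = expand(word_embeds, counts)
print(len(groups), len(phone_embeds))
```

With this grouping, the 8 wordpieces collapse to 3 words and the expansion yields 11 vectors, matching the 11 phone items. Taking the first wordpiece's vector instead of the mean would be another reasonable choice; which works better for prosody is an open question here.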