NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

try other language? #22

Open daxiangpanda opened 4 years ago

daxiangpanda commented 4 years ago

Like Mandarin?

rafaelvalle commented 4 years ago

Yes, it will work! We would love to see it trained on multi-language datasets like "Common Voice: A Massively-Multilingual Speech Corpus" https://arxiv.org/abs/1912.06670

daxiangpanda commented 4 years ago

I trained Mellotron on two different datasets (BIAOBEI and THCHS-30). The alignment map for BIAOBEI looks good, but the alignment map for THCHS-30 is not as good as BIAOBEI's. [alignment image] Any idea why? I also want to synthesize a simple song with the model trained on BIAOBEI. I use the function inference_noattention, with the rhythm matrix built from the MIDI note durations, but the result is not good (the word timing is right but the waveform sounds terrible). Any idea why? Can you share the function for building the alignment map?

z592694590 commented 4 years ago

Hi, I had the same problem as you. I trained this model on BIAOBEI and the alignment seems good, but I got a terrible result after using inference_noattention to synthesize a song. Do you have any ideas?

z592694590 commented 4 years ago

Meanwhile, the alignment map I trained on THCHS-30 looks the same as yours.

daxiangpanda commented 4 years ago

Is 592694590 your QQ? Or should we add each other on WeChat?

z592694590 commented 4 years ago

It is my QQ.

rafaelvalle commented 4 years ago

Please share the rhythm, pitch contour, mel, and audio outputs that you obtained from the model trained on BIAOBEI so that we can help.

z592694590 commented 4 years ago

@rafaelvalle Thank you very much. The rhythm and mel figures for the training dataset are shown below, and the training loss is 0.22. [rhythm and mel-spectrogram images]

The rhythm, pitch, and mel figures for the test data are shown below. The original wav is a segment of a song. I used model.forward() to obtain the rhythm, then used model.inference_noattention() to synthesize the song, but the result does not sound good. [rhythm, pitch, and mel images] synthesis.zip
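
For reference, a rough sketch of this two-step flow, loosely based on the repo's inference notebook; mellotron, x (the parsed batch for the reference utterance), text_encoded, mel, speaker_id, and pitch_contour are placeholders, and the exact tensor layouts should be checked against that notebook:

import torch

with torch.no_grad():
    # 1) Teacher-forced forward pass on the reference recording; the last output
    #    is the alignment (rhythm) between text tokens and mel frames.
    mel_out, mel_out_postnet, gate_out, rhythm = mellotron.forward(x)
    # Reorder the axes as done in the inference notebook before reuse.
    rhythm = rhythm.permute(1, 0, 2)

    # 2) Re-synthesize with the extracted rhythm, a chosen speaker, and an
    #    (optionally edited) pitch contour.
    mel_out, mel_out_postnet, gate_out, _ = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))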

rafaelvalle commented 4 years ago

It seems to be an issue with the rhythm (alignment map) and pitch (F0). Rhythm: between the 0th and 50th frames there is some unexpected back and forth, and after the 300th frame it is multimodal. Pitch: the F0 at the onset of the first syllable is 0, but the phoneme, I suppose, is a vowel, so Mellotron invents a pitch because none exists.

Try one of these things: 1) Run forward a few more times to see if you can get better attention. 2) Try making the distribution over each frame more peaky; these lines should work:

temperature = 0.1
rhythm = torch.softmax(rhythm / temperature, dim=-1)  # dim assumes the last axis indexes text tokens

3) Try adjusting the rhythm by hand.

For the pitch contour, try changing the parameters of the pitch extraction algorithm or try to adjust the pitch contour manually.
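
A minimal sketch of suggestion 2) plus a simple manual pitch fix, assuming rhythm is the alignment map from model.forward() with the text-token axis last and f0 is a (batch, 1, frames) pitch contour; adjust both assumptions to your tensors:

import torch

def sharpen_rhythm(rhythm, temperature=0.1):
    # Lower temperature -> a peakier distribution over text tokens per mel frame.
    return torch.softmax(rhythm / temperature, dim=-1)

def backfill_unvoiced_onsets(f0):
    # Replace zero F0 values (unvoiced frames or detection failures) with the next
    # voiced value, so that vowel onsets do not start from 0 Hz.
    f0 = f0.clone()
    n_frames = f0.shape[-1]
    for i in range(n_frames - 2, -1, -1):
        mask = f0[..., i] == 0
        f0[..., i][mask] = f0[..., i + 1][mask]
    return f0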

z592694590 commented 4 years ago

Thank you for replying! I will try your suggestions. Thanks again!

rafaelvalle commented 4 years ago

Let us know what works best!

daxiangpanda commented 4 years ago

I lack GPU resources (only one P40), so it's a bit slow.

VirtualMoon commented 4 years ago

Is it because BIAOBEI is single-speaker and THCHS-30 is multi-speaker?

daxiangpanda commented 4 years ago

Have you tried Mellotron with a Mandarin dataset? Any advice?

VirtualMoon commented 4 years ago

Emmm, I didn't try; I just compared the two datasets.

daxiangpanda commented 4 years ago

OK, thanks.

VirtualMoon commented 4 years ago

Could you please share your prepare.py for THCHS-30? I'd like to try it.

daxiangpanda commented 4 years ago

You can add my QQ: 313514820.

karkirowle commented 4 years ago

I'm having similar problems with a multi-speaker Dutch corpus (Mozilla Common Voice): approximately 17 hours of audio with 400+ speakers. I'm using the pre-trained LibriTTS model as a warm start, and I accidentally used the English cleaner. I think the latter should not be a huge problem, as it only handles abbreviations compared to the "transliteration" setting. I'm not entirely sure how the CMU ARPAbet interacts with the Dutch sentences.

The loss seems to converge, although somewhere around 0.05 would be better (LJSpeech/Tacotron-ish level): [loss curve screenshot]

Attention is diagonal: [attention plot screenshot]

Spectrograms: [spectrogram screenshot]

The synthesized audio follows the rhythm and pitch, but it is obviously blurred, not well-articulated speech, and sounds somewhere midway between Dutch and English.

rafaelvalle commented 4 years ago

@karkirowle

ARPAbet is based on English phonemes, and I would expect the model not to work well directly on languages other than English. Try training again with p_arpabet = 0.

In addition, take a look at the suggestions above.

karkirowle commented 4 years ago

@rafaelvalle

Note that p_arpabet does not currently do anything; see https://github.com/NVIDIA/mellotron/pull/27. Nevertheless, I tried your other suggestions:

  1. Not using ARPAbet/cmudict, I get similar loss values as with ARPAbet; I'm going to do a more rigorous loss-curve comparison in TensorBoard.
  2. The temperature trick seems to improve the rhythm.

rafaelvalle commented 4 years ago

@karkirowle I updated the repo before sending you the message so that p_arpabet would work. Assuming that most Dutch words are not in the ARPAbet dictionary, setting p_arpabet=0 might not make a large difference, because the method returns the grapheme representation whenever the phoneme (ARPAbet) representation is not present. Still worth trying p_arpabet=0 after you pull from master.
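
To illustrate the fallback described above (a simplified sketch, not the repo's exact code): with probability p_arpabet each word is looked up in CMUdict and replaced by its ARPAbet phonemes, and any word missing from the dictionary, which is nearly every Dutch or Chinese word, falls back to its graphemes, so p_arpabet barely matters for those languages.

import random

def maybe_arpabet(word, cmu_dict, p_arpabet=1.0):
    if random.random() < p_arpabet:
        phonemes = cmu_dict.get(word.lower())
        if phonemes is not None:
            return '{' + ' '.join(phonemes) + '}'  # curly braces mark ARPAbet tokens
    return word  # grapheme fallback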

Trimming silences can also help improve the rhythm.
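
One simple way to trim leading and trailing silence before extracting mels and F0, using librosa (not part of this repo; top_db is a threshold to tune per dataset):

import librosa

audio, sr = librosa.load('sample.wav', sr=22050)
trimmed, _ = librosa.effects.trim(audio, top_db=30)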

daxiangpanda commented 4 years ago

[alignment image] Any idea why I got this alignment map? I used the BIAOBEI dataset with phone tags as input. This alignment map is the result after 30k iterations. @rafaelvalle

z592694590 commented 4 years ago

I think the dataset is not large enough.

rafaelvalle commented 4 years ago

@z592694590 ARPAbet maps English graphemes to phonemes. You can try retraining with p_arpabet equal to 0, or using a representation that works for the Chinese language.

freenowill commented 4 years ago

Have you solved this problem? How about training for more steps?

gongchenghhu commented 3 years ago

I met the same problem. Have you solved it? https://github.com/NVIDIA/mellotron/issues/31#issuecomment-800856284

arijitx commented 3 years ago

The original Tacotron 2 code in https://github.com/NVIDIA/tacotron2/blob/master/text/symbols.py#L18 defines symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet,

whereas in this repo symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet.

Since zero padding is applied to the text, the padded positions map to the first symbol in the list. Could this be a reason for not learning good attention when using a different symbol set?
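
A small sketch of this padding concern (simplified symbol lists, just for illustration): text batches are zero-padded, so whatever symbol sits at index 0 is what the padded tail decodes to.

_pad = '_'
_punctuation = "!'(),.:;? "
_letters = 'abcdefghijklmnopqrstuvwxyz'

# Tacotron 2 style: a dedicated pad token occupies index 0.
symbols_tacotron2 = [_pad] + list(_punctuation) + list(_letters)
assert symbols_tacotron2[0] == _pad

# Mellotron style (simplified): no explicit pad symbol, so index 0 is '!'
# and the zero-padded tail of every sequence decodes to repeated '!' tokens.
symbols_mellotron = list(_punctuation) + list(_letters)
print(symbols_mellotron[0])  # '!'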

arijitx commented 3 years ago

Some learnings from my recent experiments; first of all, thanks for providing this awesome code.

I was having a similar issue with attention: when training from scratch, the model had not learned attention after 30k iterations, and the attention output was very similar to the ones above. I found that even if you are using a different symbol set or another language, initializing the model from the pretrained LibriTTS checkpoint makes it learn attention and converge faster. You can ignore the text embedding while loading the model if you are using a different language:

ignore_layers=['embedding.weight','speaker_embedding.weight']
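
A minimal sketch of warm-starting from the pretrained LibriTTS checkpoint while skipping the text and speaker embeddings, along the lines of warm_start_model() in train.py (the checkpoint path and model object are placeholders):

import torch

def warm_start(checkpoint_path, model, ignore_layers):
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    state_dict = checkpoint['state_dict']
    # Drop layers whose shapes depend on the symbol set / number of speakers.
    state_dict = {k: v for k, v in state_dict.items() if k not in ignore_layers}
    model_dict = model.state_dict()
    model_dict.update(state_dict)
    model.load_state_dict(model_dict)
    return model

# model = warm_start('mellotron_libritts.pt', model,
#                    ignore_layers=['embedding.weight', 'speaker_embedding.weight'])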

gongchenghhu commented 3 years ago

@arijitx Thanks for your reply, but I have already changed Mellotron's symbols and text_to_sequence to match Tacotron 2.

rasenganai commented 2 years ago

@gongchenghhu Any progress there? I am facing a similar issue trying to train it on Hindi, but there is no progress.