daxiangpanda opened this issue 4 years ago
Yes, it will work! We would love to see it trained on multi-language datasets like "Common Voice: A Massively-Multilingual Speech Corpus" https://arxiv.org/abs/1912.06670
I trained Mellotron on two different datasets (BIAOBEI and thchs30). The alignment map for BIAOBEI looks good, but the alignment map for thchs30 is not as good. Any idea why?
Also, I want to synthesize a simple song with the model trained on BIAOBEI. I use the inference_noattention function, with the rhythm matrix built from the MIDI rhythm, but the result is not good: the word timing is right, but the audio sounds terrible. Any idea why?
Can you share the function that generates the alignment map?
Hi, I had the same problem as you. I trained this model on BIAOBEI and the alignment seems good, but I got a terrible result after using inference_noattention to synthesize a song. Do you have any ideas?
Meanwhile, the alignment I get when training on thchs30 looks the same as yours.
Is 592694590 your QQ? Or should we add each other on WeChat?
It is my QQ.
Please share the rhythm, pitch contour, mel and audio outputs that you obtained on the model trained on BIAOBEI such that we can help.
@rafaelvalle
Thank you very much.
The figures of rhythm and mel for the training dataset are as follows. The training loss is 0.22.
The figures of rhythm, pitch, and mel for the test dataset are as follows. The original wav is a segment of a song. I used model.forward() to obtain the rhythm, then model.inference_noattention() to synthesize the song. The result does not seem good.
synthesis.zip
It seems to be an issue with the rhythm (alignment map) and the pitch (F0). Rhythm: between the 0th and 50th frames there is some unexpected back and forth, and after the 300th frame it is multimodal. Pitch: the F0 at the onset of the first syllable is 0, but the phoneme, I suppose, is a vowel, so Mellotron invents a pitch because none exists.
Try one of these things:
1) Run forward a few more times to see if you can get better attention.
2) Try making the distribution over each frame peakier. These lines should work (note that torch.softmax needs an explicit dim; apply it over the text-token dimension):
temperature = 0.1
rhythm = torch.softmax(rhythm / temperature, dim=-1)
3) Try adjusting the rhythm by hand.
For the pitch contour, try changing the parameters of the pitch extraction algorithm or try to adjust the pitch contour manually.
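To illustrate why dividing by a low temperature before the softmax makes each attention row peakier, here is a minimal pure-Python sketch (just an illustration of the trick, not the Mellotron code itself):

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

row = [1.0, 2.0, 3.0]          # one row of attention logits
plain = softmax(row)            # plain[2] ≈ 0.67: soft, spread-out weights
peaky = softmax(row, temperature=0.1)  # peaky[2] ≈ 1.0: nearly one-hot
```

Lower temperature amplifies differences between logits, concentrating almost all probability mass on the largest entry, which is what "more peaky" attention means here.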
Thank you for replying! I will try your way to solve it. Thanks for the suggestions again!
Let us know what works best!
Lack of GPU resources; I only have one P40, so it is a bit slow.
Is it because BIAOBEI is single-speaker and thchs30 is multi-speaker?
Did you try Mellotron with a Mandarin dataset? Any advice?
Hmm, I didn't try; I just compared the two datasets.
ok thanks
Could you please share your prepare.py for thchs30? I'd like to try it
You can add my QQ: 313514820.
I'm having similar problems with a multi-speaker Dutch corpus (Mozilla Common Voice), on approximately 17 hours of audio data with 400+ speakers. I'm using the pre-trained LibriTTS model as a warmup, and I have accidentally used the English cleaner. I think the latter should not be a huge problem, as it is only taking care of the abbreviations compared to the "transliteration setting". I'm not entirely sure how the CMU Arpabet interacts with the Dutch sentences.
The loss seems to converge, although somewhere around 0.05 would be better (LJSpeech-Tacotronish level):
Attention is diagonal:
Spectrograms
The synthesised audio follows the rhythm and pitch, but it is obviously blurred, not well-articulated speech, somewhere midway between Dutch and English.
@karkirowle
ARPAbet is based on English phonemes, and I would expect the model not to work well directly on languages other than English. Try training again with p_arpabet = 0.
In addition, take a look at the suggestions above.
@rafaelvalle
Note that p_arpabet does not currently do anything, see: https://github.com/NVIDIA/mellotron/pull/27 Nevertheless, I tried your other suggestions.
@karkirowle I updated the repo before sending you the message so that p_arpabet would work. Assuming that most Dutch words are not in the ARPAbet dictionary, changing it to p_arpabet=0 might not make a large difference, because the method returns the grapheme representation whenever the phoneme (ARPAbet) representation is not present. Still worth trying p_arpabet=0 after you pull from master.
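The fallback behavior described above can be sketched as follows. This is a hypothetical simplification (the function name, dictionary format, and brace notation are illustrative, not the repo's exact code): with probability p_arpabet a word is replaced by its ARPAbet entry if one exists, otherwise the raw graphemes pass through unchanged.

```python
import random

def maybe_arpabet(word, cmudict, p_arpabet, rng=random):
    """With probability p_arpabet, use the ARPAbet phonemes for `word`
    if the dictionary has them; otherwise fall back to graphemes."""
    entry = cmudict.get(word.lower())
    if entry is not None and rng.random() < p_arpabet:
        return "{" + " ".join(entry) + "}"
    return word

toy_dict = {"hello": ["HH", "AH0", "L", "OW1"]}
maybe_arpabet("hello", toy_dict, p_arpabet=1.0)  # "{HH AH0 L OW1}"
maybe_arpabet("hallo", toy_dict, p_arpabet=1.0)  # "hallo" (not in dict)
maybe_arpabet("hello", toy_dict, p_arpabet=0.0)  # "hello" (graphemes)
```

This is why p_arpabet=0 changes little for a language with poor dictionary coverage: most words already take the grapheme path.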
Trimming silences can also help improve the rhythm.
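For reference, silence trimming can be as simple as an energy threshold on fixed-size frames. The sketch below is a hypothetical pure-Python illustration of the idea (real pipelines typically use something like librosa's dB-based trimming instead):

```python
def trim_silence(samples, threshold=0.01, frame=256):
    """Drop leading and trailing frames whose mean absolute
    amplitude falls below `threshold`."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    loud = [sum(abs(s) for s in f) / len(f) >= threshold for f in frames]
    if not any(loud):
        return []  # the whole clip is silence
    first = loud.index(True)
    last = len(loud) - 1 - loud[::-1].index(True)
    return samples[first * frame:(last + 1) * frame]

# Silence, then a loud square-wave burst, then silence again:
wav = [0.0] * 512 + [0.5, -0.5] * 256 + [0.0] * 512
trimmed = trim_silence(wav)  # keeps only the 512-sample loud section
```

Removing leading/trailing silence keeps the attention from having to "park" on empty frames, which tends to clean up the rhythm.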
Any idea why I get this alignment map?
I use the BIAOBEI dataset with phone tags as input. This alignment map is the result at 30k iterations.
@rafaelvalle
I think the dataset is not large enough.
@z592694590 ARPAbet maps English graphemes to phonemes. You can try re-training with p_arpabet equal to 0 or using a representation that works with the Chinese language.
Have you solved this problem? How about training for more steps?
I met the same problem. Have you solved it? https://github.com/NVIDIA/mellotron/issues/31#issuecomment-800856284
The original Tacotron 2 code in https://github.com/NVIDIA/tacotron2/blob/master/text/symbols.py#L18 has
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet
whereas in this repo the symbols are
symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet
Since zero padding is applied to the text, the rest of the text is padded with the first symbol. Can this be a reason for not learning good attention when using a different symbol set?
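The padding concern can be made concrete. If index 0 is a real symbol rather than a dedicated pad token, zero-padded text becomes indistinguishable from a run of that symbol. A toy sketch (the symbol tables are heavily abbreviated, just to show the indexing issue):

```python
# Abbreviated toy symbol tables.
tacotron2_symbols = ["_pad", "-", "!", "a", "b"]  # index 0 is a dedicated pad
mellotron_symbols = ["!", "-", "a", "b"]          # index 0 is "!", a real symbol

def pad_ids(text, symbols, length):
    """Encode text to symbol ids and zero-pad to a fixed length."""
    ids = [symbols.index(c) for c in text]
    return ids + [0] * (length - len(ids))

# With the Tacotron 2 table the trailing zeros mean "pad"; with the
# Mellotron-style table they decode back to real "!" characters:
padded = pad_ids("ab", mellotron_symbols, 5)               # [2, 3, 0, 0, 0]
decoded = "".join(mellotron_symbols[i] for i in padded)    # "ab!!!"
```

So the model trained with the second table effectively sees every short utterance followed by spurious punctuation, which plausibly hurts attention at sequence ends.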
Some learnings from my recent experiments; firstly, thanks for providing this awesome code.
I was having a similar issue with attention: when training from scratch, the model was not able to learn attention after 30k iterations, and the attention output looked very similar to the ones above. I found that even if you are using a different symbol set or another language, initializing the model from the pretrained LibriTTS checkpoint actually makes it learn attention and converge faster. You can ignore the text embedding while loading the model if you are using a different language:
ignore_layers=['embedding.weight','speaker_embedding.weight']
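The warm-start idea boils down to dropping the language- and speaker-specific tensors from the pretrained checkpoint so they get re-initialized, while everything else is loaded. A minimal sketch of that filtering (plain dicts stand in for the real tensor state dict; the layer names follow the ignore_layers list above):

```python
def filter_checkpoint(state_dict, ignore_layers):
    """Return a copy of the checkpoint without the listed layers,
    so those layers fall back to fresh initialization."""
    return {k: v for k, v in state_dict.items() if k not in ignore_layers}

pretrained = {
    "embedding.weight": "...",          # text embedding (symbol-set specific)
    "speaker_embedding.weight": "...",  # speaker table (dataset specific)
    "decoder.linear.weight": "...",     # everything else transfers
}
kept = filter_checkpoint(
    pretrained, ["embedding.weight", "speaker_embedding.weight"]
)
# kept contains only "decoder.linear.weight"
```

In practice you would pass the filtered dict to the model with strict=False loading so the missing layers keep their random initialization.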
@arijitx Thanks for your reply. But I have already changed Mellotron's symbols and text_to_sequence to match Tacotron 2's.
@gongchenghhu Any progress there? I am facing a similar issue trying to train it on Hindi, but there is no progress.
Like Mandarin?