TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

FastSpeech 2 does not seem to be learning much from my dataset #500

Closed · PedroDKE closed this issue 3 years ago

PedroDKE commented 3 years ago

I'm trying to train a FastSpeech 2 model on a dataset I scraped myself (around 7 hours of speech across roughly 5000 audio files, resampled to 22050 Hz). I don't expect great results, but at least something to work with. My preprocessing steps are similar to the ones in the repository, except that I had to set my minimum trimming threshold to 10 dB. Beyond that I used the default preprocessing/normalizing steps, and when I check my files they look okay-ish (not great, but the recordings weren't made in a professional studio); some samples (MFCC, raw energy, and raw f0) can be seen in a previous issue, #496. My durations aren't perfect at the moment, but I think they're fine for some first training/testing.

After 200k iterations on my dataset I see that the model has not learned much. The durations seem to be okay, but the model struggles to learn f0 and energy. During the validation step I added my own custom plots that also show the ground-truth and predicted values of these. I also use the fastspeech2.v2.yaml configuration to have fewer parameters to train (but v1 seemed to have the same issue, so I think it might be related to my data), and I changed the maximum character length in tensorflow_tts/configs/fastspeech.py to fit my own dataset.
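For reference, the trimming/resampling step described above can be reproduced with librosa directly. This is a minimal sketch, assuming the 10 dB threshold and 22050 Hz target mentioned in this issue (the file name is hypothetical):

```python
import librosa

# Load and resample to 22050 Hz in one step.
audio, sr = librosa.load("sample.wav", sr=22050)

# top_db=10 is aggressive: anything quieter than 10 dB below the peak
# at either end of the clip is treated as silence and trimmed away.
trimmed, _ = librosa.effects.trim(audio, top_db=10)

print(f"{len(audio) / sr:.2f}s -> {len(trimmed) / sr:.2f}s after trimming")
```

A threshold this low can clip quiet speech onsets, so it is worth listening to a few trimmed files to confirm nothing voiced is being cut.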

From my TensorBoard results I can see that the f0 and energy losses in evaluation are very unstable compared to the duration loss. The training losses seem fine, I think, but when I try to generate speech from a sentence seen during training, the model's output is very poor (unrecognisable speech) and similar to what I see in my validation. [TensorBoard screenshot]

As can be seen from these two images from my validation set at 150k steps, the model did learn some of the MFCC features, but they're reproduced very poorly. When I look at the f0 and energy plots, they are way off, and my model does not seem to be learning them the right way.

First example: b'9-00012-f000125' [ground-truth and estimation plots]

Second example: b'9-00018-f000015' [ground-truth and estimation plots]

ZDisket commented 3 years ago

Sorry about the close, that was a misclick. How did you scrape your data?

PedroDKE commented 3 years ago

No problem. I scraped them from LibriVox and aligned them to the book text using aeneas. The alignments seem to be correct for 90-95%+ of the data I'm using in this case.
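For readers unfamiliar with aeneas, sentence-level forced alignment like the above can be driven from its Python API. A minimal sketch, assuming a plain-text transcript with one sentence per line (all paths are hypothetical):

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Language, input text format, and output sync-map format.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"

task = Task(config_string=config)
task.audio_file_path_absolute = u"/abs/path/chapter.mp3"
task.text_file_path_absolute = u"/abs/path/chapter.txt"
task.sync_map_file_path_absolute = u"/abs/path/chapter_map.json"

# Run the alignment and write begin/end timestamps per sentence.
ExecuteTask(task).execute()
task.output_sync_map_file()
```

The resulting JSON gives begin/end times per sentence, which is why aeneas works well for cutting audiobook chapters into utterances but cannot supply character- or phoneme-level durations on its own.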

ZDisket commented 3 years ago

@PedroDKE Are your durations from the aligner or from Tacotron?

PedroDKE commented 3 years ago

@ZDisket The aeneas aligner only works at the sentence level, so I trained my own Tacotron 2 on the data to extract durations. I talked about it a bit in another issue (those are worst-case scenarios; in most cases the durations cover the whole sentence. Not perfect, but I would assume it's good enough to get some preliminary results): https://github.com/TensorSpeech/TensorFlowTTS/issues/496#issuecomment-779798995
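As an aside, the usual way to turn a trained Tacotron 2 into per-character durations is to assign each decoder frame to the encoder symbol with the highest attention weight and count frames per symbol. A minimal sketch of that idea (not the repository's exact extraction script):

```python
import numpy as np

def durations_from_alignment(alignment: np.ndarray) -> np.ndarray:
    """alignment: (decoder_steps, encoder_steps) attention matrix from a
    trained Tacotron 2. Each mel frame is assigned to the input symbol it
    attends to most; the per-symbol frame counts are the integer durations
    FastSpeech 2 is trained against."""
    assigned = alignment.argmax(axis=-1)                        # (decoder_steps,)
    return np.bincount(assigned, minlength=alignment.shape[1])  # sums to decoder_steps

# Toy example: 6 mel frames over 3 input symbols.
align = np.array([[0.9, 0.05, 0.05]] * 2 + [[0.1, 0.8, 0.1]] * 3 + [[0.0, 0.2, 0.8]])
print(durations_from_alignment(align))  # -> [2 3 1]
```

If the attention is blurry or skips ahead (as in the worst cases linked above), these counts degrade, which directly hurts the duration targets FastSpeech 2 learns from.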

ZDisket commented 3 years ago

@PedroDKE I can't help you that much since I use phonemes and MFA, but I had a similar problem in the past with some multi-speaker datasets, and I fixed it by changing the learning rate schedule to the one in the LibriTTS example: https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml
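In Python terms, a warmup-then-decay schedule of the kind these yaml files configure looks roughly like the sketch below. The numbers are placeholders, not the values from fastspeech2libritts.yaml; take the real ones from the linked config:

```python
import tensorflow as tf

class WarmUpThenDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by polynomial decay (illustrative values only)."""

    def __init__(self, init_lr=1e-3, end_lr=5e-5, decay_steps=150_000, warmup_steps=4_000):
        super().__init__()
        self.init_lr = init_lr
        self.warmup_steps = float(warmup_steps)
        self.decay = tf.keras.optimizers.schedules.PolynomialDecay(
            init_lr, decay_steps, end_learning_rate=end_lr)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.init_lr * step / self.warmup_steps
        return tf.where(step < self.warmup_steps,
                        warmup_lr,
                        self.decay(step - self.warmup_steps))

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpThenDecay())
```

A longer warmup and gentler decay mainly smooth the early training updates, which is plausibly why it helped with the unstable eval losses described above.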

PedroDKE commented 3 years ago

@ZDisket My dataset is only a single speaker at the moment, but I plan to extend it to multiple speakers. I will try out the learning rate schedule, thank you! Is there anything you would recommend trying if this doesn't work?

By the way, I tried to use MFA too, but I found that at the character level it does not return a duration for each character. For example, when I have 40 characters (including punctuation), I only get around 36 duration values. This is the main reason I did not continue using MFA for my durations, but if this isn't a problem I can see if I can use it again.

ZDisket commented 3 years ago

@PedroDKE If you use MFA, you want to use it for phoneme durations and train a model on phonetic input. You can find extraction instructions here: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/mfa_extraction, although I don't use those scripts specifically.
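The count mismatch mentioned above is expected: MFA aligns phones (plus silence), not characters, so the number of intervals will not match the character count. A minimal sketch of turning an MFA TextGrid into frame-level phone durations, using the third-party textgrid package (the tier name, hop size, and path are assumptions; match them to your own preprocessing config):

```python
import textgrid  # pip install textgrid

HOP_SIZE, SAMPLE_RATE = 256, 22050  # must match the mel extraction settings

tg = textgrid.TextGrid.fromFile("utt.TextGrid")  # MFA output for one utterance
phones = next(tier for tier in tg.tiers if tier.name == "phones")

labels = [iv.mark if iv.mark else "sil" for iv in phones]  # empty marks = silence
# One duration per *phone*, in mel frames -- not per character, which is
# why 40 characters can map to ~36 intervals as described above.
durations = [int(round((iv.maxTime - iv.minTime) * SAMPLE_RATE / HOP_SIZE))
             for iv in phones]

print(list(zip(labels, durations)))
```

Rounding each interval independently can make the durations sum to slightly more or fewer frames than the mel spectrogram has, so duration extraction pipelines typically reconcile the total afterwards.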

PedroDKE commented 3 years ago

@ZDisket So I adjusted the learning rate schedule as you suggested, but I'm not sure I can see any improvement after 200k steps. Here is my TensorBoard: [TensorBoard screenshot]

And here is the first example again at approximately 145k steps. I'm not sure whether the model is better or not, but the MFCCs produced still aren't really useful (compared to other TensorBoards I've seen around this GitHub, my eval f0/energy losses also seem to be pretty high; they should be around 0.1 or so, I think?).

b'9-00012-f000125' [ground-truth and estimation plots]

janbijster commented 3 years ago

@PedroDKE Did you manage to improve the results? I'm experiencing a similar issue, also with data from LibriVox.

PedroDKE commented 3 years ago

@janbijster Yes and no. No, as in: for this particular audiobook/speaker I did not manage to improve the results. I think the audio/speech/microphone from this person might not be suited for TTS; mainly, the MFCC extractions did not look very expressive (if you compare my screenshots to others' MFCCs, you can see that my 'waves' are less expressive, in my opinion; I'm not sure why, as I haven't looked much into what MFCCs really are). Yes, as in: when I used another book (different speaker, microphone, etc.) with much clearer/more expressive MFCCs, I was able to produce speech.
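A quick way to do the kind of visual dataset vetting described above is to plot mel spectrograms from both speakers side by side. A librosa-based sketch (the file names are hypothetical, and the STFT parameters are assumptions that should match your preprocessing config):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
for ax, path in zip(axes, ["speaker_a.wav", "speaker_b.wav"]):
    y, sr = librosa.load(path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    librosa.display.specshow(
        librosa.power_to_db(mel, ref=np.max), sr=sr, hop_length=256,
        y_axis="mel", x_axis="time", ax=ax)
    ax.set_title(path)
plt.tight_layout()
plt.show()
```

Flat, smeared harmonics in one speaker's plots versus crisp ones in the other's is the kind of difference being described here; it often traces back to microphone quality, room reverb, or heavy compression in the source recording.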

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.