DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0
1.17k stars 135 forks

Nasal ɛ̃ ("in" in French) mispronounced only on female voices (not on male voices) #149

Closed Ca-ressemble-a-du-fake closed 2 weeks ago

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

I am using v2.4 because it clones voices (voice similarity) better than v2.5; it really works great. However, there is a problem, already solved elsewhere, regarding the nasal ɛ̃ ("in" as in "sein", "chemin", "patin", ...).

I applied only the changes to TextFrontend.py, Aligner.py, AlignerDataset.py and articulatory_features.py that were committed on March 19th (https://github.com/DigitalPhonetics/IMS-Toucan/issues/109#issuecomment-1475030586) and then launched training from a clean installation (I used the fine-tuning example and only changed the learning rate to 1e-5).

With this "patch", the Meta model pronounces words containing "in" correctly. When I fine-tune Meta on small datasets, the cloning (voice similarity) is excellent, but the pronunciation is perfect only for male voices.

Female voices still make the above mistakes, as if I had not changed the four files cited above.

I am surprised, because the Meta model is mostly female and has correct pronunciation, and above all, gender should not matter. Why do only female voices make these mistakes?

Or perhaps it has nothing to do with gender and is rather a matter of insufficient data (female dataset lengths: 4 min and 18 min; male dataset lengths: 15 min and 50 min). Or did I miss changing some other files?

Please note: the text to be read is identical for the male and female voices.

Thanks in advance for your help/advice.

Ca-ressemble-a-du-fake commented 1 year ago

Actually, I believe you already solved all of this in v2.5 (so sorry to ask a "backward" question, but v2.4 works better for me as far as voice similarity is concerned). Would you mind telling me which changes I should make to v2.4 to always get the correct pronunciation of ɛ̃? Should I use the v2.5 Meta model (if compatible)? (Of course not, since it is Toucan rather than PortaSpeech.)

Ca-ressemble-a-du-fake commented 1 year ago

Fine-tuning Meta on the Siwis dataset (6k steps overall), and then fine-tuning the resulting Siwis model on my dataset (6k steps overall), gave better results, but not perfect ones: "pin" (pine tree) is pronounced correctly whereas "pain" (bread) is not, even though they should sound exactly alike. "Sein" on its own is pronounced correctly, but "sein nu" is not.

Please note that the Siwis model pronounces all of these terms correctly.

Flux9665 commented 11 months ago

In theory there should be no difference between male and female speakers; they are handled exactly the same. I'm not sure what causes this. Maybe it really is the different amounts of fine-tuning data for the male and female speakers. Or maybe the mistake already happens in the phonemizer for some instances in the training data, and it happens more often in the female sentences than in the male ones. I'm not sure about this problem.

Ca-ressemble-a-du-fake commented 11 months ago

Thanks for your answer. Do you have a suggestion for how to tell where the problem comes from? The phonemizer (eSpeak) output is correct in all the cases I referred to ("pin", "pain", "sein", "sein nu").

Flux9665 commented 11 months ago

Hmm, unfortunately I don't really know how to diagnose this, because it happens somewhere within the end-to-end nature of the model. The data that goes in is fine, the labels that go in are fine, but somehow the model still decides to behave differently. And it's not even consistent across all data; it's only a problem for female voices.

I don't know where the problem comes from, but a heuristic fix could be to increase the duration of just the problematic phone. This could be done here:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/b4991d48fc3f6f576f8c937cc117e1cdd923ad55/InferenceInterfaces/InferenceArchitectures/InferenceToucanTTS.py#L214

You could add a small offset, maybe +3 or so, to all occurrences of the problematic phone. If a phoneme is too short, the end-to-end model sometimes just smooths over it.

Ca-ressemble-a-du-fake commented 11 months ago

Thanks for the fix suggestion and for trying to explain the problem. Should I replace 0.0 with 3 in the "phoneme" if-statement, or create a new if-statement for "nasalized"?

By the way, would it help the project if I shared the samples or the dataset with you?

Flux9665 commented 10 months ago

Sorry again for taking so long to respond. What I mean is adding another if-clause that checks for all of the features that describe ɛ̃, to make sure that the phoneme at this position is actually only the one where the mistake occurs, and then adding 3 to the predicted duration. Maybe 3 is not ideal; this might be something worth playing around with. And it's a bit unsatisfying, because it's more of a workaround than a fix, and it might not even work properly.
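To illustrate the workaround being described, here is a minimal sketch of the idea: match the articulatory feature combination of ɛ̃ and add a small offset to the predicted duration at those positions. The function name, the dictionary-based feature representation, and the exact feature set for ɛ̃ are all assumptions for illustration, not the actual IMS-Toucan API, which operates on tensors inside InferenceToucanTTS.

```python
def lengthen_nasal_e(durations, phone_features, offset=3):
    """Add `offset` frames to every phone whose articulatory features
    match the nasal vowel ɛ̃; all other durations are left unchanged.

    Hypothetical feature set for ɛ̃ (open-mid front unrounded nasal vowel);
    the real articulatory_features.py encoding may differ.
    """
    target = {"vowel": 1, "nasal": 1, "open-mid": 1, "front": 1}
    adjusted = []
    for dur, feats in zip(durations, phone_features):
        # Only bump the duration when every target feature matches.
        if all(feats.get(k) == v for k, v in target.items()):
            adjusted.append(dur + offset)
        else:
            adjusted.append(dur)
    return adjusted
```

For example, with per-phone durations `[4, 2, 5]` where only the second phone carries the ɛ̃ features, the result would be `[4, 5, 5]`. The offset value (here 3) is the knob worth experimenting with, as noted above.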

And regarding the dataset: if I can include it in future pretrained models that I upload, then more data is always nice to have :) The infrastructure I'm currently experimenting with is mostly there to handle extremely large amounts of data, so once that's done, I plan to simply add as much data as I can find to the pretraining, to hopefully build a very strong multilingual model.

Ca-ressemble-a-du-fake commented 10 months ago

Thanks for your reply, and no problem about the response time (it's the summer holiday period here)! I will wait for your new version. Regarding the dataset, I had not thought about it, but the WAVs were extracted from YouTube without the speakers' authorization, so it's not possible (legally) to share it openly.