DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Question about fine-tuning a language, Spanish #196

Open juangea opened 1 month ago

juangea commented 1 month ago

Hello.

I'm finding Toucan very exciting. In Spanish, I notice that the R is never pronounced correctly. Is there a way to improve it?

It seems that providing a reference audio for cloning is not enough to improve this.

Thanks for the answer and for Toucan!

juangea commented 1 month ago

To explain the problem with the R in Spanish:

image

The letter R can be pronounced as a hard R, which we write as a double R, as in "perro" (dog). Currently the system pronounces "pero" instead of "perro". Maybe it is a matter of adding something to the phonemizer or something similar. I'm not sure, I just wanted to be clear.

The letter itself is called "erre", not "ere", as you can see in the picture.

image

Flux9665 commented 1 month ago

Just to make sure I understand correctly: the problem is that the system produces a [r] in places where it should produce a [ɾ] instead?

I just double checked the canonical pronunciation of pero, which is [ˈpeɾo] and the canonical pronunciation of perro, which is [ˈpero]. So the system seems to use the correct phonemes.
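
For reference, the textbook rule behind these two transcriptions can be sketched in a few lines. This is only a simplified illustration and not the toolkit's phonemizer: the function name `rhotics` is made up for this example, and the rule ignores syllable-final variation and prefixed words.

```python
import re

TRILL = "r"  # alveolar trill, as in "perro"
TAP = "ɾ"    # alveolar tap, as in "pero"


def rhotics(word: str) -> list[str]:
    """Return the expected rhotic phone for each r/rr in a Spanish word.

    Simplified rule: written "rr", word-initial "r", and "r" after
    n, l or s are trills; every other "r" is treated as a tap.
    """
    word = word.lower()
    result = []
    for match in re.finditer(r"r+", word):
        start = match.start()
        doubled = len(match.group()) > 1
        word_initial = start == 0
        after_nls = start > 0 and word[start - 1] in "nls"
        result.append(TRILL if doubled or word_initial or after_nls else TAP)
    return result


if __name__ == "__main__":
    for w in ["pero", "perro", "Roque", "rabo", "verde"]:
        print(w, rhotics(w))
```

Under this rule, "perro", "Roque" and "rabo" all get the trill, while "pero" and "verde" get the tap.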

Is it maybe a regional difference? If you set the language to Spanish, it will be Spanish as it is spoken in Spain in Europe. For the South-American varieties of Spanish, try setting the language to "Latin American Spanish" (or "spa-lat" in the internals of the toolkit).

juangea commented 1 month ago

I’m from Spain, so I’m referring to Spanish from Spain.

The spoken R in “perro” is too soft; the double R should be much more pronounced. Maybe there is an r phoneme that is more pronounced, or the sound for that phoneme is too soft.

I quickly prepared an example with ElevenLabs from my phone. The phrase is “el perro de San roque tiene rabo”. Test it in Toucan and I think you will notice the difference in the R, also in "Roque" and "Rabo". All of those are hard Rs.

https://github.com/user-attachments/assets/3b6f552b-9bc1-4289-9801-3870b435c777

Thanks for answering!

Flux9665 commented 1 month ago

I tried out a couple of sentences and a couple of speakers. The phonemes that the phonemizer produces seem correct, but the system does not always pronounce them the right way, as you say. I am not sure what the reason is, but I believe it is because our training data has a mix of dialects and not just "standard" Spanish. So the model learned an association that some speakers tend to pronounce the alveolar tap [ɾ] more strongly and some more weakly. This is something that should be linked to the language embedding and not the speaker embedding, but since we don't know which of the speakers in our training data speak which dialect, we just assigned the "Spanish" language to all of them and the model got confused.

Unfortunately I don't think there's an easy fix. Without better training data, the model is limited. We can however make use of the fact that the model is using articulatory configurations as its input and modify those to include the stressed flag on the alveolar taps in Spanish and see if that helps. I'll try out some things and let you know when I find something that works.
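
Roughly, the idea looks like the sketch below. Everything in it is a placeholder for illustration: the feature names and their order, the `vector` and `featurize` helpers, and the language code "spa" are assumptions, not the toolkit's actual frontend.

```python
# Hypothetical feature layout; the real feature names and order may differ.
FEATURES = ["stressed", "consonant", "alveolar", "tap", "trill", "voiced"]


def vector(**active):
    """Build a binary articulatory vector from the named active features."""
    return [1 if active.get(name) else 0 for name in FEATURES]


PHONE_TO_VECTOR = {
    "ɾ": vector(consonant=True, alveolar=True, tap=True, voiced=True),
    "r": vector(consonant=True, alveolar=True, trill=True, voiced=True),
}


def featurize(phones, language, target="ɾ", languages=("spa",)):
    """Map phones to vectors, setting the stressed flag on `target`
    only when the language is one of `languages` (assumed codes)."""
    stressed_idx = FEATURES.index("stressed")
    out = []
    for phone in phones:
        # Phones missing from the table stay all-zero in this sketch.
        vec = list(PHONE_TO_VECTOR.get(phone, vector()))
        if language in languages and phone == target:
            vec[stressed_idx] = 1
        out.append(vec)
    return out


if __name__ == "__main__":
    print(featurize(["p", "e", "ɾ", "o"], "spa")[2])
```

The `target` parameter makes it easy to try the same trick on the trill [r] instead of the tap.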

Flux9665 commented 1 month ago

I made three different attempts at improving it, but since in German we don't differentiate between the alveolar trill and the alveolar tap, and I last learned Spanish in school 10 years ago, it is hard for me to hear the difference.

Please let me know if any of the following three versions is significantly better than the others.

fixing_alveolar_trill.zip

juangea commented 1 month ago

Maybe number 3 is better. In general it is better, but the only word with a true hard R in that text is "ramas" after "verdes".

That R is not correct yet; right now it is close to how a French person would pronounce the hard R in Spanish.

If you try the phrase I gave you, you should notice a difference from the audio I gave you.

I mean this phrase: “el perro de San roque tiene rabo”

In that text, we have "peRRo", "Roque" and "Rabo"; those are hard Rs. The R in "veRde", for example, is better now, but the hard R is not there yet.

Thanks for the improvement! :)

Flux9665 commented 1 month ago

Number 3 was actually the control sample with no changes, so it seems the changes in 1 and 2 were not effective.

I gave it another try with the phrase you posted:

fixing_alveolar_trill.zip

juangea commented 1 month ago

hahaha, interesting :)

In this case the hard R is not correct in any of them. However, the one where I can hear it hardest is number 4: in "Roque" you can hear something similar to how it should be spoken, not there yet, but better. In "peRRo" you can hear that the sound is not the same, and it should be much more similar. "Rabo" is also a bit better, more similar to "Roque".

Flux9665 commented 1 month ago

I think the best fix would be to increase the duration of any alveolar trill [r], but only if the language is Spanish. It's a bit difficult, but I will try in the next few days.
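
As a rough sketch of what I mean, assuming the per-phone durations are available as integer frame counts before synthesis: the function name `lengthen_trills`, the factor of 1.5 and the language code "spa" are placeholders, not existing toolkit code or tuned values.

```python
def lengthen_trills(phones, durations, language,
                    factor=1.5, target="r", languages=("spa",)):
    """Scale the duration of every `target` phone by `factor`,
    but only if `language` is in `languages`.

    `durations` holds one frame count per phone. Other phones and
    other languages are returned unchanged.
    """
    if len(phones) != len(durations):
        raise ValueError("phones and durations must have the same length")
    if language not in languages:
        return list(durations)
    return [
        round(d * factor) if p == target else d
        for p, d in zip(phones, durations)
    ]


if __name__ == "__main__":
    phones = ["p", "e", "r", "o"]   # "perro" with the trill
    durations = [3, 5, 4, 6]        # made-up frame counts
    print(lengthen_trills(phones, durations, "spa"))
```

The tap [ɾ] is deliberately left alone, so "pero" and "perro" should end up further apart in duration.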

juangea commented 1 month ago

It will for sure improve Spanish.

Thanks!