DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Palatalized consonants in Slavic languages #112

Closed: quernd closed this issue 1 year ago

quernd commented 1 year ago

This relates to #111, but since it's an issue with languages already supported rather than future expansion, I decided to open a separate issue.

First off, thanks for your great work. Very approachable code, lots of interesting linguistically informed choices, and the pre-trained models are extremely valuable.

I worked a bit with the PortaSpeech Meta model for Polish TTS and noticed an issue. Polish has the notion of "palatalized consonants", some of which are represented by [ʲ] (superscript j) in espeak. This distinction is also sometimes called "hard" vs. "soft" (palatalized) consonants. Here is a minimal pair, for instance:

pasta [pasta], piasta [pʲasta]

Unfortunately, right now the TextFrontend removes all occurrences of [ʲ]: https://github.com/DigitalPhonetics/IMS-Toucan/blob/4f17ce25dee0fdedd072bbaff7f24176b3506f34/Preprocessing/TextFrontend.py#L380-L385 This can be reproduced in the ThisSpeakerDoesNotExist demo. You will notice no difference between "pasta" and "piasta" in either the spectrogram or the waveform, simply because the phoneme input to the model is identical.

There are competing phonological analyses of palatalized consonants, but one valid way to deal with this is a "decomposed palatalization", where we simply replace [ʲ] by [j]. See e.g. this footnote in the Wiktionary guide to Polish pronunciation, which uses [j] throughout rather than [ʲ]:

so pies is pronounced as if it were spelt ⟨pjes⟩

Replacing [ʲ] by [j] in the TextFrontend instead of dropping it already gives us a decent pronunciation thanks to the robustness of your multilingual approach, and finetuning is completely unproblematic.
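To make the difference concrete, here is a minimal sketch of the two behaviours. It is not the actual TextFrontend code: the function names are made up for illustration, and it assumes the phonemizer output is a plain string in which palatalization is marked with U+02B2 (ʲ).

```python
# Hypothetical helpers for illustration only; not the toolkit's API.
PALATALIZATION_MARK = "\u02b2"  # ʲ, MODIFIER LETTER SMALL J


def drop_palatalization(phonemes: str) -> str:
    """Behaviour described in this issue: the mark is removed entirely."""
    return phonemes.replace(PALATALIZATION_MARK, "")


def decompose_palatalization(phonemes: str) -> str:
    """Proposed fix: the mark becomes a separate [j] segment."""
    return phonemes.replace(PALATALIZATION_MARK, "j")


pasta = "pasta"
piasta = "p" + PALATALIZATION_MARK + "asta"

# Dropping the mark collapses the minimal pair into one phoneme string.
print(drop_palatalization(pasta) == drop_palatalization(piasta))

# Decomposing keeps the two words distinct.
print(decompose_palatalization(pasta), decompose_palatalization(piasta))
```

With dropping, both words reach the model as the same input; with decomposition, "piasta" keeps an extra [j] segment that the model already knows from other contexts.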

I can open a PR with this tiny change, but I'm not sure that this is the right solution in the general case. I also tried a more feature-based implementation. There is no "palatalized" feature, so I re-used the "palatal" feature and added it to the preceding consonant. It's a bit hacky, and finetuning alone was not enough to achieve satisfactory results. Likely, some more extensive training or retraining from scratch is required because this feature combination was not present in the training data. I'm curious to hear what you think about this, since you seem to be open to the idea of adding articulatory features.
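Roughly, the feature-based variant looks like the sketch below. The feature inventory here is a toy stand-in with made-up values, not the toolkit's articulatory feature table, and the function name is hypothetical. The only point is that [ʲ] does not become a segment of its own but adds "palatal" to the consonant before it.

```python
from typing import Dict, FrozenSet, List, Set

PALATALIZATION_MARK = "\u02b2"  # ʲ, MODIFIER LETTER SMALL J

# Toy feature inventory (assumption for illustration, not the real table).
TOY_FEATURES: Dict[str, FrozenSet[str]] = {
    "p": frozenset({"consonant", "bilabial", "plosive", "unvoiced"}),
    "t": frozenset({"consonant", "alveolar", "plosive", "unvoiced"}),
    "s": frozenset({"consonant", "alveolar", "fricative", "unvoiced"}),
    "a": frozenset({"vowel", "open", "front", "unrounded"}),
}


def to_feature_sets(phonemes: str) -> List[Set[str]]:
    """Map each phoneme to a feature set; ʲ modifies the preceding one."""
    result: List[Set[str]] = []
    for symbol in phonemes:
        if symbol == PALATALIZATION_MARK:
            if not result:
                raise ValueError("palatalization mark without a preceding phoneme")
            # Re-use the existing "palatal" feature on the previous segment.
            result[-1].add("palatal")
        else:
            # Copy so the shared inventory is never mutated.
            result.append(set(TOY_FEATURES[symbol]))
    return result


hard = to_feature_sets("pasta")
soft = to_feature_sets("p" + PALATALIZATION_MARK + "asta")
print(len(hard), len(soft))                 # same number of segments
print(sorted(soft[0] - hard[0]))            # features only the soft [p] has
```

Unlike the decomposed variant, the sequence length stays the same, and the first segment ends up with a combination (bilabial plus palatal) that a model trained without this modification has never seen.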

Lastly, palatalization also occurs in Russian (disclaimer: I know no Russian) and there seem to exist minimal pairs, e.g.:

быть [bɨtʲ], быт [bɨt].

For more examples in Russian see this discussion.

Flux9665 commented 1 year ago

Thank you for the detailed explanation, and sorry for my late reply; the last few weeks were very busy.

The decomposed palatalization sounds like a good fix, I will use it for training the new multilingual model for the next release. Using the articulatory features would enable us to set the palatal feature to True, like you mentioned, but this requires some more hacky modifications to the aligner, because the aligner works only with the identities of phonemes, rather than the feature vectors. The decomposed approach to palatalization would completely avoid this problem, so I like it. For the other unsupported characters I still have to find a clean solution, but I will include the palatalization in the next release (in just a few days hopefully).

Flux9665 commented 1 year ago

The current version of the toolkit and the associated pretrained model are now trained with decomposed palatalization. I asked a Russian colleague and he says it's definitely an improvement, so thank you once again for the suggestion!

quernd commented 1 year ago

Thank you! I tried out the new model and it's a great improvement for Polish too.