Vidalnt / Applio

A simple, high-quality voice conversion tool focused on ease of use and performance.
https://applio.org
MIT License
1.58k stars, 254 forks

Incorrect Pronunciation of German Umlauts #621

Closed. moha6767 closed this issue 4 weeks ago.

moha6767 commented 4 weeks ago

While using the program, I've noticed that German umlauts (ä, ö, ü) are not pronounced correctly. Instead of the correct pronunciation, the umlauts are replaced with strange characters or sounds, making them almost inaudible. This significantly affects the clarity of the spoken output and hampers the usability of the program.

Steps to Reproduce:

  1. Launch the program and select any German voice.
  2. Input text that contains German umlauts (ä, ö, ü).
  3. Have the program read the text aloud.

Expected Behavior:

The German umlauts should be pronounced clearly and correctly, as they appear in the text.

Actual Behavior:

The umlauts are either replaced with unusual characters or are barely audible. This issue occurs regardless of the selected German voice.

  • Tried multiple German voices (both male and female).
  • Tested with various texts containing umlauts.

I have checked all available settings and tried different voices to resolve the issue, but without success. Since I am unable to find a solution, I kindly request an investigation and resolution of this bug at your earliest convenience.

High – The incorrect pronunciation of umlauts significantly impairs the usability of the program.

Thank you in advance for addressing this issue.

GabryB03 commented 4 weeks ago

This is caused by the limited range of pronunciations in VCTK, the dataset used to train the pre-trained models for the current RVCv2 architecture. All of its speakers speak English, in a variety of accents.

As a result, the dataset lacks the pronunciations of some phonemes and particular sounds that occur in other languages, such as German, French, Italian, Spanish and so on.

To resolve this, users can train with more data in order to "force" the model to learn the new sounds and phonemes, but this is not easy, because several hours of data may be needed.
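
Before recording extra training data, it can help to check whether the sentences you plan to read actually contain the sounds in question. The sketch below is only a rough illustration: it assumes you have plain-text transcripts of your recordings, it uses written characters as a crude proxy for phonemes, and the names `TARGET_CHARS`, `count_target_chars` and `missing_chars` are hypothetical helpers, not part of Applio or RVC.

```python
import unicodedata

# Assumption: the characters named in the issue, plus uppercase forms and ß.
TARGET_CHARS = "äöüÄÖÜß"


def count_target_chars(transcripts, targets=TARGET_CHARS):
    """Count how often each target character occurs across the transcripts."""
    counts = {ch: 0 for ch in targets}
    for line in transcripts:
        # Normalize to NFC so that "u" + combining diaeresis is counted as "ü".
        for ch in unicodedata.normalize("NFC", line):
            if ch in counts:
                counts[ch] += 1
    return counts


def missing_chars(transcripts, targets=TARGET_CHARS, minimum=1):
    """Return the target characters that occur fewer than `minimum` times."""
    counts = count_target_chars(transcripts, targets)
    return [ch for ch in targets if counts[ch] < minimum]


if __name__ == "__main__":
    sample = ["Die Tür ist grün.", "Wir hören schöne Lieder."]
    print(count_target_chars(sample))
    print(missing_chars(sample, targets="äöü"))
```

A character count says nothing about audio quality or speaker variety, so treat it as a planning aid for choosing recording scripts, not as a measure of what the model will learn.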

If you want to experiment with other pre-trained models for RVC, a developer called "MUSTAR" trained his own using data in many different languages, including German (~35 hours of data for that language). You can check it at this link: https://huggingface.co/MUSTAR/Rigel-rvc-base-pretrained-model

I hope that @blaisewf will decide to train the pre-trained models for the new V3 architecture (which will use BigVGAN V2 and BigVSAN) on VCTK together with other datasets, to get great quality across all languages, dialects, and pronunciations, as well as singing voices of all types, including those that are more prone to problems (subharmonic, tenor, soprano, whistling, belting, growl, scream, and so on).

blaisewf commented 4 weeks ago

Well, that's a problem with the TTS. It would be better to do a speech-to-speech conversion directly.
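
The thread does not diagnose why the TTS step mangles umlauts. One possibility worth ruling out, purely as an assumption, is that the input text reaches the TTS engine in a decomposed Unicode form or was decoded with the wrong encoding. The following sketch uses only the Python standard library; `prepare_tts_text` and `looks_like_mojibake` are hypothetical helper names, not functions from Applio or any TTS library.

```python
import unicodedata


def prepare_tts_text(text):
    """Return the text in NFC form so each umlaut is a single code point."""
    return unicodedata.normalize("NFC", text)


def _mojibake_markers(chars="äöüÄÖÜß"):
    """Build the sequences produced when UTF-8 bytes are misread as a legacy codec."""
    markers = set()
    for ch in chars:
        raw = ch.encode("utf-8")
        for codec in ("latin-1", "cp1252"):
            markers.add(raw.decode(codec))
    return markers


def looks_like_mojibake(text):
    """Heuristic check for umlauts that were decoded with the wrong encoding."""
    return any(marker in text for marker in _mojibake_markers())


if __name__ == "__main__":
    decomposed = "Tu\u0308r"  # "u" followed by a combining diaeresis
    print(len(decomposed), len(prepare_tts_text(decomposed)))
    print(looks_like_mojibake("gr\u00c3\u00bcn"))
```

If the text passes both checks and the output is still wrong, the cause more likely lies in the TTS voice itself, in which case the speech-to-speech route suggested above avoids the TTS step entirely.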