erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, narrator, model finetuning, custom models, and wav file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

problem with Polish language #292

Closed waan1 closed 2 months ago

waan1 commented 2 months ago

standard installation, Linux, Nvidia 4090

Describe the bug

Pronunciation of Polish is pretty bad because special Polish characters such as ą and ę are either replaced with unrelated characters such as Ä, Å, Ã or dropped entirely. I tried all combinations of "tts_method_api_local": true, "tts_method_api_tts": false, "tts_method_xtts_local": false and -d "text_filtering=none", -d "text_filtering=standard", -d "text_filtering=html", but could not find a working combination.

To Reproduce

Steps to reproduce the behavior:

confignew.json:

{"activate": true, "autoplay": false, "branding": "AllTalk ", "narrator_enabled": false, "deepspeed_activate": false, "delete_output_wavs": "Disabled", "ip_address": "127.0.0.1", "language": "English", "low_vram": false, "local_temperature": "0.7", "local_repetition_penalty": "10.0", "tts_model_loaded": true, "tts_model_name": "tts_models/multilingual/multi-dataset/xtts_v2", "narrator_voice": "female_01.wav", "output_folder_wav": "extensions/alltalk_tts/outputs/", "output_folder_wav_standalone": "outputs/", "port_number": "7851", "remove_trailing_dots": true, "show_text": false, "tts_method_api_local": false, "tts_method_api_tts": false, "tts_method_xtts_local": true, "voice": "female_01.wav"}

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=Idę do sklepu. Moja dziewczyna Lila poprosiła mnie o zakupy. Chce zrobić dużą kolację z okazji swoich urodzin. Zaprosiła aż dwanaście osób. Muszę kupić wiele rzeczy. Na szczęście mam listę zakupów. Mam na niej wszystko napisane." -d "text_filtering=none" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=true" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=pl" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=false" -d "autoplay_volume=0.8"

{"status":"generate-success","output_file_path":"/home/igorm/agent-city/tts/alltalk_tts/outputs/myoutputfile_1722784234_combined.wav","output_file_url":"http://127.0.0.1:7851/audio/myoutputfile_1722784234_combined.wav","output_cache_url":"http://127.0.0.1:7851/audiocache/myoutputfile_1722784234_combined.wav"}

[AllTalk TTSGen] Character (Text-not-inside)
[AllTalk TTSGen] IdÄ do sklepu. Moja dziewczyna Lila poprosiÅa mnie o zakupy. Chce zrobiÄ duÅÄ kolacjÄ z okazji swoich urodzin. ZaprosiÅa aÅ dwanaÅcie osÃb. MuszÄ kupiÄ wiele rzeczy. Na szczÄÅcie mam listÄ zakupÃw. Mam na niej wszystko napisane.
[AllTalk TTSGen] 2.58 seconds. LowVRAM: False DeepSpeed: False

Text/logs Server start log:

[AllTalk Startup] Config file check : No Updates required
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[AllTalk Startup] AllTalk startup Mode : Standalone mode
[AllTalk Startup] WAV file deletion : Disabled
[AllTalk Startup] DeepSpeed version : 0.14.4
[AllTalk Startup] Model is available : Checking
[AllTalk Startup] Model is available : Checked
[AllTalk Startup] Current Python Version : 3.11.9
[AllTalk Startup] Current PyTorch Version: 2.2.1+cu121
[AllTalk Startup] Current CUDA Version : 12.1
[AllTalk Startup] Current TTS Version : 0.22.0
[AllTalk Startup] Current TTS Version is : Up to date
[AllTalk Startup] AllTalk Github updated : 1st July 2024 at 08:57
[AllTalk Startup] TTS Subprocess : Starting up
[AllTalk Startup]
[AllTalk Startup] AllTalk Settings & Documentation: http://127.0.0.1:7851
[AllTalk Startup]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[AllTalk Startup] Info PortAudio library not found. If you wish to play TTS in standalone mode through the API suite
[AllTalk Startup] Info please install PortAudio. This will not affect any other features or use of Alltalk.
[AllTalk Startup] Info If you don't know what the API suite is, then this message is nothing to worry about.
[AllTalk Startup] Info On Linux, you can use the following command to install PortAudio:
[AllTalk Startup] Info sudo apt-get install portaudio19-dev
[AllTalk Model] XTTSv2 Local Loading xttsv2_2.0.2 into cuda
[AllTalk Model] Coqui Public Model License
[AllTalk Model] https://coqui.ai/cpml.txt
[AllTalk Model] Model Loaded in 6.32 seconds.
[AllTalk Model] Ready

waan1 commented 2 months ago

I have a standalone setup. English works perfectly.

erew123 commented 2 months ago

Hi @waan1

Although you are on Linux, the principle of what I am about to show/tell you is the same. The issue you are encountering is a limitation of your Terminal/console and its current text encoding standard. This is different from how a web-browser will handle text encoding.

In the example below, you can see that I have sent your exact text through both a web browser and a console (Windows, mind, but as I say, it's the same principle). The text that went through the web browser remained unchanged, while the text sent via the curl command at the command prompt/terminal was changed, because my command prompt/terminal is not set to work with the Polish character set (Polish uses an extended Latin alphabet, not Cyrillic).

As such, the change to the text is occurring before it even reaches AllTalk.
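The pattern in the AllTalk log output earlier in this issue (Ä, Å, Ã where ę, ł, ó should be) is what UTF-8 bytes look like when they are decoded with a single-byte Latin encoding. Below is a minimal Python sketch of that effect. The helper names `garble` and `repair` are hypothetical, and the sketch only illustrates the mechanism; it does not show where in the chain the mis-decoding happens.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the same bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo garble(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


sample = "Idę do sklepu"
broken = garble(sample)

# 'ę' is U+0119, stored in UTF-8 as the two bytes C4 99. Read as Latin-1,
# they become 'Ä' followed by an invisible control character, which is
# why the log appears to show 'IdÄ'.
print(repr(broken))
print(repair(broken) == sample)
```

Because the second byte of each pair often maps to a non-printing character, the garbled text can look as if characters were swallowed rather than replaced.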

image

I HAVE NOT tried what I am about to put here; this is an approximate guess at what is needed on Linux, so do your own research on this.

You would probably need to install the language packs:

sudo apt-get install language-pack-pl language-pack-ru

Reconfigure your locales:

sudo locale-gen pl_PL.UTF-8
sudo dpkg-reconfigure locales

You can temporarily set the locale/extend the character set of any terminal you open with:

export LANG=pl_PL.UTF-8
export LC_ALL=pl_PL.UTF-8

Not sure on this step, but you may need to set your keyboard configuration. HOWEVER, this would change the keyboard to a Polish key layout (I believe), so if you don't have a Polish keyboard, this would be pretty bad, as you would now be typing with a Polish layout.

sudo dpkg-reconfigure keyboard-configuration

At any time you can check whether your terminal is working correctly with Polish:

echo "Zażółć gęślą jaźń"

I would suggest a resource such as this to research exactly what you need to do: https://www.baeldung.com/linux/terminal-locales-check-character-encoding
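As a complement to the echo test, the encodings a process actually sees can be checked from Python. This is a small sketch using only the standard library; the function name `encoding_report` is made up for illustration, and the values it prints depend entirely on the local setup.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python sees in the current terminal session."""
    return {
        # Encoding implied by the active locale (LANG / LC_ALL).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal, if available.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names and command-line arguments.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

If all three report UTF-8, the terminal side is probably fine and the conversion is more likely happening elsewhere.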

Thanks

waan1 commented 2 months ago

echo "Zażółć gęślą jaźń" worked fine, even before I tried to do anything.

I tried to follow the advice, but it still did not work, maybe because of the differences between Windows and Linux (Manjaro). I'm going to install the beta version and check whether it works.

waan1 commented 2 months ago

Installed V2 BETA. Added Polish characters to the filter:

[^a-zA-Z0-9\s.,;:!?-\'"$\u0400-\u04FF\u00C0-\u017F\u0150\u0151\u0170\u0171\u011E\u011F\u0130\u0131\u0900-\u097F\u2018\u2019\u201C\u201D\u3001\u3002\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F\uFF01\uFF0c\uFF1A\uFF1B\uFF1F\u0104\u0105\u0106\u0107\u0118\u0119\u0141\u0142\u0143\u0144\u00D3\u00F3\u015A\u015B\u0179\u017A\u017B\u017C]

Now it stopped swallowing Polish characters.

Unfortunately, the API with XTTS does not work:

[GEN] Error during audio generation: penalty has to be a strictly positive float, but is 10

Apparently it expects penalty = 1.1 and the server passes penalty = 10 instead.

API works from Python correctly with piper.
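For reference, a request like the curl call in the original report can be built from Python so that the text is percent-encoded as UTF-8 regardless of the terminal's locale. This is a sketch only: the URL and form fields are copied from the curl command in this issue, the function name `build_tts_request` is hypothetical, and whether this avoids the garbling on a given setup was not tested in this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Address and fields taken from the curl command in this issue.
API_URL = "http://127.0.0.1:7851/api/tts-generate"


def build_tts_request(text: str, language: str = "pl") -> Request:
    """Build (but do not send) a form-encoded POST for the TTS endpoint."""
    fields = {
        "text_input": text,
        "text_filtering": "none",
        "character_voice_gen": "female_01.wav",
        "narrator_enabled": "true",
        "narrator_voice_gen": "male_01.wav",
        "text_not_inside": "character",
        "language": language,
        "output_file_name": "myoutputfile",
        "output_file_timestamp": "true",
        "autoplay": "false",
        "autoplay_volume": "0.8",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default,
    # so the request body itself is plain ASCII.
    body = urlencode(fields).encode("ascii")
    return Request(API_URL, data=body, method="POST")


request = build_tts_request("Idę do sklepu.")
print(request.data.decode("ascii"))
# To actually send it: urllib.request.urlopen(request)
```

Note that curl's -d option sends the text as raw bytes, whereas this sketch percent-encodes it first.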

SELECTED A DIFFERENT MODEL: apitts - xttsv2_2.0.3 model AND API STARTED TO WORK FINE.

erew123 commented 2 months ago

Hi @waan1

Coqui's implementation for their models specifies 10.0 for the 2.0.2 model and 5.0 for the 2.0.3 model. Their own script/setup loads these values from the config.json that is provided with the model, e.g.

2.0.3 model

https://huggingface.co/coqui/XTTS-v2/blob/main/config.json

2.0.2 model

https://huggingface.co/coqui/XTTS-v2/blob/v2.0.2/config.json

At the bottom of those files, you can find the default model settings as specified by Coqui:

image

Within Coqui's own demo scripts you can see that these values are loaded from the config.json supplied with the model and used in their own demo code. Although their XTTS documentation gives a lower value, these are the actual values suggested/implemented by the model, and their documentation is incorrect.

https://github.com/coqui-ai/TTS/blob/dev/TTS/demos/xtts_ft_demo/xtts_demo.py

image

You will find that if you use a repetition penalty of 1.0, you will get very broken-up or strange-sounding generation, e.g. sounds that aren't even speech.

The message you posted in the discussion, "[GEN] Error during audio generation: penalty has to be a strictly positive float, but is 10", suggests there is something else strange going on that is possibly character set based, or some hidden character that we cannot see.
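One further possibility, offered as an assumption and not confirmed in this thread: the wording of the error matches a validation in the Hugging Face transformers repetition-penalty logits processor, which appears to require the value to be a Python float, not merely a positive number. If that is where the error comes from, an integer 10 would be rejected while 10.0 would pass. The sketch below uses a hypothetical stand-in function, `check_penalty`, to show the difference.

```python
def check_penalty(penalty):
    """Hypothetical stand-in for a strict 'positive float' validation."""
    if not isinstance(penalty, float) or not penalty > 0:
        raise ValueError(
            f"penalty has to be a strictly positive float, but is {penalty}"
        )
    return penalty


for value in (10.0, 10):
    try:
        check_penalty(value)
        print(f"{value!r}: accepted")
    except ValueError as err:
        print(f"{value!r}: rejected ({err})")
```

A float would be formatted as 10.0 in such a message, so the "but is 10" in the reported error is at least consistent with the value arriving as an integer or a string.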

As mentioned, you can specify the repetition penalty as part of the API call -d "repetition_penalty=10"

image

You can also turn on debugging to get output of what AllTalk sees from the API request arriving and then what it will hand over to Coqui's script post validation:

image

image

Re "Included Polish characters to the filter": these should already be covered by the base filter through the range \u00C0-\u017F. It was a bit hard for me to decipher what you had added, so I compared the original settings with the ones you sent by using an AI. The response is as follows:

image
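The claim that the range \u00C0-\u017F already covers the Polish letters can be checked directly. A minimal sketch, assuming the standard set of 18 Polish diacritic letters; the names `POLISH_LETTERS` and `outside_range` are illustrative only:

```python
# The nine Polish diacritic letters, upper and lower case.
POLISH_LETTERS = "ĄąĆćĘęŁłŃńÓóŚśŹźŻż"


def outside_range(chars: str, low: int = 0x00C0, high: int = 0x017F) -> list:
    """Return the characters whose code points fall outside [low, high]."""
    return [c for c in chars if not low <= ord(c) <= high]


for letter in POLISH_LETTERS:
    print(f"{letter} U+{ord(letter):04X}")

print(outside_range(POLISH_LETTERS))  # → []
```

An empty list means every letter is already matched by that range, so adding the individual code points to the filter should not change which characters it keeps.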

I still believe that somehow, your Linux setup is changing/altering something in the character code set. How and why, I don't know exactly, but this is the only reasonable explanation I can think of at this time. I will think on it and try a few things to see if I can figure how things could be getting changed.

Thanks

erew123 commented 2 months ago

Hi @waan1

The only thing I've been able to come up with is that your system is possibly specifying a comma as the decimal separator, so your system is sending 10,0 instead of 10.0.

Where the repetition penalty isn't specified in the API request, it is pulled directly from AllTalk's own configuration file for XTTS: https://github.com/erew123/alltalk_tts/blob/alltalkbeta/system/tts_engines/xtts/model_settings.json

I've applied code to ensure that, in cases where it's not specified and is pulled automatically, any commas are replaced with periods: https://github.com/erew123/alltalk_tts/commit/a8ec5e19ce7c9937f7639a322fae8f5bbd9079b9
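The kind of normalization described here can be sketched in a few lines. This is an illustration only, not the code from the linked commit, and `parse_penalty` is a hypothetical helper name:

```python
def parse_penalty(raw) -> float:
    """Accept '10.0', '10,0', 10 or 10.0 and always return a float."""
    # Replace a decimal comma with a period before converting, so that a
    # locale-formatted value such as '10,0' is read as 10.0.
    return float(str(raw).strip().replace(",", "."))


for raw in ("10.0", "10,0", 10, " 5,0 "):
    print(repr(raw), "->", parse_penalty(raw))
```

Converting through float() also means an integer value such as 10 comes out as 10.0.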

You're welcome to update and see if this resolves your curl request issue: https://github.com/erew123/alltalk_tts/tree/alltalkbeta?tab=readme-ov-file#-updating

Thanks

erew123 commented 2 months ago

@waan1 Also, as mentioned, please check the debug setting to see what the repetition penalty shows as being sent.

Thanks