MycroftAI / mimic3-voices

Voice models for Mimic 3 text to speech system
Creative Commons Attribution Share Alike 4.0 International

Not working for es_ES voice #6

Closed atd closed 1 year ago

atd commented 1 year ago

Describe the bug

I am trying to configure mimic3 with an es_ES voice on a Mark II, without success. I cloned this repository (with git lfs) and copied voices/es_ES to /home/mycroft/.local/share/mycroft/mimic3/voices/
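One thing worth ruling out after copying voices from a git lfs clone is that the model files are real binaries and not LFS pointer stubs. The sketch below only relies on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`. The helper name `find_lfs_pointers` is made up for illustration, and the assumption that the voice key `es_ES/carlfm_low` maps to a subdirectory of the voices folder is mine, not something the plugin documentation confirms here.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(voice_directory):
    """Return files under voice_directory that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(Path(voice_directory).rglob("*")):
        if path.is_file():
            with path.open("rb") as handle:
                if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                    pointers.append(path)
    return pointers


if __name__ == "__main__":
    # Assumed layout: <voices folder>/<voice key>; adjust to your setup.
    voice = Path("/home/mycroft/.local/share/mycroft/mimic3/voices/es_ES/carlfm_low")
    if not voice.is_dir():
        print("Voice directory not found:", voice)
    else:
        for stub in find_lfs_pointers(voice):
            print("Still an LFS pointer, not a model file:", stub)
```

If this prints any pointer files, running `git lfs pull` in the clone and copying the voice again would be the next thing to try.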

I also changed .config/mycroft/mycroft.conf with

  "tts": {
    "module": "mimic3_tts_plug",
    "mimic3_tts_plug": {
      "voice": "es_ES/carlfm_low",
      "preloaded_cache": "/opt/mycroft/preloaded_cache/Mimic3"
    }
  }

This is the log with the original en_UK voice:

Jan 28 17:29:16 localhost.localdomain python[11199]: DEBUG:mimic3_tts.tts:phonemes=[['s', 'ˈʌ', 'n'], ['l', 'ˈɑ', 's'], ['l', 'ˈɑ', 's'], ['d', 'ˈi', 'ʃ', 'ə', 'k', 'oʊ'], ['v', 'ˈi', 'n', 't', 'ɪ', 'n', 'ˈu', 'v'], ['‖']], ids=[1, 0, 23, 0, 5, 0, 44, 0, 20, 0, 4, 0, 18, 0, 5, 0, 33, 0, 23, 0, 4, 0, 18, 0, 5, 0, 33, 0, 23, 0, 4, 0, 10, 0, 5, 0, 15, 0, 42, 0, 36, 0, 17, 0, 21, 0, 4, 0, 27, 0, 5, 0, 15, 0, 20, 0, 24, 0, 40, 0, 20, 0, 5, 0, 26, 0, 27, 0, 4, 0, 3, 0, 3, 0, 4, 0, 2]
Jan 28 17:29:16 localhost.localdomain python[11199]: DEBUG:mimic3_tts.voice:TTS settings: speaker-id=0, length-scale=1.0, noise-scale=0.667, noise-w=0.8
Jan 28 17:29:16 localhost.localdomain python[11772]: [2023-01-28 17:29:16.656] [mimic3] [debug] Copied 77 phoneme id(s) from r
Jan 28 17:29:16 localhost.localdomain python[11772]: [2023-01-28 17:29:16.656] [mimic3] [debug] Request phonemes or ids are already present
Jan 28 17:29:16 localhost.localdomain python[11772]: [2023-01-28 17:29:16.656] [mimic3] [debug] Phoneme ids are already presen
Jan 28 17:29:16 localhost.localdomain python[11772]: [2023-01-28 17:29:16.656] [mimic3] [debug] Synthesizing audio with 77 phoneme id(s)
Jan 28 17:29:16 localhost.localdomain python[11772]: [2023-01-28 17:29:16.656] [mimic3] [debug] Allocating tensors
Jan 28 17:29:16 localhost.localdomain python[11772]: [2023-01-28 17:29:16.656] [mimic3] [debug] Running inference
Jan 28 17:29:18 localhost.localdomain python[11772]: [2023-01-28 17:29:18.417] [mimic3] [debug] Inference complete
Jan 28 17:29:18 localhost.localdomain python[11772]: [2023-01-28 17:29:18.417] [mimic3] [debug] Writing WAV file: /tmp/tmpht_i
Jan 28 17:29:18 localhost.localdomain python[11772]: [2023-01-28 17:29:18.958] [mimic3] [debug] Cleaning up
Jan 28 17:29:18 localhost.localdomain python[11772]: [2023-01-28 17:29:18.958] [mimic3] [info] Real-time factor: 0.6117345965450164 (infer=1.761351749, audio=2.8792743764172335)
Jan 28 17:29:18 localhost.localdomain python[11772]: [2023-01-28 17:29:18.958] [mimic3] [info] Wrote /tmp/tmpht_ie02f*.wav
Jan 28 17:29:18 localhost.localdomain python[11199]: DEBUG:mimic3_tts.voice:RTF: 0.40078302642928665
Jan 28 17:29:18 localhost.localdomain python[11199]: DEBUG:audio:Submitted TTS chunk 1/1 for session 01da150a-2613-471d-b4e0-404c363230a0: Son las las dieciocho veintinueve
Jan 28 17:29:18 localhost.localdomain python[11199]: INFO:mycroft.util.log:Queued TTS chunk 1/1: file:///tmp/mycroft/cache/tts/mimic3_tts_plug/12d12a30f5fe5c86e943c3fcd13f3a89.wav (session=01da150a-2613-471d-b4e0-404c363230a0): Son las las dieciocho veintinueve

And this is the log with the es_ES voice:

Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:audio:Synthesizing: Ahora mismo son las las dieciocho cuar[44/1909$
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:gruut.text_processor:No custom settings for language es_ES (es-es). Creating default settings.
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:mycroft.util.log:Started TTS session 792fa7da-8f68-48ca-b514-3d5ab5
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:gruut.utils:(es-es) couldn't import module gruut_lang_es
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:gruut.utils:(es-es) searching [PosixPath('/home/mycroft/.config/gruut'), PosixPath('/opt/mycroft-dinkum/.venv/lib/python3.8/site-packages/data')] for language file(s)
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:mimic3_tts.tts:phonemes=[['‖']], ids=[1, 0, 3, 0, 3, 0, 4, 0, 2]
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:mimic3_tts.voice:TTS settings: speaker-id=0, length-scale=1.0, noise-scale=0.667, noise-w=0.8
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.431] [mimic3] [debug] Copied 9 phoneme id(s) from re
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.431] [mimic3] [debug] Request phonemes or ids are already present
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.431] [mimic3] [debug] Phoneme ids are already presen
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.431] [mimic3] [debug] Synthesizing audio with 9 phoneme id(s)
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.431] [mimic3] [debug] Allocating tensors
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.431] [mimic3] [debug] Running inference
Jan 28 17:46:04 localhost.localdomain python[20851]: DEBUG:mycroft.util.log:Audio finished: 792fa7da-8f68-48ca-b514-3d5ab535fd
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.722] [mimic3] [debug] Inference complete
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.722] [mimic3] [debug] Writing WAV file: /tmp/tmpgt12
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.781] [mimic3] [debug] Cleaning up
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.781] [mimic3] [info] Real-time factor: 0.8949141602050782 (infer=0.290918127, audio=0.3250793650793651)
Jan 28 17:46:04 localhost.localdomain python[21421]: [2023-01-28 17:46:04.781] [mimic3] [info] Wrote /tmp/tmpgt12jn6y*.wav
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:mimic3_tts.voice:RTF: 0.546490836037794
Jan 28 17:46:04 localhost.localdomain python[20848]: DEBUG:audio:Submitted TTS chunk 1/1 for session 792fa7da-8f68-48ca-b514-3d5ab535fde7: Ahora mismo son las las dieciocho cuarenta y seis
Jan 28 17:46:04 localhost.localdomain python[20848]: INFO:mycroft.util.log:Queued TTS chunk 1/1: file:///tmp/mycroft/cache/tts/mimic3_tts_plug/afc83eb3ed54403bed497929e585eb2d.wav (session=792fa7da-8f68-48ca-b514-3d5ab535fde7): Ahora mismo son las las dieciocho cuarenta y seis

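It seems the phonemes are not being generated correctly: with the es_ES voice the log shows only phonemes=[['‖']] (9 phoneme ids), whereas the en_UK run produced phonemes for every word (77 ids).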
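To spot this symptom without reading the log by eye, the `phonemes=[...]` part of a `DEBUG:mimic3_tts.tts` line can be parsed and checked for anything other than break markers. This is a rough sketch, not part of mimic3: the function names are made up, `‖` is the marker seen in the logs above, and treating `|` as a second break marker is my assumption. It expects an unwrapped log line.

```python
import ast
import re

# "‖" appears in the logs above; "|" as a minor break is an assumption.
BREAK_SYMBOLS = {"‖", "|"}


def extract_phonemes(log_line):
    """Return the phoneme list from a mimic3_tts.tts debug line, or None."""
    match = re.search(r"phonemes=(\[.*\]), ids=", log_line)
    if match is None:
        return None
    # The logged value is a Python literal (list of lists of strings).
    return ast.literal_eval(match.group(1))


def only_breaks(phonemes):
    """True when no word produced any real phoneme, only break markers."""
    return all(p in BREAK_SYMBOLS for word in phonemes for p in word)


if __name__ == "__main__":
    line = "DEBUG:mimic3_tts.tts:phonemes=[['‖']], ids=[1, 0, 3, 0, 3, 0, 4, 0, 2]"
    phonemes = extract_phonemes(line)
    if phonemes is not None and only_breaks(phonemes):
        print("No real phonemes were generated; check the gruut language data.")
```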

Expected behavior

I should hear the audio spoken in the Spanish voice. Instead, I hear nothing.


atd commented 1 year ago

Running pip install mycroft-plugin-tts-mimic3[en,es] installed the required gruut language package.
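For anyone hitting the same thing, a quick check from the Python environment that runs the plugin is whether the gruut language modules can be found at all. A minimal sketch: `gruut_lang_es` is the module name from the debug log above, `gruut_lang_en` is assumed by analogy, and `missing_gruut_languages` is a made-up helper name.

```python
import importlib.util


def missing_gruut_languages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # gruut_lang_es comes from the log; gruut_lang_en is assumed by analogy.
    wanted = ["gruut_lang_es", "gruut_lang_en"]
    missing = missing_gruut_languages(wanted)
    if missing:
        print("Missing gruut language modules:", ", ".join(missing))
    else:
        print("All requested gruut language modules can be found.")
```

Run it with the same interpreter as the plugin (in the log above that looks like the virtualenv under /opt/mycroft-dinkum/.venv), otherwise the result says nothing about what Mycroft actually sees.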
