Hi @luca2125
I am assuming this is with the XTTS model. I don't know how good it is or isn't with languages other than English, though the model will always do somewhat what it wants, being AI. It samples the audio and adds its own expressiveness, which can sometimes speed up or slow down what it is saying, depending on how it interprets the way it thinks something should be spoken.
I can give you a few suggestions to try. First off, I would suggest moving to AllTalk v2 (if you aren't already) and using the XTTS 2.0.3 model with it. 2.0.3 is slightly better trained than the 2.0.2 model, so it will have better control over some speech and output.
Very soon there will also be another update within v2 where it will be able to use multiple voice samples simultaneously, which will improve the quality of the generated output and help it match the sample voice more closely. I'm just waiting on approving some code for that. https://github.com/erew123/alltalk_tts/pull/255
Finally, the other route is to finetune a model. This can be used to fully train a model on a voice and should result in a very close match to the original, though, of course, the model will still make some interpretation regarding speed.
I would just like to be clear that the AI models are made by Coqui (https://docs.coqui.ai/en/latest/ https://github.com/coqui-ai/TTS), not by myself, and I have no control over how they work.
Thanks
Thank you for your response: I have just tried AllTalk v2, but when I launch "start_alltalk.bat" it doesn't run.
Apart from this, I have done other tests on the V1 version, and sometimes the output files have 5-7 seconds of sound added at the end that does not exist in the reference file.
Hi @luca2125
I've had no problems with v2 running; you would have to run the diagnostics and provide the log file to help me understand what your issue could be there. Also, run the start_alltalk.bat file from a command prompt and don't just click on the file in the explorer window, or you won't see any errors as the window would close too quickly.
As for the output file, what AllTalk returns is what the AI model generated. If there is additional silence in your audio sample file, that may have an effect there.
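If you want to double-check your sample, a minimal sketch like the one below (assuming the soundfile and numpy packages are available in your Python environment; the function name and -40 dB threshold are just placeholders) will report how much near-silence sits at the start and end of the file:

```python
# Rough check for leading/trailing silence in a reference sample.
# Assumes the soundfile and numpy packages are installed.
import numpy as np
import soundfile as sf

def silence_at_edges(path, threshold_db=-40.0, frame_ms=20):
    """Print how much near-silence sits at the start and end of a WAV file."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:                      # mix stereo down to mono
        audio = audio.mean(axis=1)
    frame = max(1, int(sr * frame_ms / 1000))
    n_frames = len(audio) // frame
    rms = np.array([
        np.sqrt(np.mean(audio[i * frame:(i + 1) * frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    db = 20 * np.log10(rms + 1e-12)
    loud = np.where(db > threshold_db)[0]   # frames above the silence threshold
    if len(loud) == 0:
        print("File appears to be silent throughout.")
        return
    print(f"Leading silence:  {loud[0] * frame / sr:.2f} s")
    print(f"Trailing silence: {(n_frames - 1 - loud[-1]) * frame / sr:.2f} s")

silence_at_edges("your_reference_sample.wav")
```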
Thanks
Hi erew123,
Here is what happens with v2:
```
C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta>start_alltalk.bat
Traceback (most recent call last):
  File "C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta\script.py", line 15, in <module>

(C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta\alltalk_environment\env) C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta>
```
About the reference file: I have checked and there is no silence in any part of the file.
Best regards
Hi @luca2125
That sounds like you do not have the correct requirements installed. I assume you did go through the installation for v2. I'm assuming you have set it up to run standalone, through the atsetup.bat utility.
I would recommend deleting the alltalk_environment folder and running setup again. However, you can also start the AllTalk environment with start_environment.bat, then from the AllTalk folder move into the system folder, then the requirements folder, and run pip install -r requirements_standalone.txt to re-install the requirements.
Thanks
Hi erew123,
I have repeated the installation of V2 and checked the console more carefully.
During the installation I see this error in red:
```
Collecting git+https://github.com/huggingface/parler-tts.git (from -r system\requirements\requirements_standalone.txt (line 29))
  Cloning https://github.com/huggingface/parler-tts.git to c:\prova2\alltalk_tts-alltalkbeta\alltalk_environment\pip-req-build-tg5ajq4a
  ERROR: Error [WinError 2] Impossibile trovare il file specificato while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
```
(The Italian Windows error "Impossibile trovare il file specificato" means "The system cannot find the file specified".)
Unlike V2, V1 installed correctly.
Any suggestions?
Best regards
Do you have git installed? You can check at a command prompt by typing git --version and seeing if you get a version number.
The quick installation instructions have details for git here https://github.com/erew123/alltalk_tts/tree/alltalkbeta?tab=readme-ov-file#-quick-setup-text-generation-webui--standalone-installation
Tested, and it works well, better than V1.
A question:
Suppose that I have a subtitle file like this:

```
{QTtext} {font:Tahoma} {plain} {size:20} {timeScale:30} {width:160} {height:32} {timestamps:absolute} {language:0}
[00:00:00.00] Hold up here a moment. [00:00:01.03]
[00:00:01.03] Lieutenant. [00:00:01.18]
[00:00:02.00] Sometimes you need to wait for slower vehicles, [00:00:04.04]
[00:00:04.04] like a service truck. [00:00:05.05]
```
Is there a way to manually create a script that generates the audio files using the timestamps?
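Roughly, what I have in mind is something like the sketch below. I am assuming AllTalk's /api/tts-generate endpoint on the default port here, and the voice and field names are only placeholders, so they may need adjusting for a real install:

```python
# Sketch: parse the QTtext cues and send each line to AllTalk.
# The endpoint, port and form-field names below are assumptions based on the
# AllTalk API documentation and may need adjusting for your installation.
import re
import requests

ALLTALK_URL = "http://127.0.0.1:7851/api/tts-generate"   # assumed default port
TIMESCALE = 30  # from the {timeScale:30} header: the .NN part is a frame count
CUE = re.compile(r"\[(\d+):(\d+):(\d+)\.(\d+)\]\s*(.+?)\s*\[(\d+):(\d+):(\d+)\.(\d+)\]")

def to_seconds(h, m, s, f):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(f) / TIMESCALE

with open("subtitles.txt", encoding="utf-8") as fh:
    for line in fh:
        match = CUE.search(line)
        if not match:
            continue                                  # skip the {...} header lines
        start = to_seconds(*match.group(1, 2, 3, 4))
        end = to_seconds(*match.group(6, 7, 8, 9))
        text = match.group(5)
        out_name = f"cue_{start:08.2f}"
        resp = requests.post(ALLTALK_URL, data={
            "text_input": text,
            "character_voice_gen": "my_reference.wav",  # placeholder voice sample
            "language": "it",
            "output_file_name": out_name,
        })
        resp.raise_for_status()
        print(f"{start:7.2f}-{end:7.2f}s -> {out_name}: {text}")
```

Lining the generated clips up with the original timestamps would still need a second pass (padding or trimming each clip), since the model chooses its own speaking pace.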
Apart from this, I have run some other tests.
Here are the reference and the final result (sample): https://wetransfer.com/downloads/f8c9dd0ee069e3828d60ad68415a601220240707220325/c77f5f0fbf319e01a8e0e1031c55076b20240707220352/964660
Here is my feedback:
1) At 20-24 seconds of "mes1603 (sample result.wav" there is a noise that is not present in the reference file. I have tried cloning in English and sometimes it also adds this noise.
2) Sometimes the last word is cut off. To work around the problem I have inserted a comma (,) at the end.
3) When I insert a dot (.) at the end of the phrase, it adds too much time. To work around the problem I have removed the dot.
In your opinion, is there a chance these problems will be solved in future versions?
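For reference, this is roughly the pre-processing I now apply to the text as a workaround for points 2 and 3 (just a sketch of my own workaround, not something built into AllTalk):

```python
# Workaround for points 2 and 3: drop a trailing full stop (it added several
# seconds at the end for me) and end with a comma so the last word is not cut.
def prepare_text(text: str) -> str:
    text = text.strip()
    if text.endswith("."):
        text = text[:-1]          # point 3: trailing dot added too much time
    if not text.endswith(","):
        text = text + ","         # point 2: trailing comma keeps the last word
    return text

print(prepare_text("Sometimes you need to wait for slower vehicles."))
# -> Sometimes you need to wait for slower vehicles,
```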
Hi @luca2125
As I mentioned, I am not responsible for the AI models or the underlying TTS engines. If you want to discuss/request things for those, I would suggest speaking to the people here https://github.com/idiap/coqui-ai-TTS/
AllTalk is handing off the text to the Coqui TTS engine/AI model, and it is their handling of the text that dictates the outcome/result, not AllTalk.
There are no direct improvements I can make beyond what I already have and the ones I mentioned earlier: using multiple audio samples (which I will import soon) and finetuning the model to understand the specifics of a voice https://github.com/erew123/alltalk_tts/tree/main?tab=readme-ov-file#-finetuning-a-model
If neither of those two solutions work, please discuss it here https://github.com/idiap/coqui-ai-TTS/ where they may be able to make a change to the TTS engine.
Thanks
Hi erew123,
Thank you, I will do so.
Best Regards
Hi, I have tested a reference file in English (provided by me) and chosen Italian as the output language.
The result is good, but often the speed doesn't match the reference WAV file.
I have used another reference file with a longer duration.
The result seems worse.
May I send you the reference WAV file, so you can check it or give me some suggestions to work around this problem?
Thank you!