Speed of voice doesn't match the reference file

erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd party software via JSON calls.

GNU Affero General Public License v3.0


Speed of voice doesn't match the reference file #265

Closed luca2125 closed 1 week ago

luca2125 commented 1 week ago

Hi, I tested a reference file in English (provided by me) and chose Italian as the output language.

The result is good, but the speed often doesn't match the reference WAV file.

I also tried another, longer reference file.

The result seems worse.

Could I send you the reference WAV file so you can check it, or could you give me some suggestions to work around this problem?

Thank you!

erew123 commented 1 week ago

Hi @luca2125

I am assuming this is with the XTTS model. I don't know how good it is or isn't with languages other than English, though the model will always kind of do what it wants, being AI. It's sampling the audio and adding its own expressiveness, which can sometimes speed up or slow down the speech depending on how it interprets the way something should be spoken.

I can give you a few suggestions to try. First off, I would suggest moving to AllTalk v2 (if you aren't already) and using the XTTS 2.0.3 model with it. 2.0.3 is slightly better trained than the 2.0.2 model, so it will have better control over some speech and output.

Very soon there will also be another update within v2 where it will be able to use multiple voice samples simultaneously, which will improve the quality of the generated output, helping it match the sample voice more closely. I'm just waiting on approving some code for that. https://github.com/erew123/alltalk_tts/pull/255

Finally, the other route is to finetune a model. This can be used to fully train a model on a voice and should result in a very close match to the original, though of course, there will be some interpretation made by the model on speed.

I would just like to be clear that the AI models are made by Coqui and not myself. https://docs.coqui.ai/en/latest/ https://github.com/coqui-ai/TTS and I have no control over how they work.

Thanks

luca2125 commented 1 week ago

Thank you for your response. I have just tried AllTalk v2, but when I run "start_alltalk.bat" it doesn't start.

Apart from this, I have done other tests on v1, and sometimes the output file has 5 to 7 seconds of sound added at the end that does not exist in the reference file.

erew123 commented 1 week ago

Hi @luca2125

I've had no problems with v2 running. You would have to run the diagnostics and provide the log file to help me understand what your issue could be there. Also, run the start_alltalk.bat file from a command prompt and don't just click on the file in the Explorer window, or you won't see any errors, as the window would close too quickly.

As for the output file, what AllTalk returns is what the AI model generated. If there is additional silence in your audio sample file, that may have an effect there.

Thanks
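One way to check the point about silence in the sample file is to measure how much near-silent audio sits at the start and end of the reference WAV. The following is only a minimal sketch using the Python standard library: the function name `edge_silence_seconds` and the amplitude `threshold` are illustrative assumptions, it handles 16-bit PCM WAV files only, and it is not part of AllTalk.

```python
import array
import sys
import wave


def edge_silence_seconds(path, threshold=500):
    """Return (leading, trailing) near-silence in seconds for a 16-bit PCM WAV.

    `threshold` is an assumed amplitude cutoff (out of 32767), not a standard value.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    # Collapse interleaved channels to one peak amplitude per frame.
    peaks = [
        max(abs(s) for s in samples[i:i + channels])
        for i in range(0, len(samples), channels)
    ]
    loud = [i for i, peak in enumerate(peaks) if peak > threshold]
    if not loud:
        # The whole file is below the threshold.
        total = len(peaks) / rate
        return total, total

    leading = loud[0] / rate
    trailing = (len(peaks) - 1 - loud[-1]) / rate
    return leading, trailing
```

If either value comes back larger than a fraction of a second, trimming the sample in an audio editor before using it as a reference may be worth trying.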

luca2125 commented 1 week ago

Hi erew123,

Here is what happens with v2:

C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta>start_alltalk.bat
Traceback (most recent call last):
  File "C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta\script.py", line 15, in <module>
    import soundfile as sf
ModuleNotFoundError: No module named 'soundfile'

(C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta\alltalk_environment\env) C:\prova2\alltalk_tts-alltalkbeta2\alltalk_tts-alltalkbeta>

About the reference file: I have checked, and there is no silence in any part of the file.

Best regards

erew123 commented 1 week ago

Hi @luca2125

That sounds like you do not have the correct requirements installed. I assume you did go through the installation for v2. I'm assuming you have set it up to run as standalone, through the atsetup.bat utility.

I would recommend deleting the alltalk environment and running setup again. However, you can also start the alltalk environment with start_environment.bat and, from the alltalk folder, move into the system folder, then the requirements folder, and run pip install -r requirements_standalone.txt to reinstall the requirements.

Thanks
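Before or after reinstalling, it can help to confirm from inside the activated environment whether the module named in the traceback is actually importable. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the module list contains only `soundfile` because that is the one the error above reports.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "soundfile" is the module from the ModuleNotFoundError; extend the list as needed.
    missing = missing_modules(["soundfile"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Running it with the Python interpreter of the AllTalk environment (after start_environment.bat) matters here, since a system-wide Python may report a different result.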

luca2125 commented 1 week ago

Hi erew123,

I have repeated the installation of v2 and checked the console more carefully.

During the installation I saw this error in red:

Collecting git+https://github.com/huggingface/parler-tts.git (from -r system\requirements\requirements_standalone.txt (line 29))
  Cloning https://github.com/huggingface/parler-tts.git to c:\prova2\alltalk_tts-alltalkbeta\alltalk_environment\pip-req-build-tg5ajq4a
  ERROR: Error [WinError 2] Impossibile trovare il file specificato (The system cannot find the file specified) while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

Unlike v2, v1 installed correctly.

Any suggestions?

Best regards

erew123 commented 1 week ago

Do you have git installed? You can check at a command prompt by typing git --version and seeing if you get a version number.

The quick installation instructions have details for git here https://github.com/erew123/alltalk_tts/tree/alltalkbeta?tab=readme-ov-file#-quick-setup-text-generation-webui--standalone-installation
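The same check can be scripted, which is handy when the installer runs in an environment whose PATH differs from your normal command prompt. A minimal sketch with the standard library follows; `git_version` is a hypothetical helper name.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git is not on PATH."""
    git = shutil.which("git")
    if git is None:
        return None
    result = subprocess.run(
        [git, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = git_version()
    print(version if version else "git was not found on PATH")
```

A `None` result corresponds to the "Cannot find command 'git'" error that pip printed during installation.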

luca2125 commented 1 week ago

Tested, and it works well, better than v1.

A question:

Suppose that I have a subtitle file like this:

{QTtext} {font:Tahoma} {plain} {size:20} {timeScale:30} {width:160} {height:32} {timestamps:absolute} {language:0}
[00:00:00.00] Hold up here a moment. [00:00:01.03]
[00:00:01.03] Lieutenant. [00:00:01.18]
[00:00:02.00] Sometimes you need to wait for slower vehicles, [00:00:04.04]
[00:00:04.04] like a service truck. [00:00:05.05]

Is there a way to manually create a script that generates audio files using the timestamps?
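This question is not answered in the thread. As a possible starting point, the subtitle text can be parsed into (start, end, text) cues, which a separate script could then send to the TTS engine one at a time and place at the matching offsets. The sketch below covers only the parsing step. It assumes that the digits after the dot count units of 1/timeScale of a second (so `.18` with `{timeScale:30}` would be 18/30 s), which should be verified against the tool that produced the file. `parse_qttext` and `SAMPLE` are hypothetical names, and no AllTalk API call is shown.

```python
import re

TIMESTAMP = re.compile(r"\[(\d+):(\d+):(\d+)\.(\d+)\]")
TIME_SCALE = re.compile(r"\{timeScale:(\d+)\}")

SAMPLE = """{QTtext} {font:Tahoma} {plain} {size:20} {timeScale:30} {width:160} {height:32} {timestamps:absolute} {language:0}
[00:00:00.00] Hold up here a moment. [00:00:01.03]
[00:00:01.03] Lieutenant. [00:00:01.18]
[00:00:02.00] Sometimes you need to wait for slower vehicles, [00:00:04.04]
[00:00:04.04] like a service truck. [00:00:05.05]
"""


def parse_qttext(source):
    """Return a list of (start_seconds, end_seconds, text) cues."""
    scale_match = TIME_SCALE.search(source)
    if scale_match is None:
        raise ValueError("no {timeScale:N} header found")
    # Assumption: the field after the dot is measured in 1/timeScale seconds.
    scale = int(scale_match.group(1))

    def to_seconds(match):
        hours, minutes, seconds, ticks = (int(g) for g in match.groups())
        return hours * 3600 + minutes * 60 + seconds + ticks / scale

    stamps = list(TIMESTAMP.finditer(source))
    cues = []
    # Text between two consecutive timestamps is a cue; empty gaps are skipped.
    for start, end in zip(stamps, stamps[1:]):
        text = " ".join(source[start.end():end.start()].split())
        if text:
            cues.append((to_seconds(start), to_seconds(end), text))
    return cues


if __name__ == "__main__":
    for start, end, text in parse_qttext(SAMPLE):
        print(f"{start:7.3f} -> {end:7.3f}  {text}")
```

Because the generated speech will rarely have exactly the duration of the subtitle slot, a script built on this would still need to decide what to do when a clip runs longer or shorter than its cue.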

luca2125 commented 1 week ago

Apart from this, I have run some other tests.

Here are the reference and the final result (sample): https://wetransfer.com/downloads/f8c9dd0ee069e3828d60ad68415a601220240707220325/c77f5f0fbf319e01a8e0e1031c55076b20240707220352/964660

Here is my feedback:

1) At 20-24 seconds in "mes1603 (sample result.wav" there is a noise that is not present in the reference file. I also tried cloning in English, and it sometimes adds this noise too.
2) Sometimes the last word is cut off. To work around this, I inserted a comma (,).
3) When I put a full stop (.) at the end of the phrase, too much time is added. To work around this, I removed the full stop.

In your opinion, is there a chance these problems will be solved in future versions?
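The two punctuation workarounds reported above (drop the trailing full stop, end the phrase with a comma) can be applied to each line of text before it is sent for generation. This is only a sketch of that user-reported workaround, not a fix confirmed by the maintainer or by Coqui; `soften_ending` is a hypothetical helper name, and whether it helps will depend on the model and the voice.

```python
def soften_ending(text):
    """Remove trailing full stops and end the phrase with a comma instead.

    Phrases that already end in other punctuation (?, !, ,) are left as they are.
    """
    trimmed = text.rstrip().rstrip(".").rstrip()
    if trimmed and trimmed[-1].isalnum():
        trimmed += ","
    return trimmed


if __name__ == "__main__":
    for phrase in ["like a service truck.", "Lieutenant", "Really?"]:
        print(repr(soften_ending(phrase)))
```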

erew123 commented 1 week ago

Hi @luca2125

As I mentioned, I am not responsible for the AI models or the underlying TTS engines. If you want to discuss/request things for those, I would suggest speaking to the people here https://github.com/idiap/coqui-ai-TTS/

AllTalk is handing off the text to the Coqui TTS engine/AI model, and it's their handling of the text that dictates the outcome/result, not AllTalk.

There are no direct improvements I can make beyond what I already have and the ones I have mentioned earlier: using multiple audio samples (which I will import soon) and finetuning the model to understand the specifics of a voice https://github.com/erew123/alltalk_tts/tree/main?tab=readme-ov-file#-finetuning-a-model

If neither of those two solutions works, please discuss it here https://github.com/idiap/coqui-ai-TTS/ where they may be able to make a change to the TTS engine.

Thanks

luca2125 commented 1 week ago

Hi erew123,

Thank you, I will do that.

Best Regards