georgecsaszargit / tortoise_audio_book_creator

This is a fork of Tortoise TTS Fast that makes it easy to create audiobooks locally on your computer
GNU Affero General Public License v3.0

install start? #1

Open jjsmcneil1113 opened 7 months ago

jjsmcneil1113 commented 7 months ago

Sorry for the newbie question, but I believe I have followed your install instructions (I tried on both Ubuntu and Windows). How do I start your audiobook application? I tried running python scripts/start.py but got the error "line 392 sd.play(data, samplerate) TabError: inconsistent use of tabs and spaces in indentation"

Is there a YouTube video?

thanks
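For readers who hit the same TabError: Python raises it when a file's indentation mixes tabs and spaces inconsistently. The following is a minimal sketch for locating suspect lines, assuming the file is otherwise indented with spaces. The function name `find_tab_indented_lines` is made up for illustration and is not part of this repository.

```python
from typing import List


def find_tab_indented_lines(source: str) -> List[int]:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            flagged.append(number)
    return flagged


# Hypothetical sample: line 2 is indented with spaces, line 3 with a tab.
sample = "def f():\n    x = 1\n\treturn x\n"
print(find_tab_indented_lines(sample))
```

The standard library also ships a checker for this; running python -m tabnanny scripts/start.py reports ambiguous indentation without executing the script.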

georgecsaszargit commented 6 months ago

Hi,

What is your GPU? Do you have the same package versions as the ones I shared on GitHub?



jjsmcneil1113 commented 6 months ago

Hi,

My PC: RTX 4090, AMD 7950X. I cloned your repository last week; is that what you mean?

Is there a start.bat type of file?

sorry again for my newbie-ness :)

georgecsaszargit commented 6 months ago

Hi, I will try to help you on the weekend.

Have a nice day


georgecsaszargit commented 5 months ago

Hi,

I am sorry, but I was busy with work. I realized that my installation instructions don't work. I revised them but I am still testing them and I am also making a video. I believe I will have it done by next week. I will let you know.

Cheers


jjsmcneil1113 commented 5 months ago

Thanks much! Would you have any advice about tortoise TTS voice cloning artifacts, such as words being repeated or the ends of phrases being clipped off? I've paid attention to my dataset quality, ensuring there is no noise, music, or reverb. I have listened to the dataset audio segments to make sure the voice is not clipped, expanded the training dataset to 3 hours of voice, and trained for 500 epochs. All of this mildly improved but did not eliminate the problems when I run inference. Thanks so much!

jjsmcneil1113 commented 5 months ago

I also incorporated phrase break prediction models, which helped with long sentences, but again the improvement was insufficient.

georgecsaszargit commented 5 months ago

Hi,

New instructions are uploaded, with a video as well. I wouldn't waste time fine-tuning models or cloning voices. Watch the video and you will see that you can achieve the best results with just voice latent files. I addressed a lot of the issues you mentioned.

https://github.com/georgecsaszargit/tortoise_audio_book_creator/blob/master/
https://youtu.be/BCCMB0p4fC8?si=5pHqHb8nZCSa_ExO

Hope it helps


jjsmcneil1113 commented 5 months ago

WOW, THANKYOU MAN!

I myself have some work to finish first, but I can't wait to try it out.

I did watch your video. When you say voice latent file, does this solve the problem of words being repeated or ends of phrases being clipped?

THANKS AGAIN!!!

georgecsaszargit commented 5 months ago

No. A voice latent file just makes the speech style a lot more natural if you create it from hours of audio. The issue you are referring to is addressed by using my fork of tortoise. It has self-correcting rounds that check the generated speech and retry with a different seed until the problem is fixed or the maximum number of self-correcting rounds is reached. Basically, if a sentence is generated with a random seed and the speech has issues, the only way to fix it is to generate it again with a different seed and hope that the new speech is correct. If not, keep repeating until it is fixed. I set the retry maximum to 3 by default, but you can increase it.

If you have a lot of issues using your own fine-tuned model, you probably overtrained it. I don't see the need to train a voice, because with my settings it should be good enough. If it is important to sound exactly like the original, then you need to install RVC, train a voice for RVC, and run RVC on the generated file that you got from tortoise; it will then sound 100% like the original. I hope this makes sense.
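The retry idea described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the fork's actual code: `generate` and `looks_correct` are hypothetical callables standing in for the TTS call and the quality check, and audio is represented as plain bytes.

```python
import random
from typing import Callable, Optional, Tuple


def generate_with_retries(
    generate: Callable[[str, int], bytes],
    looks_correct: Callable[[str, bytes], bool],
    text: str,
    max_rounds: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[bytes, int, bool]:
    """Generate speech, retrying with a fresh seed while the check fails.

    Returns (audio, seed_used, passed). One initial attempt is made,
    followed by at most `max_rounds` self-correcting retries.
    """
    if max_rounds < 0:
        raise ValueError("max_rounds must be >= 0")
    rng = rng or random.Random()
    audio, seed = b"", 0
    for _ in range(max_rounds + 1):
        # Each attempt draws a new seed; nothing else about the input changes.
        seed = rng.randrange(2**31)
        audio = generate(text, seed)
        if looks_correct(text, audio):
            return audio, seed, True
    # Every attempt failed the check: hand back the last one, marked as failed.
    return audio, seed, False
```

A caller would pass its own TTS function and checker (for example, one that transcribes the audio and compares it with the input text) and decide what to do with lines that still fail after the last round.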


jjsmcneil1113 commented 5 months ago

Hi, Thanks so much again for publishing and even testing your setup.

I think I got most of the installation you suggested right. I was able to fire up your audiobook app and successfully generated a sample audio output from the "random" voice. When I switched to another voice, I got the following error:

  File "/home/user/miniconda3/envs/tortoiseaudiobook/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/home/user/tortoise_audio_book_creator/scripts/app.py", line 872, in <module>
    main()
  File "/home/user/tortoise_audio_book_creator/scripts/app.py", line 867, in main
    start_process()
  File "/home/user/tortoise_audio_book_creator/scripts/app.py", line 784, in start_process
    filepaths = infer_on_texts(
  File "/home/user/tortoise_audio_book_creator/tortoise/inference.py", line 491, in infer_on_texts
    cust_generate2(text, max_self_correcting_rounds)
  File "/home/user/tortoise_audio_book_creator/tortoise/inference.py", line 407, in cust_generate2
    current_chunk = custom_generate(call_tts,text,my_seed,line_p,return_deterministic_state,voicefixer)
  File "/home/user/tortoise_audio_book_creator/tortoise/inference.py", line 370, in custom_generate
    current_chunk = run_and_save_tts(
  File "/home/user/tortoise_audio_book_creator/tortoise/inference.py", line 151, in run_and_save_tts
    gen, dbg = call_tts(text,newseed)
  File "/home/user/tortoise_audio_book_creator/scripts/app.py", line 669, in call_tts
    return tts.tts_with_preset(
  File "/home/user/tortoise_audio_book_creator/tortoise/api.py", line 536, in tts_with_preset
    return self.tts(text,**settings)
  File "/home/user/tortoise_audio_book_creator/tortoise/api.py", line 633, in tts
    ) = self.get_conditioning_latents(
  File "/home/user/tortoise_audio_book_creator/tortoise/api.py", line 399, in get_conditioning_latents
    auto_conds.append(format_conditioning(ls[0], device=self.device))
  File "/home/user/tortoise_audio_book_creator/tortoise/api.py", line 79, in format_conditioning

One thing: at the end of your instructions, you say to clone your model .pth files from Hugging Face with a git clone command, which seemed to create another "tortoise-audio-book-creator" subdirectory inside your "tortoise-audio-book-creator" repository directory of the same name. Is this correct? Your instructions then say to fire up the app with a streamlit run scripts/app.py command, but this initially failed because the scripts subdir was one level up from the "huggingface" subdir.

Where should I copy my own .pth voice models?

Hate to trouble you, but greatly appreciate any help!

jjsmcneil1113 commented 5 months ago

by the way, my hardware is RTX4090, AMD 7950

georgecsaszargit commented 5 months ago

Hi!

Sorry for the delay again. The issue is that it only works with the RTX 3090 for now. When I tried to install it on my 4090, I got the same error. I spent a lot of time trying to figure out what the problem is, without any luck until today. It seems I got it working, but I still need to fine-tune the steps for you. I will contact you shortly.

Cheers


georgecsaszargit commented 5 months ago

OK, I updated the installation steps. If you want to try to fix your existing installation, do this:

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    pip install openai-whisper==20231117

It will show some error messages about torchaudio conflicts but I didn't see any issues while generating speech.

Take care
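After running the two commands above, one way to confirm which versions actually ended up in the environment is to query package metadata from Python. This is a minimal sketch using only the standard library. The package names come from the commands in this thread, and the helper names `installed_version` and `report` are made up for illustration.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    wanted = ["torch", "torchvision", "torchaudio", "openai-whisper"]
    for name, version in report(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the same conda environment that the app uses; a torch version string ending in +cu118 would suggest the CUDA 11.8 wheel from the index URL above was picked up.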


jjsmcneil1113 commented 5 months ago

Thanks so much George! I really appreciate it. I will give it a shot.

I don't know if you work in the deep learning field, but I found some intriguing TTS work that may solve some of the issues inherent in discrete-token, autoregressive models such as tortoise TTS, which as we know are prone to error propagation and thus unstable speech output (unstable prosody, word skipping/repetition). Have you heard of the work from a Microsoft group in China, https://speechresearch.github.io/, who developed NaturalSpeech 2 and, more recently, NaturalSpeech 3? They use continuous vector representations (rather than discrete tokens) and non-autoregressive/diffusion generation. They have released only a portion of their code (the NaturalSpeech 3 neural speech codec, called FACodec). I don't know if you have any interest in trying a perhaps better TTS model?

Besides their Hugging Face space, https://huggingface.co/spaces/amphion/naturalspeech3_facodec, there is an attempt to implement NaturalSpeech 2 at https://github.com/lucidrains/naturalspeech2-pytorch

jjsmcneil1113 commented 5 months ago

All right, success! George, once again, I GREATLY APPRECIATE all your time making this available. I hate to pester you with more questions, but whenever you have time, here are a few on how to use it:

  1. I cloned some voices using Jarod Mica's tortoise TTS cloning repo. Can I use the fine-tuned cloned voice .pth files in your app? I tried, and it gave me a key error, but I'm not sure if I need to place them in a different directory. I was able to use my own voice conditioning latent files successfully when I placed the conditioning latent .pth file in the voice dir.
  2. You mentioned using RVC voice changing to get the audio even closer to the original speaker quality. Am I correct that I would need to do this after your app is done, using a separate RVC app?
  3. I tried entering an absolute path to a 6-second .wav file in "reference pitch file path" in order to self-correct output voice pitch errors. But when I tried this, every single line in the book showed pitch errors, including all 3 redos when it was trying to self-correct. I left options such as "pitch diff threshold" at their default values.
  4. Do you know of a way to get tortoise TTS to spell out the letters of an acronym rather than try to pronounce the acronym as a word?

THANKS!

georgecsaszargit commented 5 months ago

I am happy that it worked.

  1. Yes, you can use them. You just have to place the .pth file inside the tortoise_audio_book_creator/models/finetuned folder. (As I said earlier, on the 4090 it will show a warning for torchaudio, but it works for me. I just tested it.)
  2. Yes. You train RVC with the same dataset that you used to train your tortoise model. Once you have your generated file from tortoise in the results folder, you use it as the source for RVC and run it through. It will make the voice match 100%.
  3. The wav file should be directly inside the tortoise_audio_book_creator folder. You need to fine-tune the threshold settings for your voice. Start with a low setting and keep increasing it by 10 until it catches only the pitch issues. It is a trial-and-error kind of thing because each voice has a different tone. Then just remember which voice has which settings.
  4. The only way I know (what I do) is to run the text through ChatGPT 3.5 using their API, with specific instructions to prepare the text: change acronyms, etc. It costs just a couple of cents and is worth the trouble. Unfortunately, I haven't seen a local LLM that does a good job of this. The new Llama 3 model that just came out might be able to do it, because it is pretty powerful, but it is really unnecessary since ChatGPT 3.5 is cheap. If you want to try a local Llama 3 model, check out the llamafile GitHub project. It is amazingly simple and powerful!

Good luck Cheers
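For the acronym question specifically, a much simpler alternative to the LLM preprocessing described above is a regular-expression pass over the text before it reaches the TTS engine. This is a different technique from the one George uses, sketched under the assumption that writing the letters separated by spaces nudges the engine to read them individually (that has not been verified here). The function name and the `keep` parameter are hypothetical.

```python
import re
from typing import Iterable

# Runs of 2 to 6 capital letters bounded by word boundaries, e.g. "FBI".
_ACRONYM = re.compile(r"\b[A-Z]{2,6}\b")


def spell_out_acronyms(text: str, keep: Iterable[str] = ()) -> str:
    """Insert spaces between the letters of all-caps tokens.

    Tokens listed in `keep` (acronyms normally spoken as words, or
    ordinary words typed in capitals) are left untouched.
    """
    keep_set = set(keep)

    def expand(match):
        word = match.group(0)
        if word in keep_set:
            return word
        return " ".join(word)

    return _ACRONYM.sub(expand, text)


print(spell_out_acronyms("The FBI and NASA met.", keep={"NASA"}))
```

A rule this blunt will also split words written in capitals for emphasis, so the `keep` set (or a lowercase pass beforehand) matters; an LLM pass, as described above, can handle such context-dependent cases better.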


georgecsaszargit commented 5 months ago

It looks very promising and I would love to try it, but I could not find actual code that I can download and test. Let me know if you hear of any updates. I think I have tried every single TTS engine, and I personally think that tortoise is by far the best so far. I even feel that a properly set up model can be better than Eleven Labs, though less stable of course. I am always on the lookout for new developments and curious about what is to come in AI.

Thank you George


jjsmcneil1113 commented 5 months ago

Thanks again, George. I really like what you did in implementing the auto-correct features. I am finding, though, that I am unable to get rid of occasional wild variations in speech rate. There are times when a single word is slurred over five seconds. I am considering two different approaches to this problem: 1) writing a script to dynamically calculate speech rate on a word-by-word basis, using whisperx's ability to give word-by-word timings in its transcripts, and then comparing the speech rate variations to reference audio; 2) if the auto-correct trials fail, using a more stable TTS system like Coqui TTS to produce audio for the troublesome segments only. Tortoise TTS tends to have problems with shorter text like section headings or chapter titles in books, but it can still screw up on longer sentences too. I was planning to fork your repo and try this myself, but was wondering if you had already tried these.
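The first approach above (per-word speech rate from word timings) could look roughly like this. It is a sketch under assumptions: the input is taken to be a list of dicts with "word", "start", and "end" keys, which is the general shape of word-level alignment output, and the default threshold is an arbitrary placeholder rather than a value measured from reference audio.

```python
from typing import Dict, List


def flag_slow_words(
    words: List[Dict],
    max_seconds_per_char: float = 0.35,
) -> List[Dict]:
    """Return the words whose duration per character exceeds the threshold."""
    flagged = []
    for entry in words:
        text = str(entry.get("word", "")).strip()
        # Some aligners omit timings for tokens they cannot align; skip those.
        if not text or "start" not in entry or "end" not in entry:
            continue
        duration = entry["end"] - entry["start"]
        rate = duration / len(text)
        if rate > max_seconds_per_char:
            flagged.append(
                {"word": text, "start": entry["start"], "seconds_per_char": rate}
            )
    return flagged


# Hypothetical timings: the second word is stretched over five seconds.
timings = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 5.5},
]
print([w["word"] for w in flag_slow_words(timings)])
```

To compare against reference audio as proposed, the threshold could be derived from the same statistic computed over a clean recording of the speaker, for example a multiple of its median seconds-per-character.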

georgecsaszargit commented 5 months ago

Yes, I tried it. Let me explain why I stick with the basic autoregressive model instead of training my own, which I've attempted numerous times. No matter the duration of the training, I consistently encounter more issues during inference. Just as you mentioned, the anomalies increase. While the output increasingly resembles the original voice, the number of errors becomes overwhelming. When I considered the same alternative you suggested, I was using Coqui XTTS v2; it simply didn't deliver satisfactory results, and it was too different from the original generation. If these issues only happened sometimes, it might be manageable, but their constant occurrence completely ruined my experience. Additionally, when I developed a script to pinpoint and timestamp the problematic sections of the sentences, I faced challenges with inaccurate cuts. Initially I created my own script, then I switched to using https://github.com/linto-ai/whisper-timestamped.git, but I encountered the same issues. I honestly think, as sad as it is, that we just have to wait until a better open-source alternative becomes available. Let me know if you come up with a good solution.

thanks


jjsmcneil1113 commented 5 months ago

I am actually using a voice latent file that I obtain by clicking "(Re)compute voice latents" in Jarod Mica's AI voice cloning repo, https://github.com/JarodMica/ai-voice-cloning. I basically copy the "cond_latents_d1f79232.pth" file and place it in a subdir named after the voice in "tortoise_audio_book_creator\tortoise\voices". I am not using a trained voice or fine-tuned model. Do you have better luck obtaining the voice latent file another way? When using "cond_latents_d1f79232.pth", I still get clusters of words that are weirdly slurred over several seconds, as if severely inebriated, which the pitch diff and word/char diff auto-corrections do not catch. My source audio .wav files are good quality, 90 minutes in total, split into segments under 10 seconds each. Do you get this problem too, even after optimal auto-corrections? I found another TTS that I am thinking of trying, https://github.com/metavoiceio/metavoice-src. They are a company that seems to have open-sourced their TTS: https://ttsdemo.themetavoice.xyz/ THANKS!
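The copy step described above can be scripted. This is a small sketch using only the standard library: the directory layout (a per-voice subdirectory under tortoise/voices) follows the description in this comment, while the function name and the example voice name are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def install_voice_latents(latents_file, voices_dir, voice_name):
    """Copy a latents .pth file into voices_dir/voice_name/ and return its new path."""
    target_dir = Path(voices_dir) / voice_name
    # Create the per-voice subdirectory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(latents_file).name
    shutil.copy2(latents_file, target)
    return target


if __name__ == "__main__":
    # Self-contained demo in a throwaway directory; real paths would differ.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "cond_latents_d1f79232.pth"
        source.write_bytes(b"placeholder")
        voices = Path(tmp) / "tortoise" / "voices"
        print(install_voice_latents(source, voices, "my_voice"))
```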

jjsmcneil1113 commented 5 months ago

By the way, I am thinking of using RVC to fix cases where the output of another TTS (such as Coqui) is not close enough to the original voice. Did you try that too?

georgecsaszargit commented 5 months ago

Oh, I see. In that case, it depends on the latent file or voice. You are doing it correctly, based on what you told me. (I don't split up the voice files, though; I only do that for model training.) I have some voices that generate a lot of errors and some that don't. If you send me the voice samples, I could take a look. When the voice is good, I see the issues you described only very rarely. I have tried MetaVoice, but unfortunately it is not up to my standards.


georgecsaszargit commented 5 months ago

I haven't implemented this setup yet, but I can't imagine it working well, because Coqui lacks interesting and varied intonation, even though it produces pretty good quality results. RVC will make the voice sound closer to the original, but it won't add the changes in intonation that Tortoise can produce.


jjsmcneil1113 commented 4 months ago

Hey George, do you have a background in machine learning? I actually did a bachelor's in CS a long time ago, not in machine learning, but I think I could pick some of it up along the way. Do you want to collaborate on building an implementation of NaturalSpeech 2? https://speechresearch.github.io/naturalspeech2/ Some others have already started in this repo, https://github.com/lucidrains/naturalspeech2-pytorch , but it kind of fizzled.

georgecsaszargit commented 4 months ago

Hi,

Sorry for the delay. No, unfortunately, I don't have any background in machine learning. I wish I did. Also, I don't have the time for it, but thanks anyway.

Cheers

jjsmcneil1113 commented 4 months ago

Hi George. No problem. I spent some time trying to build auto-correction methods using the myprosody library on GitHub and had some initial success, but then I found out about ChatGPT's own TTS solution, which is light-years better than Tortoise TTS. I can't figure out, though, how much ChatGPT charges and whether it's feasible for an audiobook.

georgecsaszargit commented 3 months ago

Yes, I agree with you about the ChatGPT option. I would love to get my hands on that and test it out on a book.

Thanks

jjsmcneil1113 commented 3 months ago

https://platform.openai.com/docs/guides/text-to-speech

The quality is much better than Tortoise, but from what I could find, the cost is too much for me:

TTS Usage $15.00 / 1M characters

TTS HD Usage $30.00 / 1M characters

Please let me know if you find cheaper pricing for ChatGPT TTS.
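
To put those rates in audiobook terms, here is a rough back-of-the-envelope sketch. The two rates are the ones quoted above; the book length of 500,000 characters is purely an assumption for illustration, and `estimate_cost` is a made-up helper name.

```python
# Rates quoted above, in US dollars per 1,000,000 characters.
RATES_USD_PER_MILLION = {"tts": 15.00, "tts_hd": 30.00}


def estimate_cost(num_chars: int, rate_per_million: float) -> float:
    """Return the estimated cost in USD for synthesizing num_chars characters."""
    return num_chars / 1_000_000 * rate_per_million


if __name__ == "__main__":
    book_chars = 500_000  # hypothetical book length, not a measured value
    for name, rate in RATES_USD_PER_MILLION.items():
        print(f"{name}: ${estimate_cost(book_chars, rate):.2f}")
```

With that assumed length, this prints $7.50 for the standard voice and $15.00 for the HD voice per book, before counting any regenerated passages.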

georgecsaszargit commented 3 months ago

This is so weird. I literally just looked this information up an hour ago, and now you sent me an email about it. Weird coincidence :)) Yeah, this is pretty pricey. My maximum for the HD voice would be around 5-10 dollars per 1M characters. Since I only use TTS for audiobooks, that pricing won't cut it for me, but I guess others use it for other purposes.

Thanks for the info!
