DrewThomasson / ebook2audiobook

Generates an audiobook with chapters and ebook metadata using Calibre and Coqui TTS's XTTS model, with optional voice cloning and support for multiple languages
MIT License

Chinese text conversion takes a long time and has character limitations, resulting in audio being truncated #18

Closed WUYIN66 closed 1 month ago

WUYIN66 commented 1 month ago

A 6000-character text file takes over an hour to convert to audio.

image It always gets stuck at 30% progress. Is there any good solution? Or was my operation incorrect?

DrewThomasson commented 1 month ago

That loading bar is extremely basic, because it's split into only 3 steps.

For an accurate view of the progress, I would look at your terminal.

DrewThomasson commented 1 month ago

ALSO YES!

I know about the truncated audio and character limitations for Chinese text. I've seen that error before, but I have never been able to apply a fix because I don't know which symbols in Chinese text are used for pauses in speaking.

If you can supply me the set of characters that indicate when the speaker pauses, I will gladly apply a fix.

DrewThomasson commented 1 month ago

Also, the program will always run very slowly on CPU only :(

The only known way to speed up audio generation would be to use an NVIDIA CUDA-capable GPU with >= 4 GB of VRAM.
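
For reference, a generic PyTorch check like the following is how a script can decide between CUDA and CPU (this is a sketch, not the exact logic in app.py):

```python
import torch

# Generic PyTorch check: use CUDA when an NVIDIA GPU with enough VRAM is present,
# otherwise fall back to CPU (the exact threshold and handling in app.py may differ).
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    device = "cuda" if vram_gb >= 4 else "cpu"
else:
    device = "cpu"
print(f"Running XTTS on: {device}")
```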

I'll research XTTS more and see what I can find for increasing the speed without changing the hardware, though.

DrewThomasson commented 1 month ago

I'll be looking into the XTTS docs this week then to add a way to allow the user to mess with these settings in the web GUI or in headless mode.

It seems there are ways to speed up audio generation on CPU at the cost of audio quality:

https://docs.coqui.ai/en/latest/models/xtts.html#

WUYIN66 commented 1 month ago

In Chinese, these symbols are used for pauses in speech: "。", "，", "、", "：", "；". I will continue to monitor this project and hope you can come up with a solution.

DrewThomasson commented 1 month ago

Thank you so much, getting those symbols has helped a lot.
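
For context, a fix along these lines might split the text on those pause characters and cap each chunk's length before it is sent to XTTS (a hypothetical sketch, not necessarily the project's exact implementation):

```python
import re

# Chinese pause punctuation reported above, plus Western equivalents.
PAUSE_CHARS = "。，、：；.,:;"
MAX_CHARS = 200  # assumed per-chunk limit; XTTS truncates inputs that are too long

def split_chinese_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text at pause punctuation, then pack the pieces into chunks under max_chars."""
    # Keep each punctuation mark attached to the piece before it so the pause survives.
    pieces = re.findall(rf"[^{PAUSE_CHARS}]+[{PAUSE_CHARS}]?", text)
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks

# Example: each chunk is then sent to XTTS separately instead of one long string.
print(split_chinese_text("你好，世界。这是一个测试：它应该按标点切分；然后分块。"))
```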

Applied fix, and updated the docker image:

Input txt: long_chinese.txt

Output Audio: https://github.com/user-attachments/assets/088e3890-76e2-4cd6-9971-6261e5b7ebc0

🤔 Does that output Audio sound correct? 🤔

WUYIN66 commented 1 month ago

Hello, I have listened to the audio in full and compared it with the document. The audio pauses at some separators but not at others, and some words appear to be read repeatedly.

DrewThomasson commented 1 month ago

OK, that's good then!

I also just implemented a way to adjust the XTTS parameters in the GUI and in headless mode; this should help out.

These are the added parameters for headless and GUI mode (a sketch of how they map onto the XTTS inference call follows the list).

  1. --temperature: Controls the randomness of the model's output. Higher values (e.g., >1) result in more creative, diverse, and sometimes nonsensical outputs, while lower values (e.g., <1) make the output more deterministic and focused. The default is 0.65.

  2. --length_penalty: Adjusts the importance of output length. A value greater than 1 discourages longer sequences, while a value below 1 encourages longer outputs. The default is 1.0, meaning no length preference.

  3. --repetition_penalty: Penalizes repeated phrases in the model's output. Higher values (e.g., >1) discourage repetition more strongly, while lower values are more lenient. The default is 2.0, meaning repetition is heavily discouraged.

  4. --top_k: Limits the model to considering only the top k most probable next words during text generation. Lower values lead to more deterministic outputs. A value of 50 means the model samples from the top 50 tokens at each step.

  5. --top_p: Implements nucleus sampling, where the model only considers the most probable set of tokens whose combined probability is p. A value of 0.8 means the model samples from tokens that cumulatively account for 80% of the probability.

  6. --speed: Adjusts the speed at which the speech is generated. A value of 1.0 is normal speed, while higher values make the speech faster and lower values slow it down.

  7. --enable_text_splitting: If set to True, this argument enables splitting the input text into sentences for more controlled processing. Defaults to False.
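
For reference, these flags map roughly onto Coqui's lower-level XTTS inference call like this (a sketch based on the XTTS docs linked above; the paths, reference clip, and exact wiring inside app.py are assumptions):

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# One-time model load (paths are placeholders).
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)

# Voice-cloning latents computed from a short reference clip.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_voice.wav"]
)

# The CLI flags above are passed through to the generation call roughly like this.
out = model.inference(
    "要转换的中文文本。",
    "zh-cn",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.65,
    length_penalty=1.0,
    repetition_penalty=2.0,
    top_k=50,
    top_p=0.8,
    speed=1.0,
    enable_text_splitting=True,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```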

You can see the added parameters here in the GUI:

image

For headless mode, these are the parameters that now show up for python app.py -h:

usage: app.py [-h] [--share SHARE] [--headless HEADLESS] [--ebook EBOOK] [--voice VOICE]
              [--language LANGUAGE] [--use_custom_model USE_CUSTOM_MODEL]
              [--custom_model CUSTOM_MODEL] [--custom_config CUSTOM_CONFIG]
              [--custom_vocab CUSTOM_VOCAB] [--custom_model_url CUSTOM_MODEL_URL]
              [--temperature TEMPERATURE] [--length_penalty LENGTH_PENALTY]
              [--repetition_penalty REPETITION_PENALTY] [--top_k TOP_K] [--top_p TOP_P]
              [--speed SPEED] [--enable_text_splitting ENABLE_TEXT_SPLITTING]

Convert eBooks to Audiobooks using a Text-to-Speech model. You can either launch the
Gradio interface or run the script in headless mode for direct conversion.

options:
  -h, --help            show this help message and exit
  --share SHARE         Set to True to enable a public shareable Gradio link. Defaults
                        to False.
  --headless HEADLESS   Set to True to run in headless mode without the Gradio
                        interface. Defaults to False.
  --ebook EBOOK         Path to the ebook file for conversion. Required in headless
                        mode.
  --voice VOICE         Path to the target voice file for TTS. Optional, uses a default
                        voice if not provided.
  --language LANGUAGE   Language for the audiobook conversion. Options: en, es, fr, de,
                        it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko. Defaults to
                        English (en).
  --use_custom_model USE_CUSTOM_MODEL
                        Set to True to use a custom TTS model. Defaults to False. Must
                        be True to use custom models, otherwise you'll get an error.
  --custom_model CUSTOM_MODEL
                        Path to the custom model file (.pth). Required if using a custom
                        model.
  --custom_config CUSTOM_CONFIG
                        Path to the custom config file (config.json). Required if using
                        a custom model.
  --custom_vocab CUSTOM_VOCAB
                        Path to the custom vocab file (vocab.json). Required if using a
                        custom model.
  --custom_model_url CUSTOM_MODEL_URL
                        URL to download the custom model as a zip file. Optional, but
                        will be used if provided. Examples include David Attenborough's
                        model: 'https://huggingface.co/drewThomasson/xtts_David_Attenbor
                        ough_fine_tune/resolve/main/Finished_model_files.zip?download=tr
                        ue'. More XTTS fine-tunes can be found on my Hugging Face at
                        'https://huggingface.co/drewThomasson'.
  --temperature TEMPERATURE
                        Temperature for the model. Defaults to 0.65. Higher
                        temperatures lead to more creative outputs, i.e. more
                        hallucinations. Lower temperatures give more monotone
                        outputs, i.e. fewer hallucinations.
  --length_penalty LENGTH_PENALTY
                        A length penalty applied to the autoregressive decoder. Defaults
                        to 1.0.
  --repetition_penalty REPETITION_PENALTY
                        A penalty that prevents the autoregressive decoder from
                        repeating itself. Defaults to 2.0.
  --top_k TOP_K         Top-k sampling. Lower values mean more likely outputs and
                        increased audio generation speed. Defaults to 50.
  --top_p TOP_P         Top-p sampling. Lower values mean more likely outputs and
                        increased audio generation speed. Defaults to 0.8.
  --speed SPEED         Speed factor for the speech generation, i.e. how fast the
                        narrator will speak. Defaults to 1.0.
  --enable_text_splitting ENABLE_TEXT_SPLITTING
                        Enable splitting text into sentences. Defaults to True.

Example: python script.py --headless --ebook path_to_ebook --voice path_to_voice
--language en --use_custom_model True --custom_model model.pth --custom_config
config.json --custom_vocab vocab.json

DrewThomasson commented 1 month ago

Hopefully those added XTTS fine-controls will help.

I'll continue working on fixing your other issue of:

"there are some places where separators pause, some places do not pause."

KortanZ commented 1 month ago

I've checked the latest docker logs; it seems the step before "Using model: xtts" takes a really long time no matter which language is chosen.

for root_file in tree.findall('//xmlns:rootfile[@media-type]', namespaces={'xmlns': NAMESPACES['CONTAINERNS']}):
Saved chapter: ./Working_files/temp_ebook/chapter_0.txt
> tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
// --- takes long time here ---
> Using model: xtts 

DrewThomasson commented 1 month ago

That's to be expected as ebook2audiobookxtts uses the same model for all supported languages.

That step is just the script loading the XTTS model into memory for the first time.

Once it's loaded, the script should move at a consistent pace.
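
To illustrate, the pattern looks roughly like this with Coqui's high-level API (a sketch, not the project's exact code): the slow part is the one-time load, and every chapter afterwards reuses the already-loaded model.

```python
from TTS.api import TTS

# One-time load: this is the long "Using model: xtts" step.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Every chapter afterwards reuses the already-loaded model, so it moves at a steady pace.
chapters = ["./Working_files/temp_ebook/chapter_0.txt"]  # placeholder chapter list
for i, path in enumerate(chapters):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tts.tts_to_file(text=text, file_path=f"chapter_{i}.wav",
                    speaker_wav="reference_voice.wav", language="zh-cn")
```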

WUYIN66 commented 1 month ago

I wrote a Chinese readme, excluding the 'Common Issues' section and everything after it. image

WUYIN66 commented 1 month ago

I ran the app.py file to convert Chinese text, but encountered an error during the conversion process, which seems to be an encoding issue. image The file is: testq1.txt

DrewThomasson commented 1 month ago

Regarding the Chinese readme comment:

https://github.com/DrewThomasson/ebook2audiobookXTTS/issues/18#issuecomment-2404330803

About the encoding issue:

DrewThomasson commented 1 month ago

Also

Update:

Your_Input_file:

testq1.txt

Output audio from the free Google Colab

testq1.m4b.zip

Link to Google Colab Notebook I used

Google Colab Notebook

Update 2:

huggingface space

WUYIN66 commented 1 month ago

About the encoding error

I tried again and found that it worked well. The previous error may have been due to my environment, as I was running it locally.

About the Chinese readme

I have requested a merge.

DrewThomasson commented 1 month ago

I'll hit you up when I have another Chinese text question