Closed: WUYIN66 closed this issue 1 month ago
That loading bar is extremely basic because it's split into only 3 steps.
For an accurate view of progress, I would look at your terminal.
ALSO YES!
I know about the truncated audio and character limitations for Chinese text. I've seen that error before, but I have never been able to apply a fix because I don't know which symbols in Chinese text are used for pauses in speaking.
If you supply me with the set of characters that indicate when the speaker is pausing, I will gladly apply a fix.
Also, the program will always run very slowly on CPU only :(
The only known way to speed up audio generation would be to use an NVIDIA CUDA-capable GPU with >= 4 GB of VRAM.
I'll research XTTS further and see what I can find for increasing the speed without changing the hardware, though.
I'll be looking into the XTTS docs this week to add a way for the user to adjust these settings in the web GUI or in headless mode.
It seems there are ways to speed up audio generation on CPU, at the cost of audio quality.
In Chinese, these symbols are used for pauses in speech: ("。", "，", "、", "：", "；"). I will continue to monitor this project and hope you can come up with a solution.
Input txt: long_chinese.txt
Output Audio: https://github.com/user-attachments/assets/088e3890-76e2-4cd6-9971-6261e5b7ebc0
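Given that list of pause characters, here is a minimal sketch of how Chinese text could be chunked on them before synthesis. The function name and chunking strategy are my own for illustration, not the project's current code:

```python
import re

# Chinese punctuation that marks a pause in speech, per the list above.
PAUSE_CHARS = "。，、：；"

def split_on_pauses(text: str) -> list[str]:
    # Split *after* each pause character, keeping the character attached
    # to its chunk so the TTS model still "sees" the pause.
    chunks = re.split(f"(?<=[{PAUSE_CHARS}])", text)
    # Drop empty fragments left behind by trailing punctuation.
    return [c for c in chunks if c.strip()]

print(split_on_pauses("你好，世界。今天天气很好；我们出去走走。"))
```

Each chunk could then be sent to XTTS separately, which keeps every piece well under the character limit that causes the truncation.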
Hello, I have listened to the audio in full and compared it with the document. I found that the separators pause in some places but not in others, and some words seem to be read repeatedly.
Lowering the temperature and raising the repetition_penalty should reduce any hallucinations or repeating of words, and lowering the top_k and the top_p seems to increase the generation speed, even on just CPU. Here is what each parameter does:
--temperature: Controls the randomness of the model's output. Higher values (e.g., >1) result in more creative, diverse, and sometimes nonsensical outputs, while lower values (e.g., <1) make the output more deterministic and focused. The default is 0.65.
--length_penalty: Adjusts the importance of output length. A value greater than 1 discourages longer sequences, while a value below 1 encourages longer outputs. The default is 1.0, meaning no length preference.
--repetition_penalty: Penalizes repeated phrases in the model's output. Higher values (e.g., >1) discourage repetition more strongly, while lower values are more lenient. The default is 2.0, meaning repetition is heavily discouraged.
--top_k: Limits the model to considering only the k most probable next words during text generation. Lower values lead to more deterministic outputs. A value of 50 means the model samples from the top 50 tokens at each step.
--top_p: Implements nucleus sampling, where the model only considers the most probable set of tokens whose combined probability is p. A value of 0.8 means the model samples from tokens that cumulatively account for 80% of the probability.
--speed: Adjusts the speed at which the speech is generated. A value of 1.0 is normal speed, while higher values make the speech faster and lower values slow it down.
--enable_text_splitting: If set to True, this argument enables splitting the input text into sentences for more controlled processing. Defaults to False.
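To make the sampling knobs above concrete, here is a small pure-Python sketch of how top-k and top-p (nucleus) filtering restrict the candidate token set. This is illustrative only; XTTS applies the equivalent logic inside its decoder, and both function names here are made up:

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most probable tokens (top-k sampling's candidate set)."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest most-probable set whose cumulative probability reaches p."""
    kept: dict[str, float] = {}
    total = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

# Toy next-token distribution:
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_filter(probs, 2))    # the two most probable tokens
print(top_p_filter(probs, 0.75)) # smallest set covering at least 75% of the mass
```

Smaller candidate sets mean fewer tokens to consider at each decoding step, which is why lowering top_k and top_p can speed up generation at some cost to variety.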
python app.py -h
usage: app.py [-h] [--share SHARE] [--headless HEADLESS] [--ebook EBOOK] [--voice VOICE]
[--language LANGUAGE] [--use_custom_model USE_CUSTOM_MODEL]
[--custom_model CUSTOM_MODEL] [--custom_config CUSTOM_CONFIG]
[--custom_vocab CUSTOM_VOCAB] [--custom_model_url CUSTOM_MODEL_URL]
[--temperature TEMPERATURE] [--length_penalty LENGTH_PENALTY]
[--repetition_penalty REPETITION_PENALTY] [--top_k TOP_K] [--top_p TOP_P]
[--speed SPEED] [--enable_text_splitting ENABLE_TEXT_SPLITTING]
Convert eBooks to Audiobooks using a Text-to-Speech model. You can either launch the
Gradio interface or run the script in headless mode for direct conversion.
options:
-h, --help show this help message and exit
--share SHARE Set to True to enable a public shareable Gradio link. Defaults
to False.
--headless HEADLESS Set to True to run in headless mode without the Gradio
interface. Defaults to False.
--ebook EBOOK Path to the ebook file for conversion. Required in headless
mode.
--voice VOICE Path to the target voice file for TTS. Optional, uses a default
voice if not provided.
--language LANGUAGE Language for the audiobook conversion. Options: en, es, fr, de,
it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko. Defaults to
English (en).
--use_custom_model USE_CUSTOM_MODEL
Set to True to use a custom TTS model. Defaults to False. Must
be True to use custom models, otherwise you'll get an error.
--custom_model CUSTOM_MODEL
Path to the custom model file (.pth). Required if using a custom
model.
--custom_config CUSTOM_CONFIG
Path to the custom config file (config.json). Required if using
a custom model.
--custom_vocab CUSTOM_VOCAB
Path to the custom vocab file (vocab.json). Required if using a
custom model.
--custom_model_url CUSTOM_MODEL_URL
URL to download the custom model as a zip file. Optional, but
will be used if provided. Examples include David Attenborough's
model: 'https://huggingface.co/drewThomasson/xtts_David_Attenbor
ough_fine_tune/resolve/main/Finished_model_files.zip?download=tr
ue'. More XTTS fine-tunes can be found on my Hugging Face at
'https://huggingface.co/drewThomasson'.
--temperature TEMPERATURE
Temperature for the model. Defaults to 0.65. Higher
temperatures will lead to more creative outputs, i.e. more
hallucinations. Lower temperatures will give more monotone
outputs, i.e. fewer hallucinations.
--length_penalty LENGTH_PENALTY
A length penalty applied to the autoregressive decoder. Defaults
to 1.0.
--repetition_penalty REPETITION_PENALTY
A penalty that prevents the autoregressive decoder from
repeating itself. Defaults to 2.0.
--top_k TOP_K Top-k sampling. Lower values mean more likely outputs and
increased audio generation speed. Defaults to 50.
--top_p TOP_P Top-p sampling. Lower values mean more likely outputs and
increased audio generation speed. Defaults to 0.8.
--speed SPEED Speed factor for the speech generation, i.e. how fast the
narrator will speak. Defaults to 1.0.
--enable_text_splitting ENABLE_TEXT_SPLITTING
Enable splitting text into sentences. Defaults to True.
Example: python script.py --headless --ebook path_to_ebook --voice path_to_voice
--language en --use_custom_model True --custom_model model.pth --custom_config
config.json --custom_vocab vocab.json
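Putting the flags together, a headless conversion of the Chinese sample file with the speed-oriented settings discussed above might look like this (the flag values are illustrative, not recommendations, and follow the same invocation style as the example above):

```shell
python app.py --headless \
    --ebook long_chinese.txt \
    --language zh-cn \
    --temperature 0.65 --top_k 30 --top_p 0.8 \
    --enable_text_splitting True
```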
"there are some places where separators pause, some places do not pause."
I've checked the latest Docker logs; it seems the step before "Using model: xtts" takes a really long time, no matter which language is chosen.
for root_file in tree.findall('//xmlns:rootfile[@media-type]', namespaces={'xmlns': NAMESPACES['CONTAINERNS']}):
Saved chapter: ./Working_files/temp_ebook/chapter_0.txt
> tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
// --- takes long time here ---
> Using model: xtts
That step is just the script doing the initial load of the XTTS model into memory.
Once it's loaded in, the script should move at a consistent pace.
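That one-time cost could also be made explicit with a cached loader, so the model is only ever loaded once per process. A sketch, with a placeholder standing in for the real (slow) XTTS load call:

```python
import functools
import time

@functools.lru_cache(maxsize=1)
def get_model():
    # Placeholder for the real XTTS load step; only the first call
    # pays this cost, later calls return the cached instance.
    time.sleep(0.1)  # stand-in for the slow model load
    return object()

start = time.perf_counter()
get_model()                      # slow: performs the "load"
first = time.perf_counter() - start

start = time.perf_counter()
get_model()                      # fast: served from the cache
second = time.perf_counter() - start

print(f"first call: {first:.3f}s, second call: {second:.6f}s")
```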
I wrote a Chinese README, excluding 'Common Issues' and the sections after it.
I ran the app.py file to convert Chinese text but encountered an error during the conversion process, which seems to be an encoding issue. The file is: testq1.txt
https://github.com/DrewThomasson/ebook2audiobookXTTS/issues/18#issuecomment-2404330803
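For what it's worth, encoding errors on Chinese input files often come from decoding with the platform default codec. A hedged sketch of a reader that tries UTF-8 first and then GB18030 (a superset of GBK/GB2312 commonly used for Chinese text); the helper name is hypothetical, not part of app.py:

```python
from pathlib import Path

def read_text_any(path: str) -> str:
    # Read raw bytes, then try explicit codecs instead of relying on the
    # platform default (which is often not UTF-8 on Windows).
    data = Path(path).read_bytes()
    for codec in ("utf-8", "gb18030"):
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes rather than crashing.
    return data.decode("utf-8", errors="replace")

# Demo with a temp file deliberately written as GB18030:
import os
import tempfile
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write("你好，世界。".encode("gb18030"))
print(read_text_any(f.name))
os.remove(f.name)
```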
You'll have to send more info than that one line 😅
Or at least the full traceback error.
88 seconds on a laptop CPU,
down to "> Processing time: 6.19757080078125" on a free Google Colab GPU
I tried again and found that it worked well. The previous error may have been due to my environment, as I was running it locally.
I have requested a merge.
I'll hit you up when I have another Chinese text question
A 6,000-character text file takes over an hour to convert to audio.
It always gets stuck at 30% progress. Is there any good solution? Or was my operation incorrect?