aedocw / epub2tts

Turn an epub or text file into an audiobook
Apache License 2.0

"character limit of 273 for language 'fr'" error #153

Closed Vodou4460 closed 6 months ago

Vodou4460 commented 9 months ago

Dear aedocw,

I've been using your epub2tts script with CoquiTTS_XTTS for French language processing and encountered a couple of issues. Specifically, I frequently ran into the "character limit of 273 for language 'fr'" error and faced problems with empty data. These seemed to stem from processing text segments that were too lengthy for the XTTS system.

To address these, I experimented with two main modifications:

1. Modification of the combine_sentences Function in epub2tts.py:
Rather than combining sentences into longer segments, I tweaked this function to yield each sentence individually. This approach helps in managing the character limit more effectively. Here’s the adjusted function:

def combine_sentences(self, sentences, length=1000):
    # Yield each sentence on its own instead of packing sentences into
    # ~length-character chunks; the length argument is kept only to
    # preserve the original call signature.
    for sentence in sentences:
        yield sentence

2. Preprocessing the Text with a Custom Function:
Additionally, I crafted a separate function for text preparation. This function employs regular expressions to split the text into sentences and then shortens each sentence to fit within the character limit. It also handles the replacement of certain characters for text cleanup.

import re
import datetime
import os

def split_sentences(text):
    # Split at ., ? or ! followed by whitespace; the capture group makes
    # re.split keep the punctuation as separate parts
    parts = re.split(r'(?<![A-Z])([.?!])\s', text)

    # Reconstruct sentences by gluing each text part back to its
    # punctuation mark. Note: any trailing fragment after the last
    # split point is dropped here.
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentences.append(parts[i] + parts[i + 1])
    print(f"Number of detected sentences: {len(sentences)}")
    return sentences

def shorten_sentences(sentences, max_length):
    new_sentences = []
    for sentence in sentences:
        while len(sentence) > max_length:
            # Finding the last punctuation mark
            cut_point = max(sentence.rfind(',', 0, max_length), sentence.rfind(';', 0, max_length))

            # If no punctuation mark is found, look for a space
            if cut_point <= 0:
                cut_point = sentence.rfind(' ', 0, max_length)

            if cut_point > 0:
                new_sentences.append(sentence[:cut_point+1].strip() + '.')
                sentence = sentence[cut_point+1:].strip()
            else:
                new_sentences.append(sentence[:max_length].strip() + '.')
                sentence = sentence[max_length:].strip()

        sentence = sentence.strip()
        if not sentence.endswith('.'):
            sentence += '.'
        new_sentences.append(sentence)
    print(f"Total number of sentences after shortening: {len(new_sentences)}")
    return new_sentences

def replace_characters(text, replace_chars, new_chars):
    for old, new in zip(replace_chars, new_chars):
        text = text.replace(old, new)
    return text

def save_to_file(lines, original_filename, max_length):
    now = datetime.datetime.now()
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    # os.path.splitext handles filenames that contain extra dots
    basename, ext = os.path.splitext(original_filename)
    new_filename = f"{timestamp}_{basename}_split_{max_length}{ext}"
    with open(new_filename, 'w', encoding='utf-8') as new_file:
        for line in lines:
            new_file.write(line + '\n')
            print(f"Writing sentence (length {len(line)}): {line[:50]}...")
    return new_filename

def split_and_save_text_v9(original_filename, max_length=300):
    with open(original_filename, 'r', encoding='utf-8') as file:
        text = file.read()

    sentences = split_sentences(text)
    short_sentences = shorten_sentences(sentences, max_length)
    modified_text = [replace_characters(sentence, ["- ", "\n\n"], ["", ".\n\n"]) for sentence in short_sentences]

    return save_to_file(modified_text, original_filename, max_length)

Functionality Explanation:

  1. split_sentences function: This splits a given text into sentences using a regular expression. It looks for the punctuation marks ., ?, or ! followed by whitespace and requires that the mark not be preceded by a capital letter (to avoid splitting at abbreviations and initials); see the short demo after this list.

  2. shorten_sentences function: It shortens sentences to a specified maximum length. If a sentence is longer than the maximum length, it looks for a suitable point to split the sentence, preferably at a comma or semicolon, or else at a space. Each new sentence is ended with a period.

  3. replace_characters function: This replaces specified characters in the text. It's useful for cleaning up the text or ensuring consistency in formatting.

  4. save_to_file function: This function saves the modified sentences to a new file. The new file's name includes a timestamp for easy identification. It prints part of each sentence as it's saved to provide a progress update.

  5. split_and_save_text_v9 function: This is the main function that orchestrates the process. It reads the text from a file, splits the text into sentences, shortens the sentences if necessary, replaces certain characters, and then saves the modified sentences to a new file. The maximum sentence length can be specified, with a default value of 300 characters.
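
For illustration (this demo is not part of the script itself), here is what the re.split call in split_sentences returns on a tiny sample. The alternating text/punctuation parts are why the reconstruction loop steps by 2, and the tail shows the caveat noted in the code comment:

import re

parts = re.split(r'(?<![A-Z])([.?!])\s', "Bonjour. Comment ça va? Bien.")
print(parts)
# ['Bonjour', '.', 'Comment ça va', '?', 'Bien.']
# The loop pairs parts[0]+parts[1] and parts[2]+parts[3];
# the final 'Bien.' has no split point after it and is dropped.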

# Use this function to process your file
new_filename = split_and_save_text_v9("psychotherapie-de-la-dissociation-et-du-trauma.txt")
print(f"New file created: {new_filename}")

This preprocessing ensures that each sentence passed to the combine_sentences function conforms to the character limits imposed by CoquiTTS_XTTS, greatly reducing errors and improving text-to-speech for French.
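
As a quick sanity check (a suggestion, not part of the script above), one can verify that no line in the generated file exceeds the 273-character limit XTTS reports for French:

# Hypothetical check; new_filename comes from the call above.
with open(new_filename, encoding='utf-8') as f:
    too_long = [line for line in f if len(line.rstrip('\n')) > 273]
print(f"{len(too_long)} line(s) still exceed the limit")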

While my solution is not perfect and is admittedly a bit of makeshift "bricolage" (a quick-and-dirty workaround), I wanted to share it with you. I believe you might find a much better solution, and I am eager to see how this can be further improved.

Best regards,

aedocw commented 9 months ago

At a quick glance, this looks really great! I will take a more careful look through all of it, but it seems like it could really help when the source is a text file. Detecting sentences reliably is not easy; that's why I ended up using NLTK (the Natural Language Toolkit). I like your approach though, and will play around with it some. I would also really like to figure out how to make a good guess at separating text files into chapters so they get useful "part" splits. Other than trying to match on CHAPTER ##, I am not sure how else to approach it (and obviously that won't work if the text file doesn't use the explicit word "chapter" at the start of each chapter).
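
For reference, a minimal sketch of both ideas. The sent_tokenize call is real NLTK (it needs the punkt data downloaded); the chapter heuristic below, including its function name and heading regex, is only illustrative and not what epub2tts actually ships:

import re
from nltk.tokenize import sent_tokenize

def guess_chapter_starts(lines):
    # Rough heuristic: a line like "CHAPTER 12" or "Chapitre 3" on its
    # own marks the start of a new part. This fails for books that
    # never spell out the word "chapter".
    heading = re.compile(r'^\s*(chapter|chapitre|capítulo)\s+\S+\s*$',
                         re.IGNORECASE)
    return [i for i, line in enumerate(lines) if heading.match(line)]

print(sent_tokenize("Bonjour. Comment ça va? Bien, merci.", language='french'))
# expected: ['Bonjour.', 'Comment ça va?', 'Bien, merci.']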

Thank you again for using this and helping to make it better, I really appreciate it!

Vodou4460 commented 9 months ago

Thank you for your enthusiastic response. I understand the appeal of using NLTK for natural language processing, but I've encountered some challenges with this method in my files. Therefore, I propose an alternative that I believe could be more adaptable and less complex.

My idea is to convert EPUB files into text, CSV, or MD files. These formats are easily editable and allow for the manual or semi-automatic insertion of chapter markers. This method would involve searching for and replacing specific titles or formats with predefined tags, inspired by the Markdown format. For example, "Chapter 1" could be replaced with "# Chapter 1" to clearly indicate the start of a new chapter.
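
A rough sketch of that search-and-replace step, assuming chapter titles follow a simple "Chapter N" pattern on their own line (the pattern and function name are illustrative only):

import re

def tag_chapters(text):
    # Prefix recognized chapter titles with a Markdown '#' so a later
    # pass can treat them as part/chapter boundaries.
    return re.sub(r'(?im)^(chapter|chapitre|capítulo)\s+(\w+)\s*$',
                  r'# \1 \2', text)

print(tag_chapters("Chapter 1\n\nIt was a dark and stormy night."))
# prints "# Chapter 1" followed by the untouched body text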

Following our conversation, I've also been thinking about integrating automated preprocessing into the main script. The idea would be to allow maximum flexibility: those looking for a quick and direct solution could opt for the integrated automation, while those who wish to further customize the file could use a separate script for manual preprocessing.

I believe this semi-automatic approach, combined with the option of automated or manual preprocessing, offers significant flexibility, particularly in adapting to different authors' styles and languages. It also allows for manual intervention for those who wish to further customize the structure of their files.

I hope these proposals will be useful for the project, and I am open to any collaboration to further develop these ideas.

aedocw commented 9 months ago

Let's move this to a discussion: https://github.com/aedocw/epub2tts/discussions/158

aedocw commented 8 months ago

This specific error should be addressed once I merge the work associated with https://github.com/aedocw/epub2tts/issues/193

As for the other things brought up, they will be in discussion #158

friki67 commented 7 months ago

Hello @Vodou4460

I've been playing with your script and it works really great. I've made one change to have a more "human" reading experience. I added a silence after each sentence, because there was no pause between each one, and it sounded strange compared to the "natural" pauses in the sentences.

Because I'm using XTTS, I've changed two lines in epub2tts.py, in the read_chunk_xtts function:

line 281: changed if i < len(sentence_list)-1: to if i < len(sentence_list): so that a silence is also applied after the last sentence. line 282: changed the multiplier for the silence duration from 1.0 to 0.6 (1 second sounds too long to me).
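
In isolation, the silence-padding idea looks roughly like this. This is only an illustrative sketch, assuming each synthesized chunk is a NumPy float array at XTTS's 24 kHz output rate; the names are made up and this is not the actual epub2tts code:

import numpy as np

SAMPLE_RATE = 24000  # XTTS output rate

def append_silence(chunk, seconds=0.6):
    # Pad the synthesized sentence with a short pause so consecutive
    # sentences don't run into each other.
    silence = np.zeros(int(SAMPLE_RATE * seconds), dtype=chunk.dtype)
    return np.concatenate([chunk, silence])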

Excuse me if this can be done in a better place or in a more elegant way. I'm not a Python programmer.

If you keep improving your function, please tell us!

@aedocw please consider integrating the "split sentences" approach into your app. This combination works really great!

aedocw commented 7 months ago

I pushed up a branch that incorporates this suggestion, and it works well on a very small sample I tried. I'm going to try it with a full book before merging, but I think this is a nice improvement and I'm glad you suggested it!

friki67 commented 7 months ago

Thank you very much. I've been busy playing with this, and learning some Python to try to understand how things work and how to get it to work better. I'm now testing this code (please excuse my coding). Tomorrow I'll try some more things, like quotes, double quotes, and dialogue punctuation marks (-, .-). For now I have this (using @Vodou4460's code):

import re
import datetime
import os
import fire

def reformat_line(line):
    line = line.strip()
    if not line.endswith("."):
        line += "."
    return line

def split_sentence(line):
    # Using a regular expression to find split points
    parts = re.split(r'(?<![A-Z])([.?!])\s', line)
    # Reconstruct sentences with their punctuation characters;
    # re.split returns alternating text/punctuation parts plus a
    # trailing fragment, which has to be appended separately
    sentences = []
    if len(parts) > 1:
        for i in range(0, len(parts) - 1, 2):
            sentences.append((parts[i] + parts[i + 1]).strip())
        sentences.append(parts[-1].strip())
    else:
        sentences.append(parts[0].strip())
    return sentences

def shorten_sentence(sentence, max_length):
    sentences = []
    while len(sentence) > max_length:
        # Find a "secondary" punctuation mark; failing that, a space;
        # failing that, just cut at max_length
        if (cut_point := max(sentence.rfind(',', 0, max_length),
                             sentence.rfind(';', 0, max_length),
                             sentence.rfind(':', 0, max_length))) <= 0:
            if (cut_point := sentence.rfind(' ', 0, max_length)) <= 0:
                cut_point = max_length

        sentences.append(sentence[:cut_point + 1].strip())
        # Rest of the sentence
        sentence = sentence[cut_point + 1:].strip()
    sentences.append(sentence)
    return sentences

def save_to_file(lines, original_filename, max_length):
    now = datetime.datetime.now()
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    # os.path.splitext handles filenames that contain extra dots
    basename, ext = os.path.splitext(original_filename)
    new_filename = f"{timestamp}_{basename}_split_{max_length}{ext}"
    with open(new_filename, 'w', encoding='utf-8') as new_file:
        for line in lines:
            new_file.write(line + '\n')
    return new_filename

def split_and_save_text(original_filename, max_length=239):
    with open(original_filename, 'r', encoding='utf-8') as file:
        text = file.readlines()

    # "Normalize" the text: delete empty lines and end every line with "."
    # because only lines ending in '.' generate a pause after them.
    # Without this, things like:
    #
    # "Don't explain your philosophy. Embody it."
    #   ― Epictetus
    #
    # were "joined" with the next line of text. The opposite goes for
    # lines processed in shorten_sentence.
    normalized = [reformat_line(line) for line in text if line.strip()]

    # Split sentences at "primary" punctuation marks; lines starting
    # with '#' are chapter markers and pass through untouched
    split_lines = []
    for line in normalized:
        if line.startswith('#'):
            split_lines.append(line)
        else:
            split_lines += split_sentence(line)

    # Split sentences longer than max_length at "secondary" punctuation
    # marks, or at a space
    final_lines = []
    for line in split_lines:
        if len(line) <= max_length:
            final_lines.append(line)
        else:
            final_lines += shorten_sentence(line, max_length)

    print(save_to_file(final_lines, original_filename, max_length))

if __name__ == "__main__":
    fire.Fire(split_and_save_text)

You can use it as python3 thisprogram.py textfile.txt max_length (I set the max_length default to 239 because that is the maximum for Spanish). So far it's working very well: it is able to split sentences without punctuation marks, it sounds "natural" when played all together, and combined with the modifications to epub2tts.py that I wrote about above, the reading sounds really great.

I'm using it only for XTTS.

Once again, thank you very much!

friki67 commented 7 months ago

> I pushed up a branch that incorporates this suggestion, and it works well on a very small sample I tried. I'm going to try it with a full book before merging, but I think this is a nice improvement and I'm glad you suggested it!

Hi. Does this branch do the split in sentences? Or should I keep using the code above to split and then process using this branch?

aedocw commented 7 months ago

This branch splits into sentences, but it would be worth trying it with --debug and looking at the file debug.txt to see if it is splitting where you expect it to. The output will have one line for each sentence it sends to TTS.
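
If it is splitting somewhere unexpected, a quick throwaway snippet like this can inspect that file (assuming debug.txt is one sentence per line, as described):

with open('debug.txt', encoding='utf-8') as f:
    for n, line in enumerate(f, 1):
        # line number, sentence length, and the first 60 characters
        print(n, len(line.rstrip('\n')), line[:60].rstrip())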

friki67 commented 7 months ago

> This branch splits into sentences, but it would be worth trying it with --debug and looking at the file debug.txt to see if it is splitting where you expect it to. The output will have one line for each sentence it sends to TTS.

Ok. I will give it a go tomorrow, comparing with current results. Thanks

friki67 commented 7 months ago

Hello. I've tested the sentences-pause branch with two test texts I have.

The beginning is

Capítulo 1º

Es imposible que un hombre aprenda lo que cree que ya sabe. La dificultad muestra lo que son los hombres

Epicteto

Tal vez te hayas topado con una cita inteligente de un antiguo filósofo estoico o hayas leído un artículo que compartía algunas ideas estoicas inspiradoras. Tal vez un amigo te haya hablado de esa antigua filosofía útil y próspera o hayas estudiado un libro o dos sobre el estoicismo. O, tal vez, aunque hay muy pocas probabilidades, nunca hayas oído hablar de ella.

The only thing I've found is that it joins the 2nd, 3rd and 4th lines, reading

Capítulo 1º Es imposible que un hombre aprenda lo que cree que ya sabe. La dificultad muestra lo que son los hombres Epicteto Tal vez te hayas topado con una cita inteligente de un antiguo filósofo estoico o hayas leído un artículo blah blah blah.....

The rest worked ok. The output:

Computing speaker latents...
Reading from 1 to 1
0%|                                                                                                                                                                 | 0/24 [00:00<?, ?it/s]Capítulo 1º,
------------------------------------------------------
Free memory : 4.304443 (GigaBytes)
Total memory: 7.921936 (GigaBytes)
Requested memory: 0.335938 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7f4bd6000000
------------------------------------------------------
Time to first chunck: 1.2127406597137451
Received chunk 0 of audio length 24576
Skipping whisper transcript comparison
4%|██████▍                                                                                                                                                  | 1/24 [00:01<00:29,  1.28s/it
Es imposible que un hombre aprenda lo que cree que ya sabe,
Time to first chunck: 0.9801130294799805
Received chunk 0 of audio length 51200
Skipping whisper transcript comparison
8%|████████████▊                                                                                                                                            | 2/24 
[00:02<00:25,  1.15s/it] La dificultad muestra lo que son los hombres

Epicteto

Tal vez te hayas topado con una cita inteligente de un antiguo filósofo estoico o hayas leído un artículo que compartía algunas ideas estoicas inspiradoras,
Time to first chunck: 1.231780767440796
Received chunk 0 of audio length 65792
Time to first chunck: 2.5522217750549316
Received chunk 0 of audio length 66816
Time to first chunck: 4.093189477920532
Received chunk 0 of audio length 66816
Time to first chunck: 5.565227270126343
Received chunk 0 of audio length 66816
Time to first chunck: 6.066920280456543
Received chunk 0 of audio length 10240
Skipping whisper transcript comparison
12%|███████████████████▏                                                                                                                                     | 3/24
 [00:08<01:14,  3.57s/it]

This is something I've detected: the lines must end with a punctuation mark. That is why I do:

def reformat_line(line):
    line = line.strip()
    if not line.endswith("."):
        line += "."
    return line

This is redundant if the line already ends with ",", "?", or any other punctuation mark, so maybe something like

line = line.strip()
if line and line[-1] not in (".", "!", "?", ","):
    line += ","

could work.

larry77 commented 7 months ago

Hello, if I update the installation of epub2tts on my machine, will these enhancements be automatically made available?

aedocw commented 7 months ago

The latest update does not include everything noted in this ticket, but it does break the text down into individual sentences, and it includes a consistent pause between each sentence (if you are using XTTS).


aedocw commented 6 months ago

Please share a sample if you are able to reproduce this error with the current release. Since we now only send one sentence at a time to TTS, I think this issue is resolved.