KoljaB / RealtimeTTS

Converts text to speech in realtime

Question about tokenizer #106

Open FivespeedDoc opened 2 months ago

FivespeedDoc commented 2 months ago

Is there any way to adjust the parameters that control how the tokenizer(?) divides sentences? Also, how is sentence splitting done when the program is fed by generator iterators?

For example, suppose I feed this string (or whatever you call it) word by word, with each punctuation mark counting as one word:

"Hello, can you tell me how the sentence splitting works? I want to know the performance difference between nltk and stanza."

Will the tokenizer be able to split the two sentences? If so, is there any way to adjust how it divides them? Or does it just wait until the entire string is received and only then call the TTS engine?

I am using a language other than English, and I noticed little difference between the "nltk" and "stanza" tokenizers.

Simply put, I want to know whether there is a way to make the streamed output start faster when feeding the engine with generator iterators.

I'd really appreciate your help! Thanks

KoljaB commented 2 months ago

Is there any way to adjust tokenizer parameters that how the tokenizer(?) divides the sentences?

Yes, please look into the code that performs the sentence splitting. Most of the parameters of that method that customize sentence splitting are also available in the play and play_async methods of RealtimeTTS.

For example, there is the context_size parameter of the play method, which establishes context for sentence boundary detection by the tokenizers. It sets the number of characters that are additionally presented to the tokenizer after a sentence boundary such as a punctuation mark. A larger context improves the accuracy of sentence boundary detection; a smaller one makes it faster.

May I ask how sentence-splitting is done when the program is configured to (being~by) feed generator iterators?

When you call the play or play_async method, RealtimeTTS starts consuming the generator(s) and retrieves text chunks. It presents the accumulated chunks to the tokenizer, which then tries to detect a full sentence from them.
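
To make the accumulation idea concrete, here is a minimal sketch in plain Python. It is not the RealtimeTTS or stream2sentence implementation: the function name stream_sentences, the SENTENCE_END character set, and the rule "commit a boundary once context_size characters follow it" are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator

SENTENCE_END = ".!?。"  # assumed set of sentence-ending characters


def stream_sentences(chunks: Iterable[str], context_size: int = 12) -> Iterator[str]:
    """Yield sentences as soon as enough text has accumulated (illustrative only)."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            # Find the first candidate boundary in the accumulated text.
            idx = next((i for i, ch in enumerate(buffer) if ch in SENTENCE_END), -1)
            # Commit only once context_size characters follow the boundary.
            if idx == -1 or len(buffer) - (idx + 1) < context_size:
                break
            sentence, buffer = buffer[: idx + 1].strip(), buffer[idx + 1 :]
            if sentence:
                yield sentence
    # The stream has ended: flush whatever is left.
    if buffer.strip():
        yield buffer.strip()


words = ["Hello", ",", " can", " you", " tell", " me", " how", " it", " works", "?",
         " I", " want", " to", " know", "."]
for sentence in stream_sentences(words, context_size=2):
    print(sentence)
```

In this sketch a smaller context_size lets the first sentence leave the buffer after fewer follow-up characters, which mirrors the speed versus accuracy trade-off described above.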

Will the tokenizer be able to split the two sentences?

Yes it should, at least that's its job :)

If so, is there any way to adjust the tokenizer for its way to divide the sentences?

Not beyond what these parameters offer.

Simply put, I would want to know if there is a way to make stream output faster when feeding the engine with generator iterators.

The fast_sentence_fragment parameter "overrides" the tokenizers by also searching for delimiters like "," or "-", which do not mark a full sentence but a "somewhat synthesizable fragment". Using it speeds up TTS generation for the first retrieved sentence, which can be crucial for a really fast answer, at the cost of some synthesis quality. If you don't mind that, you can also tune the other parameters to "fast" settings, for example setting context_size to 2 instead of the default 12, or minimum_sentence_length to 3 or 4 instead of 10.
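
The fragment idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the library's code: first_speakable_piece, its parameters, and both character sets are hypothetical names and values.

```python
from typing import Optional

SENTENCE_END = ".!?。"            # assumed sentence-ending characters
FRAGMENT_DELIMITERS = ",，;；-"   # assumed "fragment" delimiters


def first_speakable_piece(text: str, minimum_length: int = 3,
                          fast_fragment: bool = True) -> Optional[str]:
    """Return the earliest piece of text that could be handed to synthesis."""
    for i, ch in enumerate(text):
        is_end = ch in SENTENCE_END
        is_delimiter = fast_fragment and ch in FRAGMENT_DELIMITERS
        piece = text[: i + 1].strip()
        # Accept the piece only if it is long enough to be worth synthesizing.
        if (is_end or is_delimiter) and len(piece) >= minimum_length:
            return piece
    return None  # nothing speakable yet, keep accumulating


text = "Hello, can you tell me how the sentence splitting works? I want"
print(first_speakable_piece(text))                       # stops at the comma
print(first_speakable_piece(text, fast_fragment=False))  # waits for the "?"
```

With the fragment shortcut enabled, the first piece is available after only a few characters, so synthesis can start much earlier than when waiting for a full sentence.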

FivespeedDoc commented 1 month ago

I did not find a constructor that accepts these keywords, so I had to modify the original package file. Is there anything I'm missing? Also, I find that the second sentence takes significantly longer to generate than the first (the two sentences are similar in length), even after I changed the parameters in the package file to very aggressive settings. I observe (or presume) that the tokenizer only continues feeding the stream after ALL of the remaining chunks have been fed. Any idea what may cause this?

Thanks!

FivespeedDoc commented 1 month ago

Also, is it possible to use another tokenizer (other than nltk or stanza)?

KoljaB commented 1 month ago

I did not find a constructor that accepts these keywords, so I had to modify the original package file. Is there anything I'm missing?

The parameters are not constructor arguments; they belong to the play and play_async methods.

Also, I find that the second sentence takes significantly longer to generate than the first (the two sentences are similar in length), even after I changed the parameters in the package file to very aggressive settings. I observe (or presume) that the tokenizer only continues feeding the stream after ALL of the remaining chunks have been fed. Any idea what may cause this?

Can you provide example code to reproduce this?

Also, is it possible to use another tokenizer (other than nltk or stanza)?

Currently not; you'd need to change the code of the stream2sentence library to do that.

FivespeedDoc commented 1 month ago

Can you provide example code to reproduce this?

Yes, here's the code to reproduce the issue. The example is in Chinese, but I think the issue is easy to observe. I am using the Azure engine with voice="zh-CN-XiaoshuangNeural" and stream.language = "zh-CN". Each '/' in the data marks where a new chunk begins, so the generator mostly feeds one or two Chinese characters into the stream at a time.

import time

def line_generator(data):
    # Split the input data on '/' so that each piece is fed as its own chunk
    lines = data.split('/')
    for line in lines:
        print(line)
        time.sleep(0.01)  # short delay between chunks
        yield line

"""
After the first line is generated and played [ 胡/爷/爷,我/来/给/您/讲/一下/下/周/每/天/的/安/排。 ],
generation does not continue until all of the text has been fed into the stream.
If you change time.sleep(0.01) to time.sleep(0.1), this is very obvious.
"""

# The input data
data = """
胡/爷/爷,我/来/给/您/讲/一下/下/周/每/天/的/安/排。 
周/一/:/9:00-10:00:晨/练/太/极/拳/,/地点/:/活/动/室/
10:30-11:30:园/艺/活/动/菠菜/种/植/,/地点/:/花/园/
14:00-15:00:手/工/制/作/睡/眠/香/囊/,/地点/:/手/工/室/
15:30-16:30:观/看/老/电/影/,/地点/:/影/音/室/

周/二/:/9:00-10:00:八/段/锦/简/化/版/,/地点/:/大/厅/
10:30-11:30:书/法/练/习/,/地点/:/书/画/室/
14:00-15:00:棋/牌/娱/乐/象/棋/、/围/棋/等/,/地点/:/棋/牌/室/
15:30-16:30:养/生/讲/座/春/天/养/生/1/,/地点/:/会/议/室/

周/三/:/9:00-10:00:晨/练/太/极/拳/,/地点/:/活/动/室/
10:30-11:30:园/艺/活/动/种/植/花/卉/,/地点/:/花/园/
14:00-15:00:手/工/制/作/编/织/,/地点/:/手/工/室/
15:30-16:30:音/乐/欣/赏/,/地点/:/影/音/室/

周/四/:/9:00-10:00:坐/式/健/身/操/,/地点/:/活/动/室/
大/厅/
10:30-11:30:绘/画/活/动/素/描/、/水/彩/等/,/地点/:/书/画/室/
14:00-15:00:读/书/会/分/享/读/书/心/得/,/地点/:/阅/读/室/
15:30-16:30:观/看/旅/游/纪/录/片/三/亚/,/地点/:/影/音/室/

周/五/:/9:00-10:00:健/身/操/挖/呀/挖/科/目/三/,/地点/:/活/动/室/
大/厅/
10:30-11:30:记/忆/阅/读/,/地点/:/怀/旧/室/
14:00-15:00:智/能/手/机/使/用/微/信/,/地点/:/会/议/室/
15:30-16:30:桌/游/娱/乐/狼/人/杀/、/剧/本/杀/,/地点/:/娱/乐/室/

周/六/:/9:00-10:00:晨/练/太/极/拳/,/地点/:/活/动/室/
10:30-11:30:园/艺/活/动/种/植/蔬/菜/,/地点/:/花/园/
14:00-15:00:手/工/制/作/剪/纸/,/地点/:/手/工/室/
15:30-16:30:观/看/老/电/影/,/地点/:/影/音/室/

周/日/:/9:00-10:00:聚/会/观/看/老/电/影/,/地点/:/影/音/室/
大/厅/
10:30-11:30:聚/会/观/看/老/电/影/,/地点/:/影/音/室/
大/厅/
14:00-15:00:聚/会/观/看/老/电/影/,/地点/:/影/音/室/
大/厅/
15:30-16:30:聚/会/观/看/老/电/影/,/地点/:/影/音/室/
大/厅/"""

# `stream` is the TextToAudioStream created with the Azure engine described above
response = line_generator(data)
stream.feed(response).play()

This is why I observe (or presume) that the tokenizer only continues feeding the stream after ALL of the remaining chunks have been fed.

Thanks again for your help:)

KoljaB commented 1 month ago

If you use stream.feed(response).play(), RealtimeTTS uses the default tokenizer, which is nltk. For Chinese you want stanza:

stream.play(
    minimum_sentence_length=2,
    minimum_first_fragment_length=2,
    tokenizer="stanza",
    language="zh",
    context_size=2)

Beyond that, I think what is going on goes back to the data. Your first line looks like this: "胡/爷/爷,我/来/给/您/讲/一下/下/周/每/天/的/安/排。" It contains a Chinese sentence-ending character, "。", so the tokenizer has a chance to identify it as a complete sentence. None of the text that follows contains such a character. The tokenizer is presented with Chinese characters but nothing it could derive a sentence boundary from.
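
If missing sentence-ending characters are indeed the cause, one possible workaround is to add a terminator at each line break before the text reaches the stream. The sketch below is a hypothetical helper, not part of RealtimeTTS: mark_line_ends, its terminator argument, and the SENTENCE_END set are assumptions made for illustration.

```python
from typing import Iterable, Iterator

SENTENCE_END = "。！？.!?"  # assumed sentence-ending characters


def mark_line_ends(chunks: Iterable[str], terminator: str = "。") -> Iterator[str]:
    """Insert `terminator` before each line break that does not already follow one."""
    last_char = ""  # last non-whitespace character seen so far, across chunks
    for chunk in chunks:
        out = []
        for ch in chunk:
            if ch == "\n" and last_char and last_char not in SENTENCE_END:
                out.append(terminator)
                last_char = terminator
            out.append(ch)
            if not ch.isspace():
                last_char = ch
        yield "".join(out)


chunks = ["周一", ":太极拳", "\n10:30", ":园艺", "。\n", "14:00"]
print("".join(mark_line_ends(chunks)))
```

Such a wrapper could be placed around the generator before passing it to stream.feed(...), so that every schedule line ends in a character a tokenizer can treat as a boundary. The last line gets no terminator here, on the assumption that the end of the stream flushes the remaining text.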

KoljaB commented 1 month ago

I ran some tests and can finally reproduce it. This does not behave as it should; something may be wrong in stream2sentence. Please give me some time to look into it.

KoljaB commented 1 month ago

I think I have a bugfix for stream2sentence now and hopefully with that one the problem should be gone.
Could you please update to the latest version with

pip install stream2sentence==0.2.4

and tell me if it's better with this one?

FivespeedDoc commented 1 month ago

Sure, it is better now. Since the requirements in RealtimeTTS aren't updated yet, I manually pulled RealtimeTTS from GitHub, changed the stream2sentence entry in requirements.txt to stream2sentence==0.2.4, and built the package from it. With the settings

minimum_sentence_length=2, minimum_first_fragment_length=2, tokenizer="stanza", language="zh", context_size=2

it is much better. It doesn't seem to work with tokenizer="nltk", though. I also find the performance cost extremely high (it almost drained my M3 Max CPU). Is there any way to mitigate this? Thanks

FivespeedDoc commented 1 month ago

If I want to switch to another tokenizer (other than nltk or stanza), which files would need to be modified?

KoljaB commented 1 month ago

I find the performance cost extremely high (it almost drained my M3 Max CPU), is there any way to mitigate this?

You are right, the stanza tokenizer is pretty heavyweight. I don't know of any way to make it more efficient.

If I want to switch to another tokenizer (other than nltk or stanza), which files would need to be modified?

I just updated RealtimeTTS to version 0.4.4. The play methods now have this parameter:

tokenize_sentences (callable)

So you can implement your own sentence splitting algo or hook in another tokenizer here.

Example:

if __name__ == '__main__':
        import re

        def tokenize_sentences(text):
                """
                Splits the input text into sentences using simple heuristics.

                Args:
                        text (str): The input text to be split into sentences.

                Returns:
                        list: A list of sentences.
                """
                # Define sentence-ending punctuation
                sentence_ends = r'[.!?。\n]'

                # Define abbreviations and other exceptions
                abbreviations = r'\b(Mr|Mrs|Dr|Ms|Sr|Jr|etc|e\.g|i\.e|vs|U\.S\.A|D\.C)\.'

                # Split the text into potential sentences
                potential_sentences = re.split(f'({sentence_ends}(?:\\s|$))', text)

                # Combine the split parts back into sentences
                sentences = []
                current_sentence = ''
                for i, part in enumerate(potential_sentences):
                        current_sentence += part

                        # Check if this part ends with sentence-ending punctuation
                        if re.search(sentence_ends + r'(?:\s|$)', part):
                                # Check if the period is part of an abbreviation
                                if not re.search(abbreviations + r'$', current_sentence.strip()):
                                        # Check if the next part starts with a lowercase letter
                                        if i + 1 < len(potential_sentences) and re.match(r'^\s*[a-z]', potential_sentences[i+1]):
                                                continue

                                        sentences.append(current_sentence.strip())
                                        current_sentence = ''

                # Add any remaining text as the last sentence
                if current_sentence:
                        sentences.append(current_sentence.strip())

                return sentences

        # Example usage
        text = "Hello, world! This is a test. Mr. Smith went to Washington D.C. this morning. Is this working?"
        result = tokenize_sentences(text)
        print(result)

        import os
        import time

        from RealtimeTTS import TextToAudioStream, AzureEngine
        engine = AzureEngine(os.environ.get("AZURE_SPEECH_KEY"), os.environ.get("AZURE_SPEECH_REGION"), 
                             voice="zh-CN-XiaoshuangNeural")
                             #voice="zh-CN-XiaoxiaoNeural")
        stream = TextToAudioStream(engine)

        def line_generator(data):
                # Split the input data on '/' so that each piece is fed as its own chunk
                lines = data.split('/')
                for line in lines:
                        line = line + "。"
                        print(f"GEN: {line}")
                        time.sleep(0.01)
                        yield line

        #The input data 
        data = """
        胡/爷/爷,我/来/给/您/讲/一下/下/周/每/天/的/安/排。 
        周/一/:/9:00-10:00:晨/练/太/极/拳/,/地点/:/活/动/室/
        10:30-11:30:园/艺/活/动/菠菜/种/植/,/地点/:/花/园/
        14:00-15:00:手/工/制/作/睡/眠/香/囊/,/地点/:/手/工/室/
        15:30-16:30:观/看/老/电/影/,/地点/:/影/音/室/
        """

        response = line_generator(data)
        stream.feed(response)
        stream.play(
            minimum_sentence_length=1,
            minimum_first_fragment_length=1,
            before_sentence_synthesized = lambda sentence: 
                print("Synthesizing: " + sentence),
            tokenizer="None",
            tokenize_sentences=tokenize_sentences,
            language="zh",
            context_size=2)
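
For Chinese text specifically, a much smaller callable may be enough. The following is a minimal sketch under stated assumptions, not a tested replacement for stanza: tokenize_sentences_zh is a hypothetical name, and it assumes the tokenize_sentences callable receives a string and returns a list of sentences, as in the example above. The ASCII period is deliberately left out of the terminator set so that decimals and similar tokens are not split.

```python
import re
from typing import List

# One sentence = a run of non-terminator characters plus any terminators that follow.
# Line breaks also end a sentence but are not kept in the output.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def tokenize_sentences_zh(text: str) -> List[str]:
    """Split text at Chinese/Western sentence-ending punctuation and line breaks."""
    return [match.strip() for match in _SENTENCE.findall(text) if match.strip()]


sample = "胡爷爷,我来给您讲一下。周一:太极拳!\n地点:活动室"
print(tokenize_sentences_zh(sample))
```

It could be passed the same way as in the example above, via tokenize_sentences=tokenize_sentences_zh in the play call. Being a single regular expression, it should cost far less CPU than a neural tokenizer, at the price of much cruder boundary detection.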