daswer123 / xtts-webui

Webui for using XTTS and for finetuning it
MIT License

Losing whole sentences when generating the wav file #49

Open GamingDaveUk opened 9 months ago

GamingDaveUk commented 9 months ago

We are losing whole sentences, or the ends of sentences, when we generate the audio. At first I thought it was a training issue, but if you regenerate, the sentence that was missed last time comes back fine, only for a different one to be lost.

Here is an example of the text we are putting into it:

Ah, a most intriguing premise! Let us journey to the lively realm of Krystara, where the kingdom of Fowlmore flourished under the rule of King Cluckington the Wise. In this realm, chickens were revered as royalty, and their subjects, humans, were mere servants.

Now, in the heart of Fowlmore, there lived a most peculiar human named Sammy. She had a deep affection for chickens, so much so that she owned a magnificent chicken coop she called "Coop Castle." Sammy could communicate with her feathered friends, understanding their pecks and clucks as if they spoke English.

One day, Sammy stumbled upon a mysterious egg in Coop Castle. As she held it gently in her hands, a voice echoed within her mind, "Sammy, your destiny lies with the power of the Squawkstone." She was bewildered but excited by this revelation.

As the days passed, Sammy discovered that the egg hatched into a peculiar chick named Chirpity McChirpface. This chick possessed an uncanny ability to rally chickens from across Fowlmore. Sammy, intrigued by this newfound power, decided to train Chirpity and his growing army of loyal feathered followers.

And so, Sammy's chicken army began to grow in size and strength, with each recruit swearing allegiance to their fearless leader. They practiced their battle cries, perfecting their formations, and learning tactics from Sammy's extensive knowledge of chicken behavior.

One fateful day, Sammy received a vision from the mysterious voice within her mind. It urged her to march her chicken army to the gates of Fowlmore Castle and demand King Cluckington's surrender. Sammy, filled with determination, rallied her troops and set off on their grand quest.

As they approached the castle gates, Sammy's chicken army was met with astonishment by the royal guards. The sight of thousands of chickens marching in perfect formation left them bewildered. The guards hastily reported this unusual event to King Cluckington, who was both amused and alarmed by the prospect of a chicken rebellion.

King Cluckington summoned his wisest advisors to discuss this unforeseen threat. After much deliberation, they devised a plan to welcome Sammy and her army into the castle courtyard for peaceful negotiations. Little did they know that Sammy's true intentions were far more comedic than catastrophic.

When Sammy entered the castle courtyard, she was greeted by King Cluckington himself, who bowed low in respect before his unexpected visitors. Sammy, with a mischievous grin, presented her demands: a single grain of corn for every chicken in her army. The court erupted in laughter as King Cluckington agreed to her terms, knowing full well that Sammy's army consisted only of chickens who loved corn.

Thus, Sammy's attempt to take over the world through her chicken army turned out to be a hilarious farce. Instead of a coup, she had secured a lifetime supply of corn for her beloved feathered friends. King Cluckington declared Sammy an honorary citizen of Fowlmore and bestowed upon her the title of "Chicken Whisperer."

From that day forth, Sammy continued to live peacefully in Fowlmore, sharing her unique bond with chickens and spreading laughter and joy wherever she went.

Would you like to hear more about Sammy's adventures with her chicken army or perhaps a different tale altogether?

In the first gen it missed: "Sammy, your destiny lies with the power of the Squawkstone." She was bewildered but excited by this revelation.

In the second gen it missed: King Cluckington declared Sammy an honorary citizen of Fowlmore and bestowed upon her the title of "Chicken Whisperer."

78Alpha commented 8 months ago

That is a byproduct of the sentence splitter. It will just drop things here and there.

The painful alternative is to do it sentence by sentence. An automated alternative would be to split the text beforehand (for example, by actual sentence), but when it is phonemized it might still miss parts of the audio at the end.

Example:

The old sentence splitter keeps adding sentences to a chunk until it hits the limit (text_split_length, which this code compares against the character count).

    if text_split_length is not None and len(text) >= text_split_length:
        text_splits.append("")
        nlp = get_spacy_lang(lang)
        nlp.add_pipe("sentencizer")
        doc = nlp(text)
        for sentence in doc.sents:
            if len(text_splits[-1]) + len(str(sentence)) <= text_split_length:
                # if the last sentence + the current sentence is less than the text_split_length
                # then add the current sentence to the last sentence
                text_splits[-1] += " " + str(sentence)
                text_splits[-1] = text_splits[-1].lstrip()
            elif len(str(sentence)) > text_split_length:
                # if the current sentence is greater than the text_split_length
                for line in textwrap.wrap(
                    str(sentence),
                    width=text_split_length,
                    drop_whitespace=True,
                    break_on_hyphens=False,
                    tabsize=1,
                ):
                    text_splits.append(str(line))
            else:
                text_splits.append(str(sentence))

        if len(text_splits) > 1:
            if text_splits[0] == "":
                del text_splits[0]
    else:
        text_splits = [text.lstrip()]

    return text_splits
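
One way to check whether text is being lost in the splitting step or later during synthesis is to run the packing logic on its own and compare the words going in with the words coming out. The sketch below is a simplified stand-in, not the library code: `naive_sentences` is a hypothetical regex replacement for the spaCy sentencizer, and `pack_sentences` and `same_words` are illustrative names that mirror the three branches above.

```python
import re
import textwrap


def naive_sentences(text):
    """Assumption: a crude stand-in for spaCy's sentencizer.
    Splits on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, limit):
    """Greedily pack sentences into chunks, mirroring the splitter's branches."""
    chunks = [""]
    for sentence in sentences:
        if len(chunks[-1]) + len(sentence) <= limit:
            # Sentence still fits: append it to the current chunk.
            chunks[-1] = (chunks[-1] + " " + sentence).lstrip()
        elif len(sentence) > limit:
            # Sentence alone is too long: hard-wrap it on word boundaries.
            chunks.extend(
                textwrap.wrap(sentence, width=limit, break_on_hyphens=False)
            )
        else:
            # Sentence fits on its own: start a new chunk.
            chunks.append(sentence)
    return [c for c in chunks if c]


def same_words(text, chunks):
    """True if the chunks contain exactly the words of the original text, in order."""
    return text.split() == " ".join(chunks).split()
```

If `same_words` returns True for your input, that would suggest the missing sentences are not being dropped by the packing step itself, at least under these simplified assumptions.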

I edited mine to fit my needs, and it seems to work, but the text has to be in a very particular format (all lines end in ". ").

    if text_split_length is not None and len(text) >= text_split_length:
        #text_splits.append("")
        nlp = get_spacy_lang(lang)
        nlp.add_pipe("sentencizer")
        doc = nlp(text)
        for sentence in doc.sents:
            sentence = str(sentence).replace(". ", ". <>")
            frags = sentence.split("<>")
            text_splits += frags
            #text_splits.append(str(sentence))
            print(sentence)
    else:
        text_splits = [text.lstrip()]

    return text_splits

So it might introduce some abnormalities or behave unusually.
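
For text that does not end every sentence in ". " (the story above also uses "!", "?" and closing quotes), a regex-based pre-splitter is one possible alternative. This is only a sketch: `split_into_sentences` is a hypothetical helper, not part of XTTS or xtts-webui, and it will wrongly split after abbreviations such as "Mr. ".

```python
import re

# A sentence: starts at a non-space character and runs lazily up to
# terminal punctuation (plus any closing quote or bracket) that is followed
# by whitespace or the end of the text. Trailing text without terminal
# punctuation is kept as a final sentence.
_SENTENCE = re.compile(r"""\S.*?(?:[.!?]+["')\]]*(?=\s|$)|$)""", re.DOTALL)


def split_into_sentences(text):
    """Split text into sentences without dropping any words."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]
```

Because a period followed by a digit is not treated as a boundary, version strings such as 2.0.2 stay in one piece.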

GamingDaveUk commented 8 months ago

Interesting. I may give that a go. I don't think the developer is too bothered by this issue, or perhaps is not able to replicate it, so I have been looking for a reliable alternative without any luck. If this fixes it I will be very happy.
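
The sentence-by-sentence route mentioned earlier in the thread can be automated along these lines. This is a minimal sketch under stated assumptions: `synthesize` is a hypothetical callable that takes one sentence and returns raw 16-bit mono PCM bytes (it is not an XTTS API), and the 24000 Hz default sample rate is an assumption to adjust for your model.

```python
import wave


def synthesize_per_sentence(sentences, synthesize, out_path, sample_rate=24000):
    """Synthesize each sentence separately and append the audio to one wav file.

    Returns a list of (sentence, seconds) so unusually short clips,
    which may indicate a cut-off sentence, can be spotted and regenerated.
    """
    durations = []
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        for sentence in sentences:
            pcm = synthesize(sentence)  # hypothetical TTS call
            if not pcm:
                raise RuntimeError(f"no audio returned for: {sentence!r}")
            wav.writeframes(pcm)
            durations.append((sentence, (len(pcm) // 2) / sample_rate))
    return durations
```

Generating per sentence means a bad result only costs one sentence to regenerate rather than the whole passage.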

efh8fh8h commented 6 months ago

This happens when you use a finetuned model with some bad training data. With the base 2.0.2 model everything works as expected. After manually curating all the wav files and the Whisper transcripts, my finetuned models no longer had this issue. Give it a try :)
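
Manual curation can be narrowed down by first flagging clips whose transcript length looks implausible for the audio duration. The sketch below makes several assumptions: a pipe-delimited metadata file with an `audio_file|text|speaker_name` header, audio paths relative to that file, PCM wav files readable by the standard `wave` module, and arbitrary characters-per-second thresholds. `flag_suspect_clips` is a hypothetical helper name.

```python
import csv
import wave
from pathlib import Path


def flag_suspect_clips(metadata_csv, min_cps=5.0, max_cps=25.0):
    """Return (audio_file, chars_per_second) for rows outside the expected range.

    A very high value can mean the transcript contains more text than was
    spoken; a very low value can mean words are missing from the transcript.
    """
    base = Path(metadata_csv).parent
    suspects = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        # Assumed layout: header row, '|' delimiter, no quoting.
        reader = csv.DictReader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            with wave.open(str(base / row["audio_file"]), "rb") as wav:
                seconds = wav.getnframes() / wav.getframerate()
            cps = len(row["text"]) / seconds if seconds else float("inf")
            if not (min_cps <= cps <= max_cps):
                suspects.append((row["audio_file"], round(cps, 1)))
    return suspects
```

The flagged files are only candidates for a manual listen; the thresholds depend on language and speaking rate.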

cwmcd commented 5 months ago

I'm having the same issue with the Colab version. Not only is it losing entire blocks of text, it is also mixing up and repeating text, hallucinating, and producing demon voices or morphing into another voice/gender.