erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, however supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, narrator, model finetuning, custom models, wav file maintenance. It can also be used with 3rd Party software via JSON calls.
GNU Affero General Public License v3.0

Edited text cleaning to support some commonly used fullwidth/CJK punctuations #190

Closed ytt246 closed 4 months ago

ytt246 commented 4 months ago

List of characters added:
u2018 u2019    Left and Right Single Quotation Mark
u201C u201D    Left and Right Double Quotation Mark
u3001    Ideographic Comma
u3002    Ideographic Full Stop
uFF01    Fullwidth Exclamation Mark
uFF0C    Fullwidth Comma
uFF1A    Fullwidth Colon
uFF1B    Fullwidth Semicolon
uFF1F    Fullwidth Question Mark

Replacing u2026 Horizontal Ellipsis with a single full stop.
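A minimal sketch of what a cleaning step along these lines could look like. The function name cleanText and the base allow-list (letters, marks, digits, whitespace, basic ASCII punctuation) are assumptions for illustration, not AllTalk's actual filter; only the added characters and the ellipsis replacement come from the description above.

```javascript
// Hypothetical cleaner: the name and the base allow-list are assumptions.
// Only the characters below and the ellipsis rule come from the PR description.
const ADDED_PUNCTUATION =
  "\u2018\u2019" + // Left and Right Single Quotation Mark
  "\u201C\u201D" + // Left and Right Double Quotation Mark
  "\u3001\u3002" + // Ideographic Comma, Ideographic Full Stop
  "\uFF01\uFF0C" + // Fullwidth Exclamation Mark, Fullwidth Comma
  "\uFF1A\uFF1B" + // Fullwidth Colon, Fullwidth Semicolon
  "\uFF1F";        // Fullwidth Question Mark

// Matches every character that is NOT a letter, combining mark, digit,
// whitespace, basic ASCII punctuation, or one of the added characters.
const DISALLOWED = new RegExp(
  "[^\\p{L}\\p{M}\\p{N}\\s.,!?:;'\"\\-" + ADDED_PUNCTUATION + "]",
  "gu"
);

function cleanText(text) {
  return text
    .replace(/\u2026/g, ".")  // Horizontal Ellipsis -> single full stop
    .replace(DISALLOWED, ""); // drop everything outside the allow-list
}

console.log(cleanText("Wait\u2026 *really*?")); // Wait. really?
console.log(cleanText("你好，世界！"));           // 你好，世界！
```

Replacing the ellipsis before stripping matters here, since the ellipsis itself is not on the allow-list and would otherwise be removed without leaving a sentence boundary behind.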

erew123 commented 4 months ago

Hi @ytt246

I've tested this, and the line let sentences = cleanedText.split(/(?<=\[.!?:\])\\s+|(?<=\[\u3002\uFF01\uFF1A\uFF1F\])/); in the tts_generator breaks English/Latin alphabet splitting.

I believe it actually needs to be let sentences = cleanedText.split(/(?<=[\[.!?:\]\u3002\uFF01\uFF1A\uFF1F\uFF61])\s*/); but as I don't read/speak Chinese or Korean, maybe you would like to try this updated line and test it.
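As a quick way to try the suggested pattern outside the web page, the sketch below runs it on one English and one Chinese string. The wrapper name splitSentences and both sample strings are made up for illustration; the regex is the one quoted above.

```javascript
// The regex is copied from the suggestion above; splitSentences is a
// hypothetical wrapper name used only for this example.
const SENTENCE_BOUNDARY = /(?<=[\[.!?:\]\u3002\uFF01\uFF1A\uFF1F\uFF61])\s*/;

function splitSentences(cleanedText) {
  // Split right after any listed punctuation mark, swallowing the
  // whitespace (if any) that follows it.
  return cleanedText.split(SENTENCE_BOUNDARY);
}

console.log(splitSentences("Here is one. Here is two! And three?"));
// [ 'Here is one.', 'Here is two!', 'And three?' ]

console.log(splitSentences("你好。今天天气很好！我们走吧？"));
// [ '你好。', '今天天气很好！', '我们走吧？' ]
```

The lookbehind keeps the punctuation attached to the sentence it ends, which is why no mark is lost in either result.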

If you change anything and want to test an English/Latin splitting, feel free to use this (there are some purposeful errors in it):

In the heart of the bustling city, amidst the towering skyscrapers, lies a quaint little café: "The Crimson Bean." Here, locals and tourists alike gather to savor the rich aroma of freshly brewed coffee, the tantalizing flavors of artisan pastries, and the warm embrace of camaraderie. As the sun sets, casting hues of orange and pink across the horizon, the café buzzes with life. Laughter echoes through the air, punctuated by the clinking of cups and the gentle hum of conversation. Suddenly, a hush falls over the crowd as a lone violinist enters, filling the room with melodic strains that stir the soul! Customers pause mid-sip, captivated by the enchanting melody: transported to a realm of dreams and memories. Outside, the city pulses with energy, its streets teeming with people rushing to and fro, each with their own story to tell. Amidst the chaos, a solitary figure stands; observing the world with quiet contemplation? With a sigh, they turn away, disappearing into the night, leaving behind the hustle and bustle, seeking solace in the embrace of the stars.

As for the added characters, I believe those should be OK, but a warning on this. The Coqui TTS AI model goes off and makes some very strange sounds when it gets certain punctuation, hence I strip a lot of it from English/Latin. E.g. there is no point passing an * (asterisk) to the TTS engine: it doesn't say the word asterisk, it doesn't pronounce anything differently, it just makes an ooohoelw sound about 50% of the time. So although it looks right to have the asterisk in the text sent to the TTS engine, it doesn't actually change anything or improve the situation with TTS generation. Hence I also strip a lot of things like double quotes etc. before they hit the TTS engine for generation.

Having said that, I cannot speak for how Korean, Chinese, Hindi etc. work or how well the TTS engine pronounces words, but it's quite advisable to strip certain characters out, and I cannot say how these:

u2018 u2019    Left and Right Single Quotation Mark
u201C u201D    Left and Right Double Quotation Mark
u3001    Ideographic Comma
u3002    Ideographic Full Stop
uFF01    Fullwidth Exclamation Mark
uFF0C    Fullwidth Comma
uFF1A    Fullwidth Colon
uFF1B    Fullwidth Semicolon
uFF1F    Fullwidth Question Mark

will interplay with TTS generation and strange sounds. I think these will be ok, well not sure on the quotation marks, but I guess we will see what happens!

I can say that at least the next version of AllTalk (when I finish coding it) will have one single place where you can update this easily and the change passes across the whole code base:

image

So if you want to, give let sentences = cleanedText.split(/(?<=[\[.!?:\]\u3002\uFF01\uFF1A\uFF1F\uFF61])\s*/); a go, and we can update the code if that works.

Let me know when you've checked your end and thanks for the submission :)

ytt246 commented 4 months ago

Yes you're right, let sentences = cleanedText.split(/(?<=[\[.!?:\]\u3002\uFF01\uFF1A\uFF1F\uFF61])\s*/); works, what I had in my code didn't work. However, this will split a!b to ["a!", "b"] even when there is no space. If this is the intended behaviour then great, but if you only want to split a! b but not a!b in English/Latin, you can use let sentences = cleanedText.split(/(?<=[\[.!?:\]])\s+|(?<=[\u3002\uFF01\uFF1A\uFF1F])/);.
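The difference between the two patterns can be seen side by side with a small sketch. Both regexes are copied from the comment above; the constant names and the sample strings are illustrative only.

```javascript
// Both patterns are quoted from the discussion; the names are hypothetical.
// Variant 1: split after any listed mark, whether or not whitespace follows.
const SPLIT_ALWAYS = /(?<=[\[.!?:\]\u3002\uFF01\uFF1A\uFF1F\uFF61])\s*/;
// Variant 2: ASCII marks need trailing whitespace; fullwidth/CJK marks do not.
const SPLIT_ON_SPACE = /(?<=[\[.!?:\]])\s+|(?<=[\u3002\uFF01\uFF1A\uFF1F])/;

for (const sample of ["a!b", "a! b", "你好！再见"]) {
  console.log(JSON.stringify(sample));
  console.log("  always:  ", sample.split(SPLIT_ALWAYS));
  console.log("  on space:", sample.split(SPLIT_ON_SPACE));
}
// "a!b"        always: [ 'a!', 'b' ]       on space: [ 'a!b' ]
// "a! b"       always: [ 'a!', 'b' ]       on space: [ 'a!', 'b' ]
// "你好！再见"   always: [ '你好！', '再见' ]  on space: [ '你好！', '再见' ]
```

The second variant treats the two scripts differently on purpose: Chinese text normally has no space after a full stop, so it splits on the mark alone, while English text splits only when whitespace follows.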

As for the effects of those characters, I've done plenty of tests on each of them. I only tested in Chinese since it's the only one I speak. From my experience, they have a positive impact on the fluency of the audio, except for the quotation marks, where I couldn't tell if they have any impact. I just kept them since the English versions of those are kept.

When testing with Chinese, I find the punctuation doesn't work as well as in English. There are also often some random phrases repeated a second time, especially when generating long audio. However, those problems still exist when I switch all punctuation marks to their English counterparts. So it's probably not caused by the added characters, just the TTS model not being as good in Chinese.

erew123 commented 4 months ago

Hi @ytt246

I was too deep in other code to stop and test this out again. I've managed to now, and all seems good, so I will merge in the PR.

Re: "TTS model being not as good in Chinese" I suspect you are correct. I think on v2 of AllTalk I will make the 2.0.3 model the default XTTS model to download. I can't say if that one will be any better in Chinese, it may be! On the other side of that though, I am hoping to put in loaders for other TTS engines and hopefully make it simple to code up adding new TTS engines/models as they come along (I think I have about 20 different engines already listed as possible ones to add). So that would certainly give some flexibility on generating TTS.

Thanks for helping out though, and I'll merge this in now.

All the best

ytt246 commented 4 months ago

@erew123 I've just noticed an error in my regex

let sentences = cleanedText.split(/(?<=[\[.!?:\]])\s+|(?<=[\u3002\uFF01\uFF1A\uFF1F])/); will not split in the following scenario:

“Frequently.” “How often?” “Well, some hundreds of times.”

since the full stop and question mark are not followed by a space. If you would like to change it to let sentences = cleanedText.split(/(?<=[\[.!?:\]]['"\u2018\u2019\u201C\u201D]*)\s+|(?<=[\u3002\uFF01\uFF1A\uFF1F])/); that would ensure the correct splitting for this.
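A small sketch of the scenario described here, run against both the earlier pattern and the corrected one. The regexes are quoted from this thread; the constant names are illustrative only, and the curly quotes are written as \u escapes so the snippet survives copy and paste.

```javascript
// Both patterns are quoted from the thread; the names are hypothetical.
const WITHOUT_QUOTES = /(?<=[\[.!?:\]])\s+|(?<=[\u3002\uFF01\uFF1A\uFF1F])/;
// Corrected: allows any run of closing quotes between the mark and the space.
const WITH_QUOTES =
  /(?<=[\[.!?:\]]['"\u2018\u2019\u201C\u201D]*)\s+|(?<=[\u3002\uFF01\uFF1A\uFF1F])/;

// The dialogue from the comment above: each sentence ends with a mark
// followed by a closing quote, then a space.
const dialogue =
  "\u201CFrequently.\u201D \u201CHow often?\u201D " +
  "\u201CWell, some hundreds of times.\u201D";

console.log(dialogue.split(WITHOUT_QUOTES).length); // 1 (no split at all)
console.log(dialogue.split(WITH_QUOTES));
// [ '“Frequently.”', '“How often?”', '“Well, some hundreds of times.”' ]
```

The fix relies on JavaScript allowing variable-length lookbehind, so the quantifier inside (?<=...) is valid here; the same pattern would not be portable to regex engines that require fixed-width lookbehind.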

Thanks and sorry for the oversight

erew123 commented 4 months ago

@ytt246 No problem. To be honest, it's nice to have someone else take a look at it. Splitting text out is a real pain in the butt; I certainly suffered plenty of outlier issues in the past.

Anyway, I've given it a quick run through a couple of English scenarios and that seems to split fine to me:

image

I will assume you've checked Chinese etc., so I'll post it up just after I send you this!

All the best! Thanks