152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0
771 stars · 179 forks

Include CVVP amount argument #15

Closed benorelogistics closed 1 year ago

benorelogistics commented 1 year ago

The CVVP Model is still in the system. Add the option to use it.
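Exposing the existing model as a command-line option could look roughly like this (a sketch with argparse; the actual flag name and help text in the PR may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# cvvp_amount weights the CVVP model against CLVP when scoring
# candidate clips; 0.0 disables CVVP entirely.
parser.add_argument('--cvvp_amount', type=float, default=0.0,
                    help='How much the CVVP model should influence '
                         'candidate selection (0 disables it).')
args = parser.parse_args(['--cvvp_amount', '0.5'])
print(args.cvvp_amount)  # 0.5
```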

station384 commented 1 year ago

This combined pull request adds some missing arguments to do_tts.py and read.py. It also changes the default sentence length that read.py processes from 200 (with a max of 300) to 100 (with a max of 200) before breaking the sentence. This helps prevent the "end token not found" error, which can occur when the generated speech is too long.

152334H commented 1 year ago

> This also changes the default sentence length that read.py processes from 200 (with a max of 300) to 100 (with a max of 200) before breaking the sentence. This helps prevent the "end token not found" error, which can occur when the generated speech is too long.

I'll merge this, but @station384 @benorelogistics could I ask why this adjusts the values passed into read.py, rather than the default args for split_and_recombine_text() in utils/text.py?

benorelogistics commented 1 year ago

The reason I chose this path was to follow the pattern of the other nullable arguments, which this one falls under.

Originally, I started to implement it in the global args, which worked if I included it on the command line but failed if I excluded it during testing (the preset would be overwritten with a null value). The argparse library has a provision for defaults, but a default would override the value stored in the preset, which causes a chicken-and-egg issue: one value would always override the other. So, looking at the other parameters, I noticed there were two that were handled in a slightly different way, which seemed to let them be excluded if the value was omitted from the command line. So I went this route, which is not optimal, but I didn't want to dig deeper into api.py to compensate for those two side conditions that were not my focus and were already functioning.

Also. Think I was tired.

I figure in the next few days, when work lets up, I'll take another look at it. It does bug me that the parameters are handled in two places, and this can be common to any application that utilizes args.py. I figure there has to be a pattern I'm just not readily aware of for accomplishing my intent using only the arg library.
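The "nullable argument" pattern described above, where the preset wins unless the user explicitly passes a flag, can be sketched like this (illustrative names only; the real preset handling in api.py differs):

```python
import argparse

# Preset values that should win unless the user explicitly overrides them.
preset = {'num_autoregressive_samples': 96, 'cvvp_amount': 0.0}

parser = argparse.ArgumentParser()
# default=None marks "not given on the command line", so the preset
# value is not clobbered by an argparse default.
parser.add_argument('--cvvp_amount', type=float, default=None)
args = parser.parse_args([])  # simulate omitting the flag

settings = dict(preset)
# Only arguments the user actually supplied override the preset.
settings.update({k: v for k, v in vars(args).items() if v is not None})
print(settings['cvvp_amount'])  # 0.0 -- the preset survives
```

With this arrangement there is no chicken-and-egg: argparse never supplies a real value unless the user does.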

station384 commented 1 year ago

> This also changes the default sentence length that read.py processes from 200 (with a max of 300) to 100 (with a max of 200) before breaking the sentence. This helps prevent the "end token not found" error, which can occur when the generated speech is too long.
>
> I'll merge this, but @station384 @benorelogistics could I ask why this adjusts the values passed into read.py, rather than the default args for split_and_recombine_text() in utils/text.py?

I changed it only in read.py because the defaults supplied in split_and_recombine are perfectly valid in a lot of circumstances, say when the parsed text uses words with fewer syllables; it doesn't have an issue then, and this includes a lot of YA writing. But when you start including words of 3+ syllables, or multiple conjugations, the error crops up: the function has a hard time knowing when to split, just chooses the max, and the error occurs. In all my tests with read.py so far, encoding various books and technical documents, I've found the best generic setting is 100/200. But this is a tradeoff of reliability over efficiency: the larger breaks of 200/300 do process faster, as fewer batches need to be performed, and the flow can be less interrupted. Risking breaks, though, can add steps to the workflow, so when it does break there is a net increase in time, as you spend the saved time re-encoding the broken bits.

Overall, split_and_recombine_text needs some reworking; those two parameters should be set automatically for every batch of text that comes in. Say, instead of processing line by line as it does now, process by paragraph, then figure out where the sentence breaks are, and then figure out the min and max dynamically.
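A rough sketch of that idea, deriving per-paragraph split lengths from the sentences actually present (illustrative only, not the real split_and_recombine_text; the 100/300 bounds are the values discussed above):

```python
import re

def split_lengths_for_paragraph(paragraph, floor=100, ceiling=300):
    """Pick (desired, max) split lengths from the sentences in a paragraph."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]
    if not sentences:
        return floor, 2 * floor
    longest = max(len(s) for s in sentences)
    # Never aim below the longest sentence, never above the ceiling.
    desired = min(max(floor, longest), ceiling)
    return desired, min(2 * desired, ceiling + 100)

d, m = split_lengths_for_paragraph("Short one. Another short sentence here.")
print(d, m)  # 100 200 -- easy text falls back to the conservative defaults
```

Long, clause-heavy sentences would push the limits up instead of silently hitting the max and triggering the end-token error.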

I've been getting around this by preprocessing the text so every paragraph is one text line. This gives a much smoother generation without the occasional issue of accents changing in the middle of a sentence. (Solving that one is beyond me; it has to do with the model, I'm assuming. But I have to say it's kind of funny hearing Morgan Freeman go from a Southern accent to a British accent to an eastern American one, all in one sentence.)
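That preprocessing step (one paragraph per line) takes only a few lines of Python, assuming blank lines separate paragraphs in the source text:

```python
import re

def one_paragraph_per_line(text):
    """Collapse hard-wrapped lines so each blank-line-separated
    paragraph becomes a single line of text."""
    paragraphs = re.split(r'\n\s*\n', text.strip())
    return '\n'.join(' '.join(p.split()) for p in paragraphs)

wrapped = "First line of\na paragraph.\n\nSecond\nparagraph."
print(one_paragraph_per_line(wrapped))
# First line of a paragraph.
# Second paragraph.
```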