Merge of coqui voices broke stuff

aedocw commented 7 months ago

Due to a lack of complete testing, the merge that made studio voices work also broke a bunch of other stuff. This merge fixes that, and also includes a test script that I will use in the future to validate a few common use cases. It would be nice to add some real tests to CI, but the test runners do not have a GPU, so that would not be very useful.

danielw97 commented 7 months ago

Hi again. If you'd rather have a separate issue for this, let me know. In my testing just now, after pulling your most recent commit, I'm getting the following error; I believe xtts is being sent a bigger text chunk than it can handle. This is using one of the coqui studio voices, btw.

Error: ❗ XTTS can only generate text with a maximum of 400 tokens. ... Retrying (0 retries left)

This is a longer paragraph, although a finetuned model handled the same book without this problem last week. Thanks for all of your work.

aedocw commented 7 months ago

Hmm, I might need to fully revert this then. I have not tested with really long text, so I have not run into the token limit.

I think it's fair to keep it under this issue as the merge of coqui voices sure did break stuff!

For what it's worth though, it is working for me with the current chunk size with epubs. Maybe I need to test with text longer than what I have sent to it so far.


danielw97 commented 7 months ago

Other than that, everything seems to be working. I wonder whether it is possible to incorporate the same segmenting code that is used with xtts (if it isn't already), as I assume the limits are the same? I think this is an outlier with a longer paragraph that xtts handled fine, although I can see how it may cause issues.

danielw97 commented 7 months ago

Also, the same text processing at least in my mind should be used, as I believe this is basically xtts under the hood unless I am incorrect.
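To illustrate the kind of segmenting being asked for here, below is a minimal sketch that groups sentences into chunks that stay under a token budget. It is not the epub2tts code: `count_tokens` and `chunk_text` are hypothetical names, and counting whitespace-separated words is only a stand-in for whatever tokenizer XTTS actually applies, so a real limit would have to be set more conservatively.

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for XTTS tokens.
    # The real tokenizer will usually produce more tokens than this.
    return len(text.split())


def chunk_text(text, max_tokens=400):
    """Group sentences into chunks of at most max_tokens (by count_tokens)."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_tokens is still emitted whole;
    # it would need a further split (for example on commas).
    return chunks
```

For example, `chunk_text("One two three. Four five six. Seven eight nine.", max_tokens=6)` yields two chunks, the first holding the first two sentences.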

aedocw commented 7 months ago

I'm going to reopen this one until things are sorted out.

The difference comes down to how I'm calling XTTS in the two cases. I use the xtts cloning model with their inference streaming approach, but the docs don't indicate how you can use that with their voices.

danielw97 commented 7 months ago

Okay, thanks. No rush of course, especially this time of year with the holidays, although I wanted to let you know about the errors I was seeing. Maybe they've not implemented the streaming approach with their studio voices, although it might be worth asking on the Discord, as it might just not be documented.

aedocw commented 7 months ago

Haha, yes, I have asked on Discord. No answer yet though (and someone else just asked the same question today).

I put a potential fix in the branch "fixes" if you want to try it out when you get a chance.

As far as the holidays, it's OK, this is relaxing and I always sleep better after fixing some bugs :)

Thanks, and happy holidays to you too!

danielw97 commented 7 months ago

Thanks, I've got some time this evening and will test this now. Edit: I've run the troublesome paragraph and that seems to have fixed it, appreciate your quick work on this.

aedocw commented 7 months ago

Thanks, I appreciate your testing and all your feedback!

You should see this, indicating the right xtts version:

```
 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts
```

danielw97 commented 7 months ago

Yes, that's what I got in the end. I made a silly mistake the first time, specifying --model instead of --engine, although that was fixed fairly quickly. All good now.

aedocw commented 7 months ago

Excellent! I'll figure out what's going on with other languages hopefully tonight and merge this branch.

aedocw commented 7 months ago

Found the problem with Coqui voices when reading plain text (rather than epub) and specifying a language other than English. On line 164 I replace all periods with commas if language != en, and that seems to break something along the way (maybe it confuses the segmenter that breaks everything up into individual sentences).

Replacing periods with commas did seem to help for non-English languages, where it would otherwise seem to always pronounce the period at the end of sentences as "dot" or some variation of that. I changed that so it now happens just before the sentence is sent to TTS; hopefully it is still effective for other languages.
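A rough sketch of why the ordering matters, assuming a simple regex-based sentence splitter (the segmenter epub2tts actually uses may behave differently). The function names `split_sentences`, `prepare_for_tts`, and `sentences_for_tts` are hypothetical and only illustrate the idea of splitting first and swapping periods for commas per sentence afterwards.

```python
import re


def split_sentences(text):
    # Assumption: a naive splitter keyed on sentence-ending punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prepare_for_tts(sentence, language):
    # Swap periods for commas only for non-English text, and only on a
    # sentence that has already been split out.
    if language != "en":
        return sentence.replace(".", ",")
    return sentence


def sentences_for_tts(text, language):
    # Split first, then apply the replacement to each sentence.
    return [prepare_for_tts(s, language) for s in split_sentences(text)]


text = "Hola. Adios."

# Replacing before splitting leaves the splitter with no periods to key on,
# so the whole text comes back as a single "sentence".
too_early = split_sentences(text.replace(".", ","))

# Replacing after splitting keeps the sentence boundaries intact.
fixed = sentences_for_tts(text, "es")
```

With this splitter, `too_early` contains one item while `fixed` contains two, which matches the guess that the early replacement was confusing the segmenter.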

aedocw commented 7 months ago

I believe things are all fixed now, please log bugs as always :)

aedocw / epub2tts

Merge of coqui voices broke stuff #126