punctuation: ’ character appears to not be processed with vits model

aedocw / epub2tts

Turn an epub or text file into an audiobook

Apache License 2.0


punctuation: ’ character appears to not be processed with vits model #106

Closed danielw97 closed 7 months ago

danielw97 commented 7 months ago

Hello again. I've just been listening to some of the output I've processed, and have noticed that with the vits model, words like "I'd" or "I'm" aren't being processed correctly. I now see that instead of the traditional apostrophe ('), another variation (’) is being used. I've observed a similar pattern for quotation marks, although I believe those are stripped regardless. I'm not sure whether there is a way to normalize these characters, or whether ' and ’ get stripped if they cause issues for the models. This happens on both Windows and Linux.

aedocw commented 7 months ago

I've noticed that apostrophes are sometimes not processed properly too. Right now the way I'm stripping some punctuation and entirely dropping other stuff is clunky and naive; it could definitely use improvement.

danielw97 commented 7 months ago

All good, not to worry. My main thinking was to flag it up just so you are aware and those are the two symbols I noticed it struggling with.

aedocw commented 7 months ago

I've pushed up changes in the linked branch (improve-punctuation-handling) and it seems to work nicely. It also addresses an issue I had noticed where sometimes new-lines in a file would cause the last word on one line to be joined with the first word on the next line. That might have been from changing the tokenizer that detects sentences, not sure.
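For anyone hitting the joined-words symptom described above, here is a minimal sketch of one way to avoid it: collapse every run of whitespace, including line breaks, into a single space before the text goes to a sentence tokenizer. The function name `collapse_line_breaks` is hypothetical and this is not taken from the linked branch; it only illustrates the idea.

```python
def collapse_line_breaks(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space.

    Hypothetical helper for illustration; not part of epub2tts.
    """
    # str.split() with no arguments splits on any whitespace run and
    # discards empty strings, so joining with " " leaves single spaces.
    return " ".join(text.split())


sample = "the last word on one line\nthe first word on the next"
print(collapse_line_breaks(sample))
```

Note that this also flattens paragraph breaks, so it would need adjusting if blank lines are meant to mark pauses or chapter boundaries.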

I'll wait until someone else validates this is an improvement before merging - I need to remember I'm not the only user, haha!

danielw97 commented 7 months ago

I've just done a quick test, and this fixes the issue with the ' character. However, the ’ character is being pronounced as an accent of some kind. Is there any way to normalize this, i.e. replace ’ with ', since at least in my testing the ’ character is present in some ebooks as an apostrophe? Thanks as always for the quick turnaround.
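As a rough illustration of the kind of normalization being asked about, typographic quotes can be mapped to their ASCII equivalents with Python's built-in `str.maketrans` and `str.translate`. The names `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical, and this is only a sketch of the general approach, not the code in the branch.

```python
# Hypothetical mapping for illustration; not taken from epub2tts.
PUNCTUATION_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark, often used as an apostrophe
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_punctuation(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("I\u2019d say I\u2019m fine"))
```

Running this on the text before it reaches the TTS model would mean the model only ever sees the plain ' character, whichever variant the ebook used.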

danielw97 commented 7 months ago

Just to say that I've tried this again, and both characters render properly this time.

aedocw commented 7 months ago

Excellent!