Closed danielw97 closed 7 months ago
I've noticed that apostrophes are sometimes not processed properly too. Right now the way I'm stripping some punctuation and entirely dropping other stuff is clunky and naive; it could definitely use improvement.
All good, not to worry. I mainly wanted to flag it so you're aware; those are the two symbols I noticed it struggling with.
I've pushed changes to the linked branch (improve-punctuation-handling) and it seems to work nicely. It also addresses an issue I had noticed where newlines in a file would sometimes cause the last word on one line to be joined with the first word on the next line. That might have come from changing the tokenizer that detects sentences, not sure.
I'll wait until someone else validates this is an improvement before merging - I need to remember I'm not the only user, haha!
I've just done a quick test, and this fixes the issue with the ' character. However, the ’ character is being pronounced as an accent of some kind. Is there any way to normalize this, i.e. substitute ' for ’? At least in my testing, the ’ character is used as an apostrophe in some ebooks. Thanks as always for the quick turnaround.
Just to say that I've tried this again, and both characters render properly this time.
Excellent!
Hello again, I've just been listening to some of the output I've processed, and have noticed that with the vits model, words like "I'd" or "I'm" aren't getting processed correctly. I now see that instead of the traditional apostrophe (') another variation (’) is being used. I've observed a similar pattern for quotation marks, although I believe those are stripped regardless. I'm not sure if there is a way to normalize these characters, or whether ' and ’ could simply be stripped if they also cause issues for the models. This happens on both Windows and Linux.
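For anyone reading later, here is a minimal sketch of the kind of normalization being asked about: mapping typographic apostrophes and quotation marks to their plain ASCII equivalents before the text reaches the model. This is not the code from the improve-punctuation-handling branch; the names `QUOTE_MAP` and `normalize_quotes` are hypothetical, and the set of characters covered is an assumption based on the symbols mentioned in this thread.

```python
# Hypothetical helper, not the project's actual implementation.
# Maps common typographic quote characters to plain ASCII ones.
QUOTE_MAP = str.maketrans({
    "\u2019": "'",   # ’ right single quotation mark (often used as an apostrophe)
    "\u2018": "'",   # ‘ left single quotation mark
    "\u201c": '"',   # “ left double quotation mark
    "\u201d": '"',   # ” right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes and apostrophes replaced by ASCII ones."""
    return text.translate(QUOTE_MAP)


if __name__ == "__main__":
    print(normalize_quotes("I\u2019d say \u201chello\u201d"))
```

Running this kind of step before any punctuation stripping would mean contractions like "I’d" and "I'd" reach the model in the same form, whichever character the ebook happens to use.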