Closed by rejuce 6 months ago
I think I'll add a check during segmentation so it throws out anything that has no real characters (letters or numbers) at all. I've seen similar things happen, like "..." getting segmented as its own sentence, and TTS just makes some random sounds. I expect that requiring every sentence to have at least one such character should pretty effectively prevent issues like this.
Thanks for logging the bug with this information, really appreciate it! I should be able to get to this soon.
I can't seem to trick the tokenizer into creating just a single item with just a quote in it. I added stuff in the "dont-say-that" branch to drop any sentences that do not have any letters or numbers in them though. If you can check that branch out and try it against the copy you have I would appreciate it. Thanks!
I manually inserted some items like just '"' as a sentence, and some others that had nothing but punctuation, and they are safely removed with this. I'm going to merge it, as I'm eager to do everything possible to sanitize what gets sent to TTS.
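For anyone following along, here is a rough sketch of the kind of filter described above. It is not the code from the branch; the function names `has_speakable_text` and `drop_unspeakable` are made up for illustration, and it assumes sentences arrive as a plain list of strings after segmentation.

```python
from typing import List


def has_speakable_text(sentence: str) -> bool:
    """Return True if the sentence has at least one letter or digit."""
    # str.isalnum() is False for punctuation, whitespace, and quote marks,
    # including curly quotes such as the stray closing quote in the log.
    return any(ch.isalnum() for ch in sentence)


def drop_unspeakable(sentences: List[str]) -> List[str]:
    """Keep only sentences that contain something TTS can pronounce."""
    return [s for s in sentences if has_speakable_text(s)]


if __name__ == "__main__":
    sentences = [
        "The dungeon theory on the origin of life, huh?",
        "\u201d",  # a lone closing curly quote
        "...",
        '"',
    ]
    print(drop_unspeakable(sentences))
```

Filtering after segmentation, rather than editing the source text, leaves quotes inside real sentences untouched and only removes items that have nothing to read aloud.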
How do I update? Pull and then run pip install . again?
Yes, update with "git pull" and then you can do "pip install . --upgrade"
Text splitted to sentences.
['The new field of Monsterology that Ichiha created has led to new perspectives on the origin of man, animal, plant, and monster.', '”', 'The dungeon theory on the origin of life, huh?', "This was also information we hadn't made public, but, Yeah, she has a point.", '“In my view, this is a phenomenon that fuses science and religion, and links our worlds together. You must have some sense of that yourself,” Genia continued, sounding uncharacteristically serious.']
Possible workaround: remove all " characters before piping the text to TTS?
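A minimal sketch of that workaround, assuming the text is available as a Python string before it is handed to the tool. The name `strip_quotes` is hypothetical, and the curly quotes are included because the stray item in the log above is a curly closing quote rather than a straight one.

```python
# Straight and curly double quotes to remove before segmentation.
QUOTE_CHARS = '"\u201c\u201d'
_STRIP_QUOTES = str.maketrans("", "", QUOTE_CHARS)


def strip_quotes(text: str) -> str:
    """Return text with straight and curly double quotes removed."""
    return text.translate(_STRIP_QUOTES)


if __name__ == "__main__":
    line = "\u201cIn my view,\u201d Genia continued."
    print(strip_quotes(line))
```

Note that this changes the text itself, so a tokenizer that relies on quotes to find sentence boundaries may split things differently afterwards.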