collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License
3.93k stars 214 forks

Unknown error message, just FYI #120

Open BBC-Esq opened 7 months ago

BBC-Esq commented 7 months ago

I'm getting the following error with some slight variations, but it's basically the same.

Error processing text to audio: cannot reshape tensor of 0 elements into shape [1, 0, 12, -1] because the unspecified dimension size -1 can be any value and is ambiguous

It's occurring when WhisperSpeech tries to play back certain text like the following, which are citations to Georgia statutes:

1 O.C.G.A. § 15-11-145(g).
2 O.C.G.A. § 15-11-145(h).
3 O.C.G.A. § 15-11-181(a).
4 O.C.G.A. § 15-11-181(b).
5 O.C.G.A. § 15-11-102.

Just FYI, I'm not sure how you'd handle strange non-English characters like section symbols and a variety of other symbols. I could curate the text beforehand, but thought you'd like to know anyway in case there are some precautions you could take internally.

My program, which interacts with an LLM and uses TTS, also uses Bark. Bark screws up as well: it says gibberish and skips a few words, but then picks back up and is able to hobble to the end. Just FYI, it seems like they've done something to handle strange characters.

jpc commented 6 months ago

I am not getting the error you are seeing with these samples. They are not spoken correctly but the model finished generating successfully. Would you mind trying to find a short code snippet with the text which consistently fails for you?

I've also noticed that we lack support for a lot of special symbols. Since they were not in the training set, the model never learned anything sensible about them, so they just end up as random sounds and also confuse the decoding of the subsequent text.

You could try using some regexes to strip them out. Also, the speaking speed we are using (in characters per second) is causing issues here with the numbers, since numbers cannot really be spoken as quickly as normal words.

For the samples you provided this workaround worked quite well for me:

pipe.generate_to_notebook("1 O C G A  15 11 145 g", cps=6)

It seems you don't have to strip the -. In longer text I also noticed that replacing parentheses (with commas) improves the prosody. Like this: …replacing parentheses ,with commas, improves….
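
The regex-based cleanup suggested above could look roughly like the sketch below. It is only an illustration, not part of WhisperSpeech: the function name `sanitize_for_tts` and the set of characters treated as "safe" are assumptions made for this example, and the right whitelist depends on what the model actually saw in training.

```python
import re

# Assumption for this sketch: only ASCII letters, digits, whitespace and a few
# basic punctuation marks are kept. Everything else (e.g. the section sign)
# is replaced with a space. The hyphen is kept, as suggested in the thread.
_UNSUPPORTED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\-]")


def sanitize_for_tts(text: str) -> str:
    """Hypothetical helper: strip unusual symbols and turn parentheses into commas."""
    # "(" becomes ", " and ")" becomes "," so the aside is read as a clause.
    text = re.sub(r"\s*\(\s*", ", ", text)
    text = re.sub(r"\s*\)", ",", text)
    # Replace every character outside the assumed safe set with a space.
    text = _UNSUPPORTED.sub(" ", text)
    # Drop a comma that ends up directly before other punctuation (",." -> ".").
    text = re.sub(r",\s*(?=[.,;:!?])", "", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(sanitize_for_tts("1 O.C.G.A. § 15-11-145(g)."))
```

The cleaned string can then be passed to the call shown above, for example `pipe.generate_to_notebook(sanitize_for_tts(text), cps=6)`. Note that this sketch keeps the periods in "O.C.G.A.", whereas the manual workaround above removed them; whether that matters for the model is untested here.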

sidharthrajaram commented 6 months ago

I receive the same error as @BBC-Esq:

Error: cannot reshape tensor of 0 elements into shape [1, 0, 12, -1] because the unspecified dimension size -1 can be any value and is ambiguous

Inputs that triggered it: "2." "3."

sidharthrajaram commented 6 months ago

It specifically occurs after performing inference repeatedly. Doing inference for "2." repeatedly leads to inference working a bunch of times before resulting in the error.
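
A small harness along these lines may help pin down how many runs it takes before the failure shows up. It is a sketch only: `repeat_until_failure` is a hypothetical helper written for this example, and the `generate` argument stands in for whatever callable performs inference (for instance a lambda wrapping `pipe.generate_to_notebook`).

```python
from typing import Callable, Optional, Tuple


def repeat_until_failure(
    generate: Callable[[str], object],
    text: str,
    max_runs: int = 100,
) -> Tuple[int, Optional[Exception]]:
    """Call generate(text) repeatedly.

    Returns (number of successful runs, first exception raised or None).
    """
    for run in range(max_runs):
        try:
            generate(text)
        except Exception as exc:  # broad on purpose: we want to see any failure
            return run, exc
    return max_runs, None
```

Assuming `pipe` is the pipeline object used elsewhere in this thread, usage might look like `ok, err = repeat_until_failure(lambda t: pipe.generate_to_notebook(t), "2.")`, after which `ok` is the number of runs that succeeded and `err` holds the exception, if any.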

sidharthrajaram commented 6 months ago

Specific error trace on Inference Colab:

[Screenshot 2024-04-17 at 3:17:17 PM]

chazo1994 commented 3 months ago

I face the same issue after performing inference on multiple sentences. Could it be an error in caching k, v?

chazo1994 commented 3 months ago

@BBC-Esq @jpc Have you fixed this issue yet?

BBC-Esq commented 3 months ago

@chazo1994 I haven't had it occur since, but then again I'm using the program in a different context, so it's not trying to say problematic things. If I recall, I did speak with the repository maintainer at some point and he indicated it might have something to do with those kinds of characters. Sorry I can't be of more help.