ukolovda opened this issue 9 months ago
This seems to be language-dependent; I see a similar effect with -l fi and several others. My understanding is that the problem originates from the training data, so in that sense it can only be worked around, not really fixed. The model doesn't give you a "Russian silence" token because there was no such thing in the training data to begin with. It can perhaps give you an English or Italian one, but it's a different set of tokens for each language. Still, I suppose entropy or the compression ratio should give a hint that this is a non-speech portion, even without involving the model?
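To illustrate what I mean by the compression-ratio hint, here's a rough sketch (not whisper.cpp code; the 2.4 threshold is an assumption borrowed from the original Whisper heuristic): repetitive hallucinated text compresses much better than a real transcript, so a simple zlib check can flag suspicious segments.

```cpp
// Sketch only: flag segment text that is suspiciously repetitive by its zlib
// compression ratio. The 2.4 cutoff is assumed from the original Whisper
// compression_ratio heuristic, not from whisper.cpp.
#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

static double compression_ratio(const std::string & text) {
    uLongf dst_len = compressBound(text.size());
    std::vector<Bytef> dst(dst_len);
    if (compress(dst.data(), &dst_len,
                 reinterpret_cast<const Bytef *>(text.data()), text.size()) != Z_OK) {
        return 1.0; // be conservative if compression fails
    }
    return static_cast<double>(text.size()) / static_cast<double>(dst_len);
}

int main() {
    const std::string segment = "Редактор субтитров А.Семкин Редактор субтитров А.Семкин";
    const double ratio = compression_ratio(segment);
    // Highly repetitive output compresses very well -> high ratio -> likely hallucination.
    printf("ratio = %.2f -> %s\n", ratio, ratio > 2.4 ? "suspicious" : "looks ok");
    return 0;
}
```

(Build with `-lz`.)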
Multilingual is a bit tricky anyway, because once you set the language you can't change it (as discussed in #1800). So you can't really detect an "English silence" and then switch languages, unless you cut the sample into smaller pieces with VAD/demucs/whatever (see the sketch below). Btw, I've actually tried giving the model multiple language tokens to see what happens, but it didn't work very well.
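To make the "cut the sample into smaller pieces" idea concrete, a very naive energy-based splitter could look roughly like this (purely illustrative, not a real VAD and not part of whisper.cpp; the thresholds are made up). Each returned chunk could then be transcribed separately, with its own language detection:

```cpp
// Naive energy-based splitter (sketch): cut mono float PCM at long low-energy
// stretches, returning [start, end) sample ranges for each non-silent chunk.
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<size_t, size_t>> split_on_silence(
        const std::vector<float> & pcm, int sample_rate,
        float rms_thold = 0.01f, float min_silence_s = 0.5f) {
    const size_t win         = sample_rate / 100; // 10 ms analysis window
    const size_t min_silence = (size_t)(min_silence_s * sample_rate);

    std::vector<std::pair<size_t, size_t>> chunks;
    size_t chunk_start = 0;
    size_t silence_run = 0;

    for (size_t i = 0; i + win <= pcm.size(); i += win) {
        double e = 0.0;
        for (size_t j = i; j < i + win; ++j) e += (double) pcm[j] * pcm[j];
        const bool silent = std::sqrt(e / win) < rms_thold;

        if (silent) {
            silence_run += win;
        } else {
            if (silence_run >= min_silence) {
                // a long pause just ended: close the previous chunk, start a new one
                if (i - silence_run > chunk_start) {
                    chunks.push_back({chunk_start, i - silence_run});
                }
                chunk_start = i;
            }
            silence_run = 0;
        }
    }
    if (chunk_start < pcm.size()) {
        chunks.push_back({chunk_start, pcm.size()});
    }
    return chunks;
}
```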
I reached the same conclusion for Urdu: the model is limited and not very good for low-resource languages, and it can't handle silence in Urdu. I also couldn't find any VAD model that handled Urdu non-speech well, so I'm stuck with a high WER.
I'm also getting some weird sentences coming out of nowhere in Russian: "Редактор субтитров А.Семкин Корректор А.Егорова" ("Subtitle editor A. Semkin, proofreader A. Egorova").
I found this list of hallucinations as well:
https://gist.github.com/waveletdeboshir/8bf52f04bf78018194f25b2390c08309
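A crude workaround until this is fixed properly could be post-filtering segments against such a list. Just a sketch: whisper_full_n_segments and whisper_full_get_segment_text are the actual whisper.cpp API, but the blocklist below only contains the two phrases quoted above, and the filtering logic is only an illustration:

```cpp
// Sketch of a post-filter (not part of whisper.cpp): drop segments whose text
// matches a known hallucinated subtitle credit. In practice the blocklist
// would be loaded from a file such as the gist linked above.
#include <cstdio>
#include <string>
#include <vector>
#include "whisper.h"

static bool is_known_hallucination(const std::string & text) {
    static const std::vector<std::string> blocklist = {
        "Редактор субтитров А.Семкин",
        "Корректор А.Егорова",
    };
    for (const auto & phrase : blocklist) {
        if (text.find(phrase) != std::string::npos) {
            return true;
        }
    }
    return false;
}

void print_filtered_segments(struct whisper_context * ctx) {
    const int n = whisper_full_n_segments(ctx);
    for (int i = 0; i < n; ++i) {
        const std::string text = whisper_full_get_segment_text(ctx, i);
        if (is_known_hallucination(text)) {
            continue; // skip the segment instead of emitting it
        }
        printf("%s\n", text.c_str());
    }
}
```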
I tried processing a WAV file with only zeroes in its data section. The file duration is 1.2 seconds (attached).
whisper.cpp gives a hallucination (and the wrong duration).
zeroes.zip
I checked it on the latest master branch:
I think this is a bug.
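In case the attachment doesn't work for someone, a similar file can be generated with something like the sketch below. The 16 kHz / mono / 16-bit PCM format is my assumption; only the 1.2 s of all-zero samples matches the attached file.

```cpp
// Sketch for reproducing the issue: write a WAV file whose data section is all
// zeroes. Format (16 kHz, mono, 16-bit PCM) is assumed; duration is 1.2 s.
// Header fields are written directly, so a little-endian host is assumed too.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const uint32_t sample_rate = 16000;
    const uint16_t channels    = 1;
    const uint16_t bits        = 16;
    const uint32_t n_samples   = (uint32_t)(1.2 * sample_rate);

    const uint32_t data_size   = n_samples * channels * bits / 8;
    const uint32_t byte_rate   = sample_rate * channels * bits / 8;
    const uint16_t block_align = channels * bits / 8;
    const uint32_t fmt_size    = 16;
    const uint16_t pcm_format  = 1; // PCM
    const uint32_t riff_size   = 36 + data_size;

    FILE * f = fopen("zeroes.wav", "wb");
    if (!f) return 1;

    fwrite("RIFF", 1, 4, f); fwrite(&riff_size, 4, 1, f); fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmt_size, 4, 1, f);
    fwrite(&pcm_format, 2, 1, f); fwrite(&channels, 2, 1, f);
    fwrite(&sample_rate, 4, 1, f); fwrite(&byte_rate, 4, 1, f);
    fwrite(&block_align, 2, 1, f); fwrite(&bits, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&data_size, 4, 1, f);

    std::vector<int16_t> silence(n_samples, 0); // all-zero samples
    fwrite(silence.data(), sizeof(int16_t), silence.size(), f);
    fclose(f);
    return 0;
}
```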