madkote opened this issue 2 years ago
We store an ivector for speaker adaptation per channel, so different results are possible.
Hi @nshmyrev, thanks for the reply!
Speaker adaptation could explain the difference only if I did not delete the channel explicitly (see del rec).
For each file a new channel is created with rec = vosk.KaldiRecognizer(...) and deleted once the file is processed. In the next iteration, the same files are transcribed again, each with its own "private" channel. So speaker adaptation would only be possible within a channel.
But the issue is that the same audio file is transcribed differently even with its "private" channel. Only the model is shared in memory.
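(A minimal sketch of the pattern just described, assuming 16 kHz mono PCM WAV input; the file names, iteration count, and chunk size are placeholders.)

```python
import json
import wave

import vosk

model = vosk.Model("vosk-model-en-us-0.22")      # loaded once, shared by all channels

files = ["full1.wav", "full4.wav", "full8.wav"]  # placeholder names

for iteration in range(3):
    for path in files:
        wf = wave.open(path, "rb")
        # a fresh "private" channel (recognizer) for each file
        rec = vosk.KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            rec.AcceptWaveform(data)
        print(iteration, path, json.loads(rec.FinalResult())["text"])
        del rec   # channel deleted explicitly once the file is processed
        wf.close()
```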
I agree, the audio sample is not the best to show the issue. Let me record some audio where it also happens.
Side note: the effect of differing transcriptions is only seen when the audio is a bit noisy (driving a car, a phone call with some background), but there is nothing extreme that should confuse the decoder. Since 1 of 3 iterated decodings is wrong/different, I would like to understand why it happens.
Also, when the model is completely reloaded (rerunning the script), the transcriptions are always the same. So the model may cache something, which sounds strange to me.
The effect is visible on both the small and the big en-US models.
Hi @nshmyrev, it would be nice to get your opinion and feedback on this issue.
Attached are 20 waves in fullaudios.zip (16 kHz, multiple sentences in each wave, >30 s total).
vosk==0.3.32
vosk-model-en-us-0.22
As I wrote above, the model is loaded once before decoding. Then the loop runs 3 times (or more if needed); in each iteration over the file list a new decoder is created to decode EACH single file.
The issue: the transcription of the same file may change between iterations. Note that the recognizer is created and destroyed for each file in each iteration.
Results (some examples; the most interesting is full8, where all three transcriptions are different):
full1
the theory
he be alright
he be alright
full4
that speed bart is among his best known what
that speed bart is among his best known works
that speed bart is among his best known works
full8
they were mostly used to home express fruit thompson some hold suburban passenger trains
they were mostly used to home express fruit and the soul hold suburban passenger trains
they were mostly used to home express fruit and the hold suburban passenger trains
full11
it's no use fielding distributor after the horse has bolted
the noise disturbed after the horse has bolted
it's no use fielding distributor after the horse has bolted
full19
the koga then the updated molitor as part of molly does one seppuku
the koga then the updated molitor as part of mali does one seppuku
the koga then the updated molitor as part of molly does one seppuku
Question: could you please explain this behavior? Does it depend on the model, the wave, or the architecture? I would expect the same transcription of a file over iterations with the same model...
First I thought about speaker adaptation, but that happens at the recognizer level (and the recognizer is deleted after each file is decoded).
Great, thanks. I'll try to take a look.
vosk==0.3.32, vosk-model-en-us-0.22. Yes, I have the same problem. I used the same audio file data, but I got a different result every time. For example, sometimes I got the result "a" with a start time of 0.16245 s, but the next time I got the result "a" with a start time of 0.16482 s. Why are the results different?
@Veango There is some slight internal randomization too, so results are expected to differ.
@nshmyrev So I will get different results even with the same code, the same model, and the same audio data?
If the reason is a random method, why not use the same random seed so that it returns the same result?
@nshmyrev Hi Nickolay, any updates or a reasonable explanation for the issue? I also tested 0.3.38 and the behavior is the same (0.3.42 just updates the model handler).
@nshmyrev (I haven't been active here yet, so first: thanks for your awesome work!!!)
This issue seems quite important for creating audio-sample-based unit tests for CI/CD, as recently added to JustSayIt.jl (https://github.com/omlins/JustSayIt.jl). It took quite a bit of extra effort to make sure that the tests do not fail every now and then (still, one of these unit tests fails in roughly every 10th CI/CD run). The PR is here, if of interest: https://github.com/omlins/JustSayIt.jl/pull/56
As possible reasons for this randomization, in general, I would think of the following:
1. a random seed that is not fixed between runs
2. some form of multi-threading
@nshmyrev: could you comment on 1. and 2.: what random seed is used, and is there any form of multi-threading used?
Hello
Kaldi adds random noise to the sound to avoid numerical overflow. You can disable it with --dither=0 in model/mfcc.conf; then the results will be reproducible. See https://github.com/kaldi-asr/kaldi/blob/master/src/feat/feature-window.cc#L90
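(A minimal illustration of that change; keep whatever options the model's mfcc.conf already contains and only add or set the dither line.)

```
# model/mfcc.conf -- existing options stay as they are
--dither=0
```

Note that the model has to be reloaded after editing the config for the change to take effect.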
The system uses the C++ random seed; however, it applies some extra logic on top, and there is no API to control it from Vosk. See
https://github.com/kaldi-asr/kaldi/blob/master/src/base/kaldi-math.cc#L45
We might expose such an API if you really need it.
Thanks @nshmyrev for the details.
> We might expose such an API if you really need it.
All I would need and wish for - as would probably most or all Vosk users, including @Veango and @madkote (?) - is a global switch similar to Vosk.SetLogLevel(vosk_log_level) that allows running Vosk in a "pseudo-random" mode. In this mode, all random seeds would be set to a fixed value, leading to reproducibility between different runs with the same audio input.
This would be very valuable: at present it is quite challenging to create robust unit tests based on audio input, and they can never be guaranteed to always succeed, even if they succeed once, twice, or a hundred times... The package JustSayIt.jl (https://github.com/omlins/JustSayIt.jl) that I have created has reached a complexity where covering a large amount of its functionality with unit tests has become indispensable for its further development (even more so when others start to contribute, which might well happen after my talk about it at the JuliaCon conference in two months...).
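(Purely illustrative of the proposal: SetLogLevel exists in the Python package, whereas SetRandomSeed below is a hypothetical name for the requested switch, not an actual Vosk API at the time of this discussion.)

```python
import vosk

vosk.SetLogLevel(-1)     # existing global switch: silence Vosk/Kaldi logging

# HYPOTHETICAL proposed switch, analogous to SetLogLevel: fix all internal
# random seeds so that decoding the same audio is bit-for-bit reproducible.
# vosk.SetRandomSeed(42)
```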
@omlins OK, let me implement it in the coming week.
Was this implemented?
@lenzo-ka not yet
Any update on this feature implementation? Is the only option still to update ./conf/mfcc.conf by adding --dither=0?
When the model is loaded only once and decoding is run on the same file with a new recognizer each time, the results differ.
Setup
vosk==0.3.32
vosk-model-en-us-0.22
Test
Run decoding on each file 3 times; for every run and file a new recognizer is instantiated. The model is loaded only once.
Now, the results of every run are slightly DIFFERENT from one another.
File: sample_full.wav
Example 1
"... history of the world but i i i will tell you that",
"... history of the world i i i i will tell you that",
Example 2
love a good debate we love a good argument
love a good debate will have a good argument
Example 3
learned that thanks to the difficulties of this year
learned that thanks for the advice difficulties of this year
More interestingly, when the script itself is run 3 times with a single iteration each (so the model is loaded from disk each time), the results are the SAME!
Question
Is it possible that something is cached or not released in the recognizer / model?
Script
sample_full_results.zip files.zip
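(The attached zips contain the actual script and results; the following is only a minimal sketch of the test described above, assuming a 16 kHz mono WAV, with the chunk size chosen arbitrarily.)

```python
import json
import wave

import vosk

model = vosk.Model("vosk-model-en-us-0.22")  # loaded from disk exactly once

def transcribe(path):
    """Decode one file with a freshly instantiated recognizer."""
    with wave.open(path, "rb") as wf:
        rec = vosk.KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
        return json.loads(rec.FinalResult())["text"]

# same file, three runs: with dither enabled the texts can differ
results = [transcribe("sample_full.wav") for _ in range(3)]
for i, text in enumerate(results, 1):
    print(f"run {i}: {text}")
print("identical across runs:", len(set(results)) == 1)
```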