madkote opened this issue 2 years ago
We store an ivector for speaker adaptation per channel, so different results are possible.
Hi @nshmyrev, thanks for the reply!
Speaker adaptation could explain the difference only if I did not delete the channel explicitly (see del rec).
For each file a new channel is created with rec = vosk.KaldiRecognizer(...) and deleted once the file is processed. In the next iteration, the same files are transcribed again, each with its own "private" channel. So speaker adaptation would only be possible within a channel.
But the issue is that the same audio file is transcribed differently even with its "private" channel. Only the model is shared in memory.
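(A minimal sketch of the pattern just described, assuming 16 kHz mono PCM WAV input; the file names, iteration count, and chunk size are placeholders.)

```python
import json
import wave

import vosk

model = vosk.Model("vosk-model-en-us-0.22")      # loaded once, shared by all channels

files = ["full1.wav", "full4.wav", "full8.wav"]  # placeholder names

for iteration in range(3):
    for path in files:
        wf = wave.open(path, "rb")
        # a fresh "private" channel (recognizer) for each file
        rec = vosk.KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            rec.AcceptWaveform(data)
        print(iteration, path, json.loads(rec.FinalResult())["text"])
        del rec   # channel deleted explicitly once the file is processed
        wf.close()
```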
I agree, the audio sample is not the best to show the issue. Let me record some audio where it also happens.
Side note: the effect of differing transcriptions is only seen when the audio is a bit noisy (driving a car, a phone call with some background), but there is nothing extreme that should confuse the decoder. Since 1 of 3 iterated decodings is wrong/different, I would like to understand why it happens.
Also, when the model is completely reloaded (rerunning the script), the transcriptions are always the same. So the model may cache something, which sounds strange to me.
The effect is visible on both the small and the big en-US models.
Hi @nshmyrev, it would be nice to get your opinion and feedback on this issue.
Attached are 20 waves in fullaudios.zip (16 kHz, multiple sentences in each wave, >30 s total).
vosk==0.3.32
vosk-model-en-us-0.22
As I wrote above, the model is loaded once before decoding. Then the loop runs 3 times (or more if needed); in each iteration over the file list a new decoder is created to decode EACH single file.
The issue: the transcription of the same file may change between iterations. Note that the recognizer is created and destroyed for each file in each iteration.
Results (some examples; the most interesting is full8, where all three transcriptions are different):
full1
the theory
he be alright
he be alright
full4
that speed bart is among his best known what
that speed bart is among his best known works
that speed bart is among his best known works
full8
they were mostly used to home express fruit thompson some hold suburban passenger trains
they were mostly used to home express fruit and the soul hold suburban passenger trains
they were mostly used to home express fruit and the hold suburban passenger trains
full11
it's no use fielding distributor after the horse has bolted
the noise disturbed after the horse has bolted
it's no use fielding distributor after the horse has bolted
full19
the koga then the updated molitor as part of molly does one seppuku
the koga then the updated molitor as part of mali does one seppuku
the koga then the updated molitor as part of molly does one seppuku
Question: could you please explain this behavior? Does it depend on the model, the wave, or the architecture? I would expect the same transcription of a file over iterations with the same model...
First I thought about speaker adaptation, but that happens at the recognizer level (and the recognizer is deleted after each file is decoded).
Great, thanks. I'll try to take a look.
vosk==0.3.32, vosk-model-en-us-0.22. Yes, I have the same problem. I used the same audio file data, but I got a different result every time. For example, sometimes I got the result "a" with a start time of 0.16245 s, but the next time I got the result "a" with a start time of 0.16482 s. Why are the results different?
@Veango There is some slight internal randomization too, so results are expected to differ.
@nshmyrev So I will get different results even with the same code, the same model, and the same audio data?
If the reason is a random method, why not use the same random seed so that it returns the same result?
@nshmyrev Hi Nickolay, any updates or a reasonable explanation for the issue? I also tested 0.3.38 and the behavior is the same (0.3.42 just updates the model handler).
@nshmyrev (I haven't been active here yet, so first: thanks for your awesome work!!!)
This issue seems quite important for creating audio-sample-based unit tests for CI/CD, as recently added to JustSayIt.jl (https://github.com/omlins/JustSayIt.jl). It took quite a bit of extra effort to make sure that the tests do not fail every now and then (still, one of these unit tests fails in roughly every 10th CI/CD run). The PR is here, if of interest: https://github.com/omlins/JustSayIt.jl/pull/56
As possible reasons for this randomization, in general, I would think of the following:
1. a random seed that is not fixed between runs
2. some form of multi-threading
@nshmyrev: could you comment on 1. and 2.: what random seed is used, and is there any form of multi-threading used?
Hello
Kaldi adds random noise to the sound to avoid numerical overflow. You can disable it with --dither=0 in model/mfcc.conf; then the results will be reproducible. See https://github.com/kaldi-asr/kaldi/blob/master/src/feat/feature-window.cc#L90
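(A minimal illustration of that change; keep whatever options the model's mfcc.conf already contains and only add or set the dither line.)

```
# model/mfcc.conf -- existing options stay as they are
--dither=0
```

Note that the model has to be reloaded after editing the config for the change to take effect.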
The system uses the C++ random seed; however, it applies some extra logic on top, and there is no API to control it from Vosk. See
https://github.com/kaldi-asr/kaldi/blob/master/src/base/kaldi-math.cc#L45
We might expose such an API if you really need it.
Thanks @nshmyrev for the details.
> We might expose such an API if you really need it.
All I would need and wish for - as would probably most or all Vosk users, including @Veango and @madkote (?) - is a global switch similar to Vosk.SetLogLevel(vosk_log_level) that allows running Vosk in a "pseudo-random" mode. In this mode, all random seeds would be set to a fixed value, leading to reproducibility between different runs with the same audio input.
This would be very valuable: at present it is quite challenging to create robust unit tests based on audio input, and they can never be guaranteed to always succeed, even if they succeed once, twice, or a hundred times... The package JustSayIt.jl (https://github.com/omlins/JustSayIt.jl) that I have created has reached a complexity where covering a large amount of its functionality with unit tests has become indispensable for its further development (even more so when others start to contribute, which might well happen after my talk about it at the JuliaCon conference in two months...).
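(Purely illustrative of the proposal: SetLogLevel exists in the Python package, whereas SetRandomSeed below is a hypothetical name for the requested switch, not an actual Vosk API at the time of this discussion.)

```python
import vosk

vosk.SetLogLevel(-1)     # existing global switch: silence Vosk/Kaldi logging

# HYPOTHETICAL proposed switch, analogous to SetLogLevel: fix all internal
# random seeds so that decoding the same audio is bit-for-bit reproducible.
# vosk.SetRandomSeed(42)
```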
@omlins OK, let me implement it in the coming week.
Was this implemented?
@lenzo-ka not yet
Any update on this feature implementation? Is the only option still to update ./conf/mfcc.conf by adding --dither=0?
When the model is loaded only once and decoding is run on the same file with a new recognizer each time, the results differ.
Setup
vosk==0.3.32
vosk-model-en-us-0.22
Test
Run decoding on each file 3 times; for every run and file a new recognizer is instantiated. The model is loaded only once.
Now, the results of every run are slightly DIFFERENT from one another.
File: sample_full.wav
Example 1
"... history of the world but i i i will tell you that",
"... history of the world i i i i will tell you that",
Example 2
love a good debate we love a good argument
love a good debate will have a good argument
Example 3
learned that thanks to the difficulties of this year
learned that thanks for the advice difficulties of this year
More interestingly, when the script itself is run 3 times with a single iteration each (so the model is loaded from disk each time), the results are the SAME!
Question
Is it possible that something is cached or not released in the recognizer / model?
Script
sample_full_results.zip files.zip
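(The attached zips contain the actual script and results; the following is only a minimal sketch of the test described above, assuming a 16 kHz mono WAV, with the chunk size chosen arbitrarily.)

```python
import json
import wave

import vosk

model = vosk.Model("vosk-model-en-us-0.22")  # loaded from disk exactly once

def transcribe(path):
    """Decode one file with a freshly instantiated recognizer."""
    with wave.open(path, "rb") as wf:
        rec = vosk.KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
        return json.loads(rec.FinalResult())["text"]

# same file, three runs: with dither enabled the texts can differ
results = [transcribe("sample_full.wav") for _ in range(3)]
for i, text in enumerate(results, 1):
    print(f"run {i}: {text}")
print("identical across runs:", len(set(results)) == 1)
```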