alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

vosk can become very slow when spoken language is different from model #660

Open philipag opened 3 years ago

philipag commented 3 years ago

When e.g. using vosk-model-en-us-daanzu-20200905 on a modern CPU (single thread), performance can reach about 15x realtime. However, when the spoken language switches to something different from the model (possibly combined with background noise), I sometimes see Vosk slowing down to 0.1x realtime or less. Is it possible to avoid this slow-down scenario somehow? If it is useful I can collect some audio samples that exhibit this, but my guess is that this is fundamental to the implementation.
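For clarity, the "15x realtime" and "0.1x realtime" figures above are decoding speed expressed as audio duration divided by wall-clock decode time. A minimal stdlib sketch (the durations here are illustrative, not measured):

```python
# Decoding speed expressed as a multiple of realtime:
# "15x realtime" means 15 seconds of audio are decoded per wall-clock second.
def realtime_speed(audio_seconds: float, decode_seconds: float) -> float:
    return audio_seconds / decode_seconds

# Illustrative numbers matching the report above:
print(realtime_speed(60.0, 4.0))    # fast case: 15.0 (15x realtime)
print(realtime_speed(60.0, 600.0))  # degraded case: 0.1 (0.1x realtime)
```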

nshmyrev commented 3 years ago

We need an audio sample.

philipag commented 3 years ago

The issue is a bit more complicated and shows itself especially when running multiple threads on problematic data. The following test project (with sample.wav file) exhibits the problem. It runs as many threads as the CPU supports and should therefore cause close to 100% CPU usage for a CPU-bound algorithm (which Vosk is, since it is not using the GPU). However, CPU usage drops to as low as 8% while processing sample.wav. My guess is that Vosk is locking some shared resource which becomes a bottleneck and starves the threads.
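The shape of the saturation test described above can be sketched as follows. This is only an illustration of the harness structure, with a dummy CPU-bound loop standing in for `KaldiRecognizer.AcceptWaveform`; note that pure-Python workers are serialized by the GIL, whereas Vosk's native core releases it, which is why the original C# test can actually saturate all cores.

```python
# Sketch of a saturation test: one worker per hardware thread, each running
# CPU-bound work (a stand-in for per-thread Vosk decoding). If no shared
# resource is contended, total CPU usage should stay near 100%.
import os
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound_work(iterations: int) -> float:
    # Dummy workload; in the real test this would be AcceptWaveform calls.
    acc = 0.0
    for i in range(iterations):
        acc += (i % 7) * 0.5
    return acc

def run_saturation_test(n_workers: int, iterations: int = 200_000) -> float:
    """Run n_workers concurrent workers and return elapsed wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(cpu_bound_work, [iterations] * n_workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    workers = os.cpu_count() or 1
    print(f"{workers} workers finished in {run_saturation_test(workers):.2f}s")
```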

This is what CPU usage looks like (I have a 7900X 10-core/20-thread processor on my machine): (screenshot: CPU usage graph)

Test project: vosk2ThreadProject.zip

philipag commented 3 years ago

@nshmyrev Was the sample sufficient for replicating this on your side?

nshmyrev commented 3 years ago

@philipag sorry, I didn't check yet. It will take me some time.

Most likely it is memory bounded, I doubt there is any lock.

You can probably try our newer model 0.21 and the latest vosk 0.3.31; they should be more stable and more accurate too.

You could probably also try reducing the beam to a smaller value in model.conf and see what happens.
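For context, the model.conf shipped in a Vosk model directory holds Kaldi decoder options along these lines (values are illustrative; the exact contents vary by model). Reducing `--beam` and `--lattice-beam` trades some accuracy for speed and caps how much search effort hard audio can trigger:

```
--min-active=200
--max-active=7000
--beam=13.0
--lattice-beam=6.0
```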

One can also print the lattice size for every result to see how it changes during decoding. I suspect the large lattices produced by your noisy audio make the difference. Some Kaldi code paths may be slow on such lattices; in that case you should sometimes see warning messages about problems with determinization.

philipag commented 3 years ago

@nshmyrev After updating to Vosk 0.3.31 the problem improved considerably, so it seems 0.3.31 fixed or improved something related to this issue. Both vosk-model-en-us-daanzu-20200905 (the same model I used previously) and vosk-model-en-us-0.21 exhibit the problem similarly (although daanzu runs about 1.8x faster), so the issue does not appear to be model related.

Here is CPU usage for vosk-model-en-us-0.21: (screenshot: CPU usage graph)

Still not the expected 100% CPU, but the dips are much shallower. The issue does not appear to be memory related: my system has plenty of free RAM while running the sample, and I cannot imagine a memory bandwidth bottleneck exhibiting itself like this, especially since I have quad-channel memory.

I see no warnings being logged, and although reducing the beam width would improve performance overall, it still seems to me that some other contention is causing threads to sit idle at times.

philipag commented 3 years ago

@nshmyrev Setting e.g. "--beam=10.0" for vosk-model-en-us-0.21 makes the issue worse again, so there is definitely still a problem; it is just highly dependent on data and settings. Note that I stop the application when the first thread completes, so the dip at the end happens while all threads are still running.

(screenshot: CPU usage graph)

madkote commented 2 years ago

@nshmyrev I found this issue by accident and ran the attached files (at the top of the thread) with:

my test: decode each file 3 times; for every run and every file a new recognizer is instantiated, while the model is loaded only once.

Now, the results of each run are slightly DIFFERENT from one another.

File: sample_full.wav

Example 1

Example 2

Example 3

More interestingly, when running the script 3 times with one run each (so the model is loaded from disk each time), the results are the SAME!

Question: Is it possible that something is cached or not released in the recognizer / model?
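One way to make the run-to-run comparison described above mechanical is to hash the decoded text of each output file and compare digests across runs (identical hashes mean identical decoding). A stdlib sketch, assuming output JSON files shaped like those the script writes:

```python
# Hash the 'decoding' list of one output JSON file so runs can be compared.
# Equal digests per filename => deterministic output; differing => not.
import hashlib
import json

def digest_results(path: str) -> str:
    """Return a stable SHA-256 digest of the decoded text in one JSON file."""
    with open(path) as f:
        data = json.load(f)
    blob = json.dumps(data["decoding"], sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```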

Script

import json
import os
import vosk  # @UnresolvedImport

MODEL_VOSK = 'vosk-model-en-us-0.22'
FOLDER_WAV = 'files'

def vosk_shared(model_name: str, runid: str, num_runs: int=1) -> None:
    print()
    print('--- VOSK SHARED :: %s :: %s' % (model_name, runid))
    path_root = os.path.abspath(os.path.dirname(__file__))
    path_wav = os.path.join(path_root, FOLDER_WAV)
    path_mdl = os.path.join(path_root, model_name)
    _frame_ms = 100
    sample_rate = 16_000
    # vosk.SetLogLevel(-1)
    model = vosk.Model(path_mdl)
    chunk_size = (_frame_ms * sample_rate // 1000) * 2
    try:
        for n in range(num_runs):
            c = n + 1
            print('>>> RUN #%s' % c)
            if num_runs == 1:
                _name = 'vosk-%s' % (runid)
            else:
                _name = 'vosk-%s_%s' % (runid, c)
            path_out = os.path.join(path_root, _name)
            if not os.path.exists(path_out):
                os.mkdir(path_out)
            #
            #
            for wf in sorted(os.listdir(path_wav)):
                if not wf.endswith('.wav'):
                    continue
                print(wf)
                wff = os.path.join(path_wav, wf)
                results = []
                rec = vosk.KaldiRecognizer(model, sample_rate)
                rec.SetWords(True)
                try:
                    with open(wff, 'rb') as f:
                        while True:
                            data = f.read(chunk_size)
                            if len(data) == 0:
                                break
                            if rec.AcceptWaveform(data):
                                r = json.loads(rec.Result())
                                if r['text']:
                                    results.append(r['text'])
                        r = json.loads(rec.FinalResult())
                        if r['text']:
                            results.append(r['text'])
                finally:
                    del rec
                results = [r.strip() for r in results if r.strip()]
                file_out = os.path.join(path_out, wf.replace('.wav', '.json'))
                with open(file_out, 'w') as fout:
                    json.dump(dict(filename=wf, decoding=results), fout, indent=2)
    finally:
        del model

def main():
    vosk_shared(MODEL_VOSK, 'shared', num_runs=3)

if __name__ == '__main__':
    main()
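As an aside on the script's chunking: the `chunk_size` arithmetic assumes 16-bit mono PCM, so a 100 ms chunk at 16 kHz is 1600 frames times 2 bytes per sample:

```python
# chunk_size from the script above: frames per chunk * 2 bytes (16-bit mono PCM)
frame_ms = 100
sample_rate = 16_000
chunk_size = (frame_ms * sample_rate // 1000) * 2
print(chunk_size)  # 3200 bytes per 100 ms chunk
```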

sample_full_results.zip


nshmyrev commented 2 years ago

@nshmyrev I found this issue by accident and run attached files (https://github.com/alphacep/vosk-api/issues/309) with:

Is this a separate issue? You'd better report it in a different place.