alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

Kaldi assertion failure with gpu server when changing models #241

Closed DmitriiMS closed 8 months ago

DmitriiMS commented 8 months ago

Hello. I use the GPU version of the Vosk server, and I would like to be able to switch between models on the fly, mainly EN (the one that comes with the Docker container) and RU (vosk-model-ru-0.42). I added a volume with the model to the Docker container, and it runs great with either model. I also modified asr_server_gpu.py so it can take a parameter and switch models based on it:

            if 'sample_rate' in jobj:
                sample_rate = float(jobj['sample_rate'])
            if 'model' in jobj:
                model_changed = True
                model = BatchModel(jobj['model'])
            continue

        # Create the recognizer; the word list is temporarily disabled since not every model supports it
        if not rec or model_changed:
            model_changed = False
            rec = BatchRecognizer(model, sample_rate)
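
For context, the switch is triggered from the client by adding the model key to the initial config message before streaming audio. A minimal client sketch of that (assuming the standard vosk-server WebSocket protocol; the URL, file name, and chunk size are only illustrative):

    import asyncio
    import json
    import wave

    import websockets

    async def recognize(audio_path, model_path, url="ws://localhost:2700"):
        wf = wave.open(audio_path, "rb")
        async with websockets.connect(url) as ws:
            # First message: config. 'sample_rate' is the standard key;
            # 'model' is the custom key handled by the modified server above.
            await ws.send(json.dumps({"config": {"model": model_path,
                                                 "sample_rate": wf.getframerate()}}))
            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                await ws.send(data)
                print(await ws.recv())    # partial results
            await ws.send('{"eof" : 1}')  # flush the final result
            print(await ws.recv())

    asyncio.run(recognize("test.wav", "model-ru/"))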

Switching models works fine a couple of times, but after switching from the EN model to the RU model I am almost guaranteed to receive this error:

INFO:root:Config {'model': 'model-ru/'}
INFO:root:Config {'sample_rate': 16000}
LOG ([5.5.1089~1-a25f2]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG ([5.5.1089~1-a25f2]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG ([5.5.1089~1-a25f2]:Collapse():nnet-utils.cc:1488) Added 1 components, removed 2
LOG ([5.5.1089~1-a25f2]:BatchModel():batch_model.cc:52) Loading HCLG from model-ru//graph/HCLG.fst
LOG ([5.5.1089~1-a25f2]:BatchModel():batch_model.cc:56) Loading words from model-ru//graph/words.txt
LOG ([5.5.1089~1-a25f2]:BatchModel():batch_model.cc:64) Loading winfo model-ru//graph/phones/word_boundary.int
LOG ([5.5.1089~1-a25f2]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.5.1089~1-a25f2]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG ([5.5.1089~1-a25f2]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.059 seconds taken in nnet3 compilation total (breakdown: 0.0353 compilation, 0.00284 optimization, 0.0184 shortcut expansion, 0.000457 checking, 8.7e-05 computing indexes, 0.00194 misc.) + 0 I/O.
ASSERTION_FAILED ([5.5.1089~1-a25f2]:GetBestPredecessor():cuda-decoder.cc:1088) Assertion failed: ((offset + i) < h_all_tokens_extra_prev_tokens_extra_and_acoustic_cost_[ichannel] .size())

[ Stack-Trace: ]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::MessageLogger::LogMessage() const+0x80e) [0x7f1876f5a38e]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x75) [0x7f1876f5ade5]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::CudaDecoder::GetBestPredecessor(int, int, int*, int*)+0x115) [0x7f1876af6df5]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::CudaDecoder::GeneratePartialPath(int, int)+0xa0) [0x7f1876af7700]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::CudaDecoder::ComputeH2HCopies()+0x46f) [0x7f1876afd4bf]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::CudaDecoder::AdvanceDecoding(std::vector<std::pair<int, float const*>, std::allocator<std::pair<int, float const*> > > const&)+0x355) [0x7f1876aff525]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::RunDecoder(std::vector<int, std::allocator<int> > const&, std::vector<bool, std::allocator<bool> > const&)+0xc9) [0x7f1876ae2569]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::DecodeBatch(std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<float*, std::allocator<float*> > const&, int, std::vector<int, std::allocator<int> > const&, std::vector<float*, std::allocator<float*> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<int, std::allocator<int> >*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*> >*, std::vector<bool, std::allocator<bool> >*)+0xee) [0x7f1876ae519e]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::DecodeBatch(std::vector<unsigned long, std::allocator<unsigned long> > const&, kaldi::Matrix<float> const&, std::vector<int, std::allocator<int> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*> >*, std::vector<bool, std::allocator<bool> >*)+0xcd) [0x7f1876ae52ed]
/usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so(kaldi::cuda_decoder::CudaOnlinePipelineDynamicBatcher::BatcherThreadLoop()+0x229) [0x7f1876af2ba9]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7f1839ab22b3]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f187b75eb43]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f187b7efbb4]

Aborted (core dumped)

I understand that this is a Kaldi error, but maybe I am doing something wrong with the Vosk server?

Using several Docker containers is difficult because they compete for the only GPU I have (it doesn't support MIG, and the other sharing options are hard to implement). Loading multiple BatchModels leads to a segmentation fault (which is expected). The non-GPU server works fine for this use case, but it's pretty slow.

Is there anything I can do to make this dynamic switching work? Should I take this issue to the Kaldi repo? Or is it better to implement a workaround, e.g. killing the server inside the container and relaunching it with the new model?
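
The relaunch workaround I have in mind would look roughly like this (only a sketch; VOSK_MODEL_PATH and VOSK_SERVER_PORT are placeholders for however asr_server_gpu.py actually picks up its model path and port):

    import os
    import subprocess

    def run_server(model_path, port=2700):
        # Start a fresh asr_server_gpu.py pointed at the given model.
        # The environment variable names below are assumptions; adjust them
        # to match how the server script is actually configured.
        env = dict(os.environ, VOSK_MODEL_PATH=model_path, VOSK_SERVER_PORT=str(port))
        return subprocess.Popen(["python3", "asr_server_gpu.py"], env=env)

    proc = run_server("model")       # EN model bundled with the container
    # ... later, to switch to Russian, kill the old instance and start a new one:
    proc.terminate()
    proc.wait()
    proc = run_server("model-ru/")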

nshmyrev commented 8 months ago

Switching between models is not going to be fast anyway. I'd rather emulate two cards on a single NVIDIA card and run two Docker containers. Something like:

https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html

DmitriiMS commented 8 months ago

It takes about 3-5 seconds to switch models, which was acceptable in my case. I figured out that I can run three servers, each on its own port, inside a single Docker container, and it works fine. Installing vGPU drivers would be harder, in my opinion. I'm closing the issue.
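
For anyone else landing here, the multi-port setup is roughly the following (a sketch only; the environment variable names are placeholders for however asr_server_gpu.py is configured in your container):

    import os
    import subprocess

    # One server process per model, each listening on its own port,
    # all sharing the single GPU inside the same container.
    servers = [("model", 2700), ("model-ru/", 2701)]

    procs = []
    for model_path, port in servers:
        env = dict(os.environ, VOSK_MODEL_PATH=model_path, VOSK_SERVER_PORT=str(port))
        procs.append(subprocess.Popen(["python3", "asr_server_gpu.py"], env=env))

    for p in procs:
        p.wait()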