alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

Websocket-gpu-batch crashing with segmentation fault: #169

Closed vinodhian closed 1 year ago

vinodhian commented 2 years ago

System details:

CUDA: 11.6
GCC: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
GPU memory: 16GB
RAM: 64GB
CPU cores: 16
Python: 3.7.4
OS: redhat:enterprise_linux:7.9:GA:server
Audio sample rate: 16000
Audio size: 10MB
Language: English
Model used: vosk-model-en-us-0.22
websockets==10.2

I compiled and installed kaldi and vosk-api for GPU manually, following all the steps in https://github.com/alphacep/vosk-api/blob/master/travis/Dockerfile.manylinux.

Note:

  1. While running the websocket-gpu-batch example, the server side (asr_server_gpu) crashes with a segmentation fault after the websocket is closed.
  2. No error is observed on the client side (test.py).
  3. The segmentation fault occurs after transcription is completed and the websocket is closed (as can be seen in the logs below).
  4. I do not get this error when testing with the test16k.wav file.
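For anyone reproducing notes 1 and 3: the client simply streams the wav file to the server in small raw PCM chunks and the crash happens when the server finalizes decoding after the stream ends. A minimal sketch of that chunking step, using only the stdlib wave module (the 4000-frame chunk size is an assumption, not necessarily the value test.py uses):

```python
import wave

def wav_chunks(path_or_file, chunk_frames=4000):
    """Yield raw PCM chunks from a mono 16 kHz wav file, the way a
    streaming client would feed them to the recognizer.

    chunk_frames=4000 is 0.25 s of audio at 16 kHz (an assumed value).
    """
    with wave.open(path_or_file, "rb") as wf:
        # The GPU batch server expects mono 16 kHz input in this setup.
        assert wf.getnchannels() == 1 and wf.getframerate() == 16000
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            yield data
```

Each yielded chunk is what would be sent as one binary websocket message before the final EOF message.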

Please find the error details below:

[harmony@awn-hmaiml-dr00 vosk]$ python37 asr_server_gpu_orig.py
WARNING ([5.5.1027~2-59386]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG ([5.5.1027~2-59386]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG ([5.5.1027~2-59386]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): Tesla T4  free:14792M, used:117M, total:14910M, free/total:0.99209
LOG ([5.5.1027~2-59386]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.99209
LOG ([5.5.1027~2-59386]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG ([5.5.1027~2-59386]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.99209
LOG ([5.5.1027~2-59386]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: Tesla T4   free:14432M, used:477M, total:14910M, free/total:0.967946 version 7.5
LOG ([5.5.1027~2-59386]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG ([5.5.1027~2-59386]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG ([5.5.1027~2-59386]:BatchModel():batch_model.cc:52) Loading HCLG from model/graph/HCLG.fst
LOG ([5.5.1027~2-59386]:BatchModel():batch_model.cc:56) Loading words from model/graph/words.txt
LOG ([5.5.1027~2-59386]:BatchModel():batch_model.cc:64) Loading winfo model/graph/phones/word_boundary.int
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.5.1027~2-59386]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
INFO:root:Listening on 0.0.0.0:8010
INFO:websockets.server:server listening on 0.0.0.0:8010
INFO:websockets.server:connection open
INFO:root:Connection from ('127.0.0.1', 44290)
INFO:root:Config {'sample_rate': 16000}
INFO:websockets.server:connection closed
Segmentation fault
[harmony@awn-hmaiml-dr00 vosk]$ 

Wav file properties of test16k.wav, which works properly:

[harmony@awn-hmaiml-dr00 vosk]$ soxi test16k.wav

Input File     : 'test16k.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:08.31 = 132928 samples ~ 623.1 CDDA sectors
File Size      : 266k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM 

Wav file properties of HSC_20210915_123429_3.wav, for which segmentation fault occurs:

[harmony@awn-hmaiml-dr00 vosk]$ soxi HSC_20210915_123429_3.wav

Input File     : 'HSC_20210915_123429_3.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:05:27.70 = 5243200 samples ~ 24577.5 CDDA sectors
File Size      : 10.5M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
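The two files have identical formats and differ only in length, which points away from a format mismatch. The fields soxi reports can also be checked with Python's stdlib wave module (a minimal sketch; wav_info is a hypothetical helper, not part of vosk):

```python
import wave

def wav_info(path_or_file):
    """Report the fields soxi shows for a PCM wav file:
    (channels, sample_rate, bits_per_sample, duration_seconds)."""
    with wave.open(path_or_file, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return (wf.getnchannels(), rate,
                wf.getsampwidth() * 8, frames / rate)
```

For example, wav_info('test16k.wav') should agree with the soxi output above (1 channel, 16000 Hz, 16-bit, ~8.31 s).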

Kindly shed some light on this issue.

nshmyrev commented 2 years ago

Please run the server under gdb like this:

gdb --args python3 ./asr_server_gpu.py

and collect stacktrace.

vinodhian commented 2 years ago

Please run the server under gdb like this:

gdb --args python3 ./asr_server_gpu.py

and collect stacktrace.

Please find the backtrace of the segmentation fault below:

INFO:root:Listening on 0.0.0.0:8010
INFO:websockets.server:server listening on 0.0.0.0:8010
INFO:websockets.server:connection open
INFO:root:Connection from ('127.0.0.1', 55712)
INFO:root:Config {'sample_rate': 16000}
INFO:websockets.server:connection closed
Thread 2058 "python37" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffc18994700 (LWP 6796)]
0x00007fffeb03b0fc in kaldi::WordAlignLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > > const&, kaldi::TransitionInformation const&, kaldi::WordBoundaryInfo const&, int, fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >*) ()
   from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
Missing separate debuginfos, use: debuginfo-install nvidia-driver-latest-dkms-cuda-libs-510.47.03-1.el7.x86_64
(gdb) backtrace
#0  0x00007fffeb03b0fc in kaldi::WordAlignLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > > const&, kaldi::TransitionInformation const&, kaldi::WordBoundaryInfo const&, int, fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >*) ()
   from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
#1  0x00007fffeaefc60a in BatchRecognizer::PushLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >&, float) ()
   from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
#2  0x00007fffeaefd400 in ?? () from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
#3  0x00007fffeaf09ba3 in kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::FinalizeDecoding(int) ()
   from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
#4  0x00007fffeaeff34d in kaldi::cuda_decoder::ThreadPoolLightWorker::Work() ()
   from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
#5  0x00007fffec2fc8e0 in ?? () from /usr/local/lib/python3.7/site-packages/vosk-0.3.32-py3.7.egg/vosk/libvosk.so
#6  0x00007ffff798fea5 in start_thread (arg=0x7ffc18994700) at pthread_create.c:307
#7  0x00007ffff6fafb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

nshmyrev commented 2 years ago

We have just updated the image. Please try again and let us know if it is still broken.

nshmyrev commented 1 year ago

Feel free to reopen if needed. It should be fixed now.