alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

vosk-server-gpu Segmentation fault (core dumped) #224

Open cdgraff opened 1 year ago

cdgraff commented 1 year ago

Hi! Can you help me figure out what I'm doing wrong? After some transcriptions I get a Segmentation fault (core dumped).

I send 30-second audio chunks to transcribe, one after the other. In some cases we split the work across multiple workers, as you can see below (here using 3 workers).

The path on the server is created dynamically so that it is unique per chunk.
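For reference, this is roughly what our client does per chunk (a minimal Python sketch; the path format, chunk size and file name are just ours for illustration, while the config/eof messages are the standard vosk-server websocket protocol):

```python
import asyncio
import json
import uuid

import websockets  # same library the server uses

SERVER = "ws://localhost:2700"  # placeholder address

async def transcribe_chunk(chunk_path):
    # One websocket connection per 30-second chunk, on a dynamically unique path.
    uri = f"{SERVER}/{uuid.uuid4()}"
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"config": {"sample_rate": 16000, "words": 1}}))
        with open(chunk_path, "rb") as f:  # raw 16 kHz 16-bit mono PCM assumed
            while data := f.read(8000):
                await ws.send(data)
                print(await ws.recv())        # intermediate result
        await ws.send(json.dumps({"eof": 1}))
        print(await ws.recv())                # final result for this chunk

asyncio.run(transcribe_chunk("chunk-000.raw"))
```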

root@f66db52da9e0:/opt/vosk-server/websocket-gpu-batch# python3 ./asr_server_gpu.py 
WARNING ([5.5.1089~1-a25f2]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG ([5.5.1089~1-a25f2]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG ([5.5.1089~1-a25f2]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): Tesla T4  free:14791M, used:118M, total:14910M, free/total:0.992023
LOG ([5.5.1089~1-a25f2]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.992023
LOG ([5.5.1089~1-a25f2]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG ([5.5.1089~1-a25f2]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.992023
LOG ([5.5.1089~1-a25f2]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: Tesla T4   free:14455M, used:454M, total:14910M, free/total:0.969489 version 7.5
LOG ([5.5.1089~1-a25f2]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG ([5.5.1089~1-a25f2]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG ([5.5.1089~1-a25f2]:BatchModel():batch_model.cc:52) Loading HCLG from model/graph/HCLG.fst
LOG ([5.5.1089~1-a25f2]:BatchModel():batch_model.cc:56) Loading words from model/graph/words.txt
LOG ([5.5.1089~1-a25f2]:BatchModel():batch_model.cc:64) Loading winfo model/graph/phones/word_boundary.int
LOG ([5.5.1089~1-a25f2]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.5.1089~1-a25f2]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
server listening on 0.0.0.0:2700
INFO:websockets.server:server listening on 0.0.0.0:2700
connection open
INFO:websockets.server:connection open
INFO:root:Connection from ('34.30.88.55', 49442)
INFO:root:Config {'words': 1, 'sample_rate': 16000}
connection closed
INFO:websockets.server:connection closed
connection open
INFO:websockets.server:connection open
INFO:root:Connection from ('34.30.88.55', 49456)
INFO:root:Config {'words': 1, 'sample_rate': 16000}
connection open
INFO:websockets.server:connection open
INFO:root:Connection from ('35.224.62.142', 60480)
INFO:root:Config {'words': 1, 'sample_rate': 16000}
connection open
INFO:websockets.server:connection open
INFO:root:Connection from ('34.134.42.203', 40700)
INFO:root:Config {'words': 1, 'sample_rate': 16000}
connection closed
INFO:websockets.server:connection closed
Segmentation fault (core dumped)
root@f66db52da9e0:/opt/vosk-server/websocket-gpu-batch# gdb python3 core 

Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
Reading symbols from /usr/lib/debug/.build-id/14/8e086667839ef13939196984d6f717c331bd76.debug...

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
[New LWP 2564]
[New LWP 2563]
[New LWP 2568]
[New LWP 2046]
[New LWP 2049]
[New LWP 2562]
[New LWP 2565]
[New LWP 2566]
[New LWP 2045]
[New LWP 2567]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python3 ./asr_server_gpu.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc915ea7b36 in BatchRecognizer::PushLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >&, float) () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
[Current thread is 1 (Thread 0x7fc67e5d2640 (LWP 2564))]
(gdb) 
(gdb) bt
#0  0x00007fc915ea7b36 in BatchRecognizer::PushLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >&, float) () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
#1  0x00007fc915eb9f81 in kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::FinalizeDecoding(int) () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
#2  0x00007fc915eae5a5 in kaldi::cuda_decoder::ThreadPoolLightWorker::Work() () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
#3  0x00007fc8d8cb22b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fc91ab2eb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x00007fc91abbfbb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
cdgraff commented 1 year ago

Hi @nshmyrev, do you have any advice? Is something wrong in my setup? I tested with the CPU server, same model and same code, and it works without issue, but with the GPU server I hit the same crash in every test. Thanks in advance!

nshmyrev commented 1 year ago

Do you close the connection before receiving the results, without sending eof? I need to reproduce this somehow.

The "connection closed" message worries me.

fdipilla commented 1 year ago

Hi @nshmyrev, I'm working with @cdgraff on this particular implementation. We are using a Node.js PassThrough stream to read the ffmpeg output and feed it to the Vosk server via websocket. Our logs from Node look something like this:

starting
sending chunk
... <- a bunch of chunks
sending chunk
sending chunk
sending eof
closing websocket

Let me know if this answers your question. Thanks!
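
To rule out the ordering @nshmyrev asks about above (closing before the final result is read), the tail of that flow, written in Python for consistency with the server, would look like this (a sketch; the helper name is ours):

```python
import json

async def finish_stream(ws):
    # Send eof, then wait for the server's final result *before* the socket closes.
    await ws.send(json.dumps({"eof": 1}))
    return await ws.recv()
```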

GianvitoBono commented 3 months ago

Hi!

I'm having the same issue, but I'm using the Python lib. I'm running asr_server_gpu.py from this repo in Docker (using this image: alphacep/kaldi-vosk-server-gpu:latest).

From my debugging, the problem occurs when we start to close the websocket connection and FinishStream() of the BatchRecognizer gets called.
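
For context, the handler is roughly this shape (a simplified sketch, not the exact asr_server_gpu.py code; the JSON handling is abbreviated):

```python
import json
from vosk import BatchModel, BatchRecognizer

model = BatchModel()  # loads from the local "model" directory, as in the log above

async def recognize(websocket, path):
    rec = BatchRecognizer(model, 16000.0)
    while True:
        message = await websocket.recv()
        if isinstance(message, str) and "eof" in message:
            break
        rec.AcceptWaveform(message)
        # ... collect rec.Result() and send it back to the client ...
    rec.FinishStream()  # <- the SIGSEGV in PushLattice fires inside this call
```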

Here is the error:

[Thread 0x73ae8288c640 (LWP 1751) exited]
[Thread 0x73ae5affd640 (LWP 1752) exited]
LOG ([5.5.1089~1-a25f2]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
[New Thread 0x73ae5affd640 (LWP 1753)]
[New Thread 0x73ae8288c640 (LWP 1754)]
[New Thread 0x73ae80888640 (LWP 1755)]
[New Thread 0x73ae8188a640 (LWP 1756)]
[New Thread 0x73b012fde640 (LWP 1757)]
[New Thread 0x73ae8208b640 (LWP 1758)]
[New Thread 0x73ae81089640 (LWP 1759)]
[New Thread 0x73ae5bfff640 (LWP 1760)]
[New Thread 0x73ae5b7fe640 (LWP 1761)]
[New Thread 0x73ae58db1640 (LWP 1762)]
[New Thread 0x73ae3cb69640 (LWP 1763)]
INFO:websockets.server:server listening on 0.0.0.0:2700
INFO:websockets.server:connection open
INFO:root:Connection from ('10.36.2.192', 35730)
INFO:root:Config {'sample_rate': 16000}
INFO:websockets.server:connection closed

Thread 523 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x73ae80888640 (LWP 1755)]
0x000073b08a937b36 in BatchRecognizer::PushLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >&, float) () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so

Here the backtrace took from gdb:

(gdb) bt
#0  0x000073b08a937b36 in BatchRecognizer::PushLattice(fst::VectorFst<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, fst::VectorState<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> >, std::allocator<fst::ArcTpl<fst::CompactLatticeWeightTpl<fst::LatticeWeightTpl<float>, int> > > > >&, float) () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
#1  0x000073b08a949f81 in kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::FinalizeDecoding(int) () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
#2  0x000073b08a93e5a5 in kaldi::cuda_decoder::ThreadPoolLightWorker::Work() () from /usr/local/lib/python3.10/dist-packages/vosk-0.3.45-py3.10.egg/vosk/libvosk.so
#3  0x000073b04d6b22b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000073b08f5bdac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x000073b08f64ea04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

If we delete the FinishStream() call the server works, but memory usage grows really fast and never goes down; I think it's because the memory used by the recognizer is never released.

I tried to implement the same thing starting from the C++ server (but using the batch model and recognizer), and the same error occurs.

Can you help me solve this issue?

Thanks!

nshmyrev commented 3 months ago

There is a race condition in Kaldi here:

https://github.com/kaldi-asr/kaldi/blob/master/src/cudadecoder/batched-threaded-nnet3-cuda-online-pipeline.cc#L574

I'll try to fix it in the coming days.

GianvitoBono commented 3 months ago

Wonderful! Thanks for the fast reply :)