alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0
7.75k stars 1.09k forks

Support full GPU decoding #74

Closed hairyone closed 2 years ago

hairyone commented 4 years ago

When I compile your app inside a Docker container without GPU support, everything works fine.

I have made a few changes so that Kaldi is compiled with GPU support and I am running the application inside a docker container with NVIDIA GPU support.

But when I run the GPU version I get the error below. Do you have any idea what the problem might be?

server --min-active=200 --max-active=6000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10 --endpoint.rule2.min-trailing-silence=0.5 --endpoint.rule3.min-trailing-silence=1.0 --endpoint.rule4.min-trailing-silence=2.0
LOG (server[5.5]:Model():model.cc:47) Sample rate is 8000
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (server[5.5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (server[5.5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (server[5.5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (server[5.5]:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0133111 seconds in looped compilation.
WARNING (server[5.5]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce GTX 1070 free:8022M, used:97M, total:8119M, free/total:0.987992
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.987992
LOG (server[5.5]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.987992
LOG (server[5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce GTX 1070  free:7834M, used:285M, total:8119M, free/total:0.964838 version 6.1
ERROR (server[5.5]:CopyToMat():cu-matrix.cc:464) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpy2DAsync(dst->Data(), dst_pitch, this->data_, src_pitch, width, this->num_rows_, cudaMemcpyDeviceToHost, cudaStreamPerThread)'

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f852b86c2de]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f852b458f3c]
/opt/kaldi-websocket/_kaldi_recognizer.so(void kaldi::CuMatrixBase<float>::CopyToMat<float>(kaldi::MatrixBase<float>*, kaldi::MatrixTransposeType) const+0x1ea) [0x7f852b78bb52]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::CuMatrix<float>::Swap(kaldi::Matrix<float>*)+0x12f) [0x7f852b78cba5]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::Matrix<float>::Swap(kaldi::CuMatrix<float>*)+0x12) [0x7f852b78cc24]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x5a3) [0x7f852b5bbe6f]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f852b5bc047]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f852b4c2173]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f852b4c2575]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f852b4c30c8]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f852b4a1e8f]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f852b457d8a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c29c4) [0x7f852b4909c4]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f852f8816ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f852f5b741d]

terminate called after throwing an instance of 'kaldi::KaldiFatalError'
nshmyrev commented 4 years ago

Hi

Kaldi decoders are not supposed to work with the GPU this way. Decoding is CPU-bound due to the search, so you won't get much of an advantage from using the GPU.

The only way to make it fast is to use the batched GPU decoder, but that is another story.

What goal are you actually trying to achieve?

hairyone commented 4 years ago

I added a function to init the GPU

void KaldiRecognizer::InitGpu() {
    kaldi::CuDevice::Instantiate().SelectGpuId("yes");
    kaldi::CuDevice::Instantiate().AllowMultithreading();
}

It is called here:

async def recognize(websocket, path):
    rec = KaldiRecognizer(model);
    rec.InitGpu()
    while True:
        message = await websocket.recv()
        response, stop = await loop.run_in_executor(pool, process_chunk, rec, message)
        await websocket.send(response)
        if stop: break

and here is the amended Makefile:

KALDI_ROOT ?= $(HOME)/kaldi

CXX := g++

ATLASLIBS := /usr/lib/libatlas.so.3 /usr/lib/libf77blas.so.3 /usr/lib/libcblas.so.3 /usr/lib/liblapack_atlas.so.3

KALDI_FLAGS := \
    -DKALDI_DOUBLEPRECISION=0 -DHAVE_POSIX_MEMALIGN \
    -Wno-sign-compare -Wno-unused-local-typedefs -Winit-self \
    -DHAVE_EXECINFO_H=1 -rdynamic -DHAVE_CXXABI_H -DHAVE_ATLAS \
    -I$(KALDI_ROOT)/tools/ATLAS/include \
    -I$(KALDI_ROOT)/tools/openfst/include -I$(KALDI_ROOT)/src

CUDA_FLAGS := \
    -DHAVE_CUDA=1 -I/usr/local/cuda/include

CXXFLAGS := -std=c++11 -g -Wall -DPIC -fPIC $(KALDI_FLAGS) $(CUDA_FLAGS) `pkg-config --cflags python3`

KALDI_LIBS = \
    -rdynamic -Wl,-rpath=$(KALDI_ROOT)/tools/openfst/lib \
    $(KALDI_ROOT)/src/online2/kaldi-online2.a \
    $(KALDI_ROOT)/src/decoder/kaldi-decoder.a \
    $(KALDI_ROOT)/src/ivector/kaldi-ivector.a \
    $(KALDI_ROOT)/src/gmm/kaldi-gmm.a \
    $(KALDI_ROOT)/src/nnet3/kaldi-nnet3.a \
    $(KALDI_ROOT)/src/tree/kaldi-tree.a \
    $(KALDI_ROOT)/src/feat/kaldi-feat.a \
    $(KALDI_ROOT)/src/lat/kaldi-lat.a \
    $(KALDI_ROOT)/src/hmm/kaldi-hmm.a \
    $(KALDI_ROOT)/src/transform/kaldi-transform.a \
    $(KALDI_ROOT)/src/cudamatrix/kaldi-cudamatrix.a \
    $(KALDI_ROOT)/src/matrix/kaldi-matrix.a \
    $(KALDI_ROOT)/src/fstext/kaldi-fstext.a \
    $(KALDI_ROOT)/src/util/kaldi-util.a \
    $(KALDI_ROOT)/src/base/kaldi-base.a \
    -L $(KALDI_ROOT)/tools/openfst/lib -lfst \
    $(ATLASLIBS) \
    `pkg-config --libs python3` \
    -lm -lpthread

CUDA_LIBS := \
    -Wl,-rpath=/usr/local/cuda/lib64 \
    -Wl,-rpath=/usr/lib/x86_64-linux-gnu \
    -L /usr/local/cuda/lib64 \
    -L /usr/lib/x86_64-linux-gnu \
    -lcublas -lcusparse -lcudart -lcurand -lcufft -lnvToolsExt -lcusolver

all: _kaldi_recognizer.so

_kaldi_recognizer.so: kaldi_recognizer_wrap.cc kaldi_recognizer.cc model.cc
    $(CXX) $(CXXFLAGS) -shared -o $@ kaldi_recognizer.cc model.cc kaldi_recognizer_wrap.cc $(KALDI_LIBS) $(CUDA_LIBS)

kaldi_recognizer_wrap.cc: kaldi_recognizer.i
    swig -threads -python -c++ -o kaldi_recognizer_wrap.cc kaldi_recognizer.i

clean:
    $(RM) *.so kaldi_recognizer_wrap.cc *.o *.pyc kaldi_recognizer.py
hairyone commented 4 years ago

I was told by a colleague that, by compiling Kaldi with GPU support, the matrix operations would be done in parallel on the GPU, yielding a marginal performance improvement or at least taking some of the load off the CPUs.

hairyone commented 4 years ago

Just for completeness, here is my Dockerfile:

# FROM ubuntu:16.04
# FROM debian:9.8
# FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# FROM nvidia/cuda:9.0-devel-ubuntu16.04
# FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
  FROM nvidia/cuda:10.2-devel-ubuntu16.04

################################################################################
# get the packages we need
################################################################################

RUN apt-get update \
&&  apt-get install -y --no-install-recommends \
       g++ make automake autoconf bzip2 unzip wget libtool git subversion \
       sox python2.7 python3 python3-dev python3-websockets pkg-config \
       zlib1g-dev patch libatlas-dev libxml2 ca-certificates swig \
       libatlas3-base vim \
&&  rm -rf /var/lib/apt/lists/*

################################################################################
# install cuda
################################################################################
# ARG CUDA_VERSION=cuda_8.0.61_375.26_linux-run
# ARG CUDA_VERSION=cuda_10.0.130_410.48_linux.run
# ARG CUDA_VERSION=cuda_10.1.243_418.87.00_linux.run
# ARG CUDA_VERSION=cuda_10.2.89_440.33.01_linux.run
# ADD ${CUDA_VERSION} /opt/cuda/${CUDA_VERSION}
# RUN cd /opt/cuda \
# &&  sh ${CUDA_VERSION} --silent --toolkit --samples \
# &&  rm ${CUDA_VERSION}

# ADD NVIDIA_CUDA-8.0_Samples /root/NVIDIA_CUDA-8.0_Samples

################################################################################
# compile and install kaldi
################################################################################
ADD kaldi-master /opt/kaldi

RUN cd /opt/kaldi \
&&  cd /opt/kaldi/tools \
&&  make -j $(nproc) \
&&  cd /opt/kaldi/src \
&&  ./configure --mathlib=ATLAS --shared \
&&  make depend -j $(nproc) \
&&  make -j $(nproc) online2 \
&&  find /opt/kaldi -name "*.o" | xargs rm

################################################################################
# compile the kaldi_recogniser shared library with python binding
################################################################################
ADD kaldi-websocket-python-master /opt/kaldi-websocket

RUN cd /opt/kaldi-websocket \
&&  KALDI_ROOT=/opt/kaldi make

# &&  cd /opt/kaldi/src \
# &&  make clean

################################################################################
# install the language model
################################################################################
ADD model-en-f1 /opt/kaldi-en/model

################################################################################
# server config
################################################################################
EXPOSE 2700
WORKDIR /opt/kaldi-websocket
CMD [ "python3", "./asr_server.py", "/opt/kaldi-en/model" ]
nshmyrev commented 4 years ago

It is not going to work this way because the search takes > 60% of the time on the CPU, and the GPU will just be waiting for the CPU to finish.

You need to wait till https://github.com/kaldi-asr/kaldi/pull/3568 lands in Kaldi; it is currently a work in progress.

If you need faster processing, it is more straightforward to tune the beams, compile with MKL, and use a smaller model.
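
For example, relative to the server invocation shown in the log above, beam tuning would mean lowering values like these (illustrative numbers only; they trade accuracy for speed):

server --min-active=200 --max-active=3000 --beam=10.0 --lattice-beam=4.0 \
       --acoustic-scale=1.0 --frame-subsampling-factor=3 ...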

hairyone commented 4 years ago

It is not going to work this way because the search takes > 60% of the time on the CPU, and the GPU will just be waiting for the CPU to finish.

You need to wait till kaldi-asr/kaldi#3568 lands in Kaldi; it is currently a work in progress.

If you need faster processing, it is more straightforward to tune the beams, compile with MKL, and use a smaller model.

Thanks for the response, Nick :)

Without me realising it, you have probably worked with my colleague Nazim. His comment was that he had added GPU support to the kaldi-gstreamer implementation and he did see a difference.

I prefer your implementation to the kaldi-gstreamer one, so I wanted to add GPU support and compare them side by side for a similar number of streams.

I was told that compiling Kaldi with GPU support was largely transparent to the user, in the sense that certain matrix operations would be moved to the GPU, so I am surprised by the error above.

nshmyrev commented 4 years ago

I see, cool! Greetings to Nazim!

Well, I suppose the issue you see is due to multithreaded access: there are multiple worker threads and the device needs some locking. GStreamer uses processes, so memory access is simpler.

I would start with the CUDA_LAUNCH_BLOCKING environment variable; it will probably fix the concurrency issue.
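
For example, with the asr_server.py entry point from the Dockerfile above:

CUDA_LAUNCH_BLOCKING=1 python3 ./asr_server.py /opt/kaldi-en/model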

I might look at this issue a little bit later.

hairyone commented 4 years ago

Thanks Nick,

I will pass on a hello to Nazim.

Regarding the thread access: at the moment I am only sending a single stream, in which case I would not expect to see any concurrency issues. Nevertheless, I will test with CUDA_LAUNCH_BLOCKING and report back.

Again, thanks for responding so promptly. If you do decide to investigate and need anything from me, please let me know.

nshmyrev commented 4 years ago

There is still a worker pool with many threads, and they can run simultaneously; see the Python code. I suspect that is the case. I'll let you know.
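
One quick way to test that is to pin the executor used by run_in_executor to a single worker, so every process_chunk call runs on the same thread (a sketch based on the server code above; `pool` is the executor passed to run_in_executor):

from concurrent.futures import ThreadPoolExecutor

# With a single worker thread, GPU work is never issued from two threads
# at the same time, which isolates the multithreading hypothesis.
pool = ThreadPoolExecutor(max_workers=1)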

hairyone commented 4 years ago

When I ran with CUDA_LAUNCH_BLOCKING there was more error info:

server --min-active=200 --max-active=6000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10 --endpoint.rule2.min-trailing-silence=0.5 --endpoint.rule3.min-trailing-silence=1.0 --endpoint.rule4.min-trailing-silence=2.0
LOG (server[5.5]:Model():model.cc:47) Sample rate is 8000
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (server[5.5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (server[5.5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (server[5.5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (server[5.5]:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0146899 seconds in looped compilation.
WARNING (server[5.5]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce GTX 1070 free:8022M, used:97M, total:8119M, free/total:0.987992
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.987992
LOG (server[5.5]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.987992
LOG (server[5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce GTX 1070  free:7834M, used:285M, total:8119M, free/total:0.964838 version 6.1
ERROR (server[5.5]:CopyRows():cu-matrix.cc:2691) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()'

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f3e6f2492de]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f3e6ee35f3c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::CuMatrixBase<float>::CopyRows(kaldi::CuMatrixBase<float> const&, kaldi::CuArrayBase<int> const&)+0x251) [0x7f3e6f145e5b]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0xb1f) [0x7f3e6ef88ab1]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f3e6ef89582]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f3e6ef98d74]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f3e6ef99047]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f3e6ee9f173]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f3e6ee9f575]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f3e6eea00c8]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f3e6ee7ee8f]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f3e6ee34d8a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c29c4) [0x7f3e6ee6d9c4]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f3e7325e6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3e72f9441d]

WARNING (server[5.5]:ExecuteCommand():nnet-compute.cc:436) Printing some background info since error was detected
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:437) matrix m1(50, 40), m2(3, 100), m3(46, 220), m4(46, 1024), m5(15, 4096), m6(15, 1024), m7(13, 3072), m8(13, 1024), m9(11, 3072), m10(11, 1024), m11(9, 3072), m12(9, 1024), m13(7, 3072), m14(7, 1024), m15(7, 1024), m16(7, 8850), m17(21, 40), m18(1, 100), m19(21, 220), m20(21, 1024), m21(7, 4096), m22(7, 3072), m23(7, 3072), m24(7, 3072), m25(7, 3072), m26(7, 1024), m27(7, 1024), m28(7, 8850), m29(21, 40), m30(1, 100), m31(21, 220), m32(21, 1024), m33(7, 4096), m34(7, 3072), m35(7, 3072), m36(7, 3072), m37(7, 3072), m38(7, 1024), m39(7, 1024), m40(7, 8850), m41(21, 40), m42(1, 100)
# The following show how matrices correspond to network-nodes and
# cindex-ids.  Format is: matrix = <node-id>.[value|deriv][ <list-of-cindex-ids> ]
# where a cindex-id is written as (n,t[,x]) but ranges of t values are compressed
# so we write (n, tfirst:tlast).
m1 == value: input[(0,-17:32)]
m2 == value: ivector[(0,-21), (0,0), (0,21)]
m3 == value: Tdnn_0_affine_input[(0,-16:29)]
m4 == value: Tdnn_0_affine[(0,-16:29)]
m5 == value: Tdnn_1_affine_input[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m6 == value: Tdnn_1_affine[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m7 == value: Tdnn_2_affine_input[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m8 == value: Tdnn_2_affine[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m9 == value: Tdnn_3_affine_input[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m10 == value: Tdnn_3_affine[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m11 == value: Tdnn_4_affine_input[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m12 == value: Tdnn_4_affine[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m13 == value: Tdnn_5_affine_input[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m14 == value: Tdnn_5_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m15 == value: Tdnn_pre_final_chain_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m16 == value: Final_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m17 == value: input[(0,33:53)]
m18 == value: ivector[(0,42)]
m19 == value: Tdnn_0_affine_input[(0,30:50)]
m20 == value: Tdnn_0_affine[(0,30:50)]
m21 == value: Tdnn_1_affine_input[(0,30), (0,33), (0,36), (0,39), (0,42), (0,45), (0,48)]
m22 == value: Tdnn_2_affine_input[(0,27), (0,30), (0,33), (0,36), (0,39), (0,42), (0,45)]
m23 == value: Tdnn_3_affine_input[(0,24), (0,27), (0,30), (0,33), (0,36), (0,39), (0,42)]
m24 == value: Tdnn_4_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m25 == value: Tdnn_5_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m26 == value: Tdnn_5_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m27 == value: Tdnn_pre_final_chain_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m28 == value: Final_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m29 == value: input[(0,54:74)]
m30 == value: ivector[(0,63)]
m31 == value: Tdnn_0_affine_input[(0,51:71)]
m32 == value: Tdnn_0_affine[(0,51:71)]
m33 == value: Tdnn_1_affine_input[(0,51), (0,54), (0,57), (0,60), (0,63), (0,66), (0,69)]
m34 == value: Tdnn_2_affine_input[(0,48), (0,51), (0,54), (0,57), (0,60), (0,63), (0,66)]
m35 == value: Tdnn_3_affine_input[(0,45), (0,48), (0,51), (0,54), (0,57), (0,60), (0,63)]
m36 == value: Tdnn_4_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m37 == value: Tdnn_5_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m38 == value: Tdnn_5_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m39 == value: Tdnn_pre_final_chain_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m40 == value: Final_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m41 == value: input[(0,75:95)]
m42 == value: ivector[(0,84)]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c0: m1 = user input [for node: 'input']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c1: m2 = user input [for node: 'ivector']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c2: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c3: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c4: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c5: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c6: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c7: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c8: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c9: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c10: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c11: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c12: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c13: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c14: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c15: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c16: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c17: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c18: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c19: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c20: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c21: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c22: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c23: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c24: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c25: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c26: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c27: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c28: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c29: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c30: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c31: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c32: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c33: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c34: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c35: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c36: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c37: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c38: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c39: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c40: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c41: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c42: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c43: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c44: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c45: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c46: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c47: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c48: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c49: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c50: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c51: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c52: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c53: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c54: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c55: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c56: [no-op-permanent]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c57: m3 = undefined(46,220)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c58: m3(0:45, 0:39) = m1(0:45, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c59: m3(0:45, 40:79) = m1(1:46, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c60: m3(0:45, 80:119) = m1(2:47, 0:39)
ERROR (server[5.5]:ExecuteCommand():nnet-compute.cc:443) Error running command c61: m3(0:45, 120:219).CopyRows(1, m2[0x16, 1x21, 2x9])

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f3e6f2492de]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f3e6ee35f3c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x13d1) [0x7f3e6ef89363]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f3e6ef89582]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f3e6ef98d74]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f3e6ef99047]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f3e6ee9f173]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f3e6ee9f575]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f3e6eea00c8]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f3e6ee7ee8f]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f3e6ee34d8a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c29c4) [0x7f3e6ee6d9c4]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f3e7325e6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3e72f9441d]

terminate called after throwing an instance of 'kaldi::KaldiFatalError'
  what():  kaldi::KaldiFatalError
hairyone commented 4 years ago

It is not going to work this way because the search takes > 60% of the time on the CPU, and the GPU will just be waiting for the CPU to finish.

You need to wait till kaldi-asr/kaldi#3568 lands in Kaldi; it is currently a work in progress.

If you need faster processing, it is more straightforward to tune the beams, compile with MKL, and use a smaller model.

Here's what Nazim said :)

Yes, the GPU will wait, but I don't understand what is wrong with that. It is just doing some of the computation to help. And the idea is to send a lot of requests to the GPU from different streams. If a matrix multiplication is done on the CPU in, let's say, 1 second but on the GPU in 0.1 seconds, you will do it in 0.9 seconds less time.

nshmyrev commented 4 years ago

This code is probably missing what is described in https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
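
In other words, every thread that issues CUDA work should select the device itself first. A minimal sketch of that idea (not code from this repo):

#include <cuda_runtime.h>

// Call at the start of every worker thread that touches the GPU.
// cudaSetDevice is cheap and per-thread; skipping it can leave a thread on
// the wrong (or no) device context, which shows up as errors like
// "an illegal memory access was encountered".
void BindThreadToGpu(int device_id) {
    cudaSetDevice(device_id);
}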

YunzhaoLu commented 4 years ago

(quotes hairyone's earlier comment with the InitGpu function, the recognize coroutine, and the amended Makefile)

It can work this way:

void KaldiRecognizer::InitGpu() {
    kaldi::CuDevice::Instantiate().SelectGpuId("yes");
    kaldi::CuDevice::Instantiate().AllowMultithreading();
}

I think the problem is here:

async def recognize(websocket, path):
    rec = KaldiRecognizer(model);
    rec.InitGpu()
    while True:
        message = await websocket.recv()
        response, stop = await loop.run_in_executor(pool, process_chunk, rec, message)
        await websocket.send(response)
        if stop: break

I had a problem before because "InitGpu" might be called multiple times. I just make sure that "InitGpu" is called only once!
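
For illustration, one way to make that robust is to guard the initialisation so repeated calls are harmless (a sketch, not the code used in this thread):

#include <mutex>

void KaldiRecognizer::InitGpu() {
    // Select the GPU exactly once, no matter how many recognizers call this.
    static std::once_flag gpu_once;
    std::call_once(gpu_once, []() {
        kaldi::CuDevice::Instantiate().SelectGpuId("yes");
        kaldi::CuDevice::Instantiate().AllowMultithreading();
    });
}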

hairyone commented 4 years ago

As suggested, I amended the code so that the GPU initialisation is called just once, before the server is started.

async def recognize(websocket, path):
    rec = KaldiRecognizer(model);
    while True:
        message = await websocket.recv()
        response, stop = await loop.run_in_executor(pool, process_chunk, rec, message)
        await websocket.send(response)
        if stop: break

gpu = Gpu()
gpu.Init()

start_server = websockets.serve(
    recognize, '0.0.0.0', 2700)

loop.run_until_complete(start_server)
loop.run_forever()
and the corresponding GPU initialisation code:
#include "gpu.h"

Gpu::Gpu() { }

void Gpu::Init() {
    kaldi::CuDevice::Instantiate().SelectGpuId("yes");
    kaldi::CuDevice::Instantiate().AllowMultithreading();
}

Gpu::~Gpu() { }

However, I still get the error below:

server --min-active=200 --max-active=6000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10 --endpoint.rule2.min-trailing-silence=0.5 --endpoint.rule3.min-trailing-silence=1.0 --endpoint.rule4.min-trailing-silence=2.0
LOG (server[5.5]:Model():model.cc:47) Sample rate is 8000
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (server[5.5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (server[5.5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (server[5.5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (server[5.5]:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0140841 seconds in looped compilation.
WARNING (server[5.5]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce GTX 1070 free:8022M, used:97M, total:8119M, free/total:0.987992
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.987992
LOG (server[5.5]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.987992
LOG (server[5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce GTX 1070  free:7834M, used:285M, total:8119M, free/total:0.964838 version 6.1
ERROR (server[5.5]:CopyRows():cu-matrix.cc:2691) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()'

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f91b4683282]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f91b426f1cc]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::CuMatrixBase<float>::CopyRows(kaldi::CuMatrixBase<float> const&, kaldi::CuArrayBase<int> const&)+0x251) [0x7f91b457fdff]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0xb1f) [0x7f91b43c2a55]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f91b43c3526]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f91b43d2d18]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f91b43d2feb]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f91b42d9117]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f91b42d9519]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f91b42da06c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f91b42b8e33]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f91b426e01a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c353f) [0x7f91b42a753f]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f91b86996ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f91b83cf41d]

WARNING (server[5.5]:ExecuteCommand():nnet-compute.cc:436) Printing some background info since error was detected
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:437) matrix m1(50, 40), m2(3, 100), m3(46, 220), m4(46, 1024), m5(15, 4096), m6(15, 1024), m7(13, 3072), m8(13, 1024), m9(11, 3072), m10(11, 1024), m11(9, 3072), m12(9, 1024), m13(7, 3072), m14(7, 1024), m15(7, 1024), m16(7, 8850), m17(21, 40), m18(1, 100), m19(21, 220), m20(21, 1024), m21(7, 4096), m22(7, 3072), m23(7, 3072), m24(7, 3072), m25(7, 3072), m26(7, 1024), m27(7, 1024), m28(7, 8850), m29(21, 40), m30(1, 100), m31(21, 220), m32(21, 1024), m33(7, 4096), m34(7, 3072), m35(7, 3072), m36(7, 3072), m37(7, 3072), m38(7, 1024), m39(7, 1024), m40(7, 8850), m41(21, 40), m42(1, 100)
# The following show how matrices correspond to network-nodes and
# cindex-ids.  Format is: matrix = <node-id>.[value|deriv][ <list-of-cindex-ids> ]
# where a cindex-id is written as (n,t[,x]) but ranges of t values are compressed
# so we write (n, tfirst:tlast).
m1 == value: input[(0,-17:32)]
m2 == value: ivector[(0,-21), (0,0), (0,21)]
m3 == value: Tdnn_0_affine_input[(0,-16:29)]
m4 == value: Tdnn_0_affine[(0,-16:29)]
m5 == value: Tdnn_1_affine_input[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m6 == value: Tdnn_1_affine[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m7 == value: Tdnn_2_affine_input[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m8 == value: Tdnn_2_affine[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m9 == value: Tdnn_3_affine_input[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m10 == value: Tdnn_3_affine[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m11 == value: Tdnn_4_affine_input[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m12 == value: Tdnn_4_affine[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m13 == value: Tdnn_5_affine_input[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m14 == value: Tdnn_5_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m15 == value: Tdnn_pre_final_chain_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m16 == value: Final_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m17 == value: input[(0,33:53)]
m18 == value: ivector[(0,42)]
m19 == value: Tdnn_0_affine_input[(0,30:50)]
m20 == value: Tdnn_0_affine[(0,30:50)]
m21 == value: Tdnn_1_affine_input[(0,30), (0,33), (0,36), (0,39), (0,42), (0,45), (0,48)]
m22 == value: Tdnn_2_affine_input[(0,27), (0,30), (0,33), (0,36), (0,39), (0,42), (0,45)]
m23 == value: Tdnn_3_affine_input[(0,24), (0,27), (0,30), (0,33), (0,36), (0,39), (0,42)]
m24 == value: Tdnn_4_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m25 == value: Tdnn_5_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m26 == value: Tdnn_5_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m27 == value: Tdnn_pre_final_chain_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m28 == value: Final_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m29 == value: input[(0,54:74)]
m30 == value: ivector[(0,63)]
m31 == value: Tdnn_0_affine_input[(0,51:71)]
m32 == value: Tdnn_0_affine[(0,51:71)]
m33 == value: Tdnn_1_affine_input[(0,51), (0,54), (0,57), (0,60), (0,63), (0,66), (0,69)]
m34 == value: Tdnn_2_affine_input[(0,48), (0,51), (0,54), (0,57), (0,60), (0,63), (0,66)]
m35 == value: Tdnn_3_affine_input[(0,45), (0,48), (0,51), (0,54), (0,57), (0,60), (0,63)]
m36 == value: Tdnn_4_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m37 == value: Tdnn_5_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m38 == value: Tdnn_5_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m39 == value: Tdnn_pre_final_chain_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m40 == value: Final_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m41 == value: input[(0,75:95)]
m42 == value: ivector[(0,84)]

LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c0: m1 = user input [for node: 'input']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c1: m2 = user input [for node: 'ivector']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c2: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c3: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c4: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c5: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c6: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c7: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c8: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c9: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c10: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c11: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c12: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c13: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c14: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c15: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c16: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c17: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c18: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c19: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c20: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c21: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c22: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c23: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c24: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c25: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c26: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c27: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c28: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c29: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c30: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c31: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c32: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c33: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c34: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c35: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c36: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c37: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c38: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c39: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c40: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c41: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c42: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c43: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c44: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c45: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c46: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c47: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c48: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c49: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c50: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c51: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c52: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c53: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c54: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c55: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c56: [no-op-permanent]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c57: m3 = undefined(46,220)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c58: m3(0:45, 0:39) = m1(0:45, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c59: m3(0:45, 40:79) = m1(1:46, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c60: m3(0:45, 80:119) = m1(2:47, 0:39)
ERROR (server[5.5]:ExecuteCommand():nnet-compute.cc:443) Error running command c61: m3(0:45, 120:219).CopyRows(1, m2[0x16, 1x21, 2x9])

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f91b4683282]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f91b426f1cc]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x13d1) [0x7f91b43c3307]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f91b43c3526]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f91b43d2d18]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f91b43d2feb]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f91b42d9117]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f91b42d9519]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f91b42da06c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f91b42b8e33]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f91b426e01a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c353f) [0x7f91b42a753f]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f91b86996ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f91b83cf41d]

terminate called after throwing an instance of 'kaldi::KaldiFatalError'
  what():  kaldi::KaldiFatalError
hairyone commented 4 years ago

Could this problem be related to the pointer to the model in memory not being accessible to the GPU?

nshmyrev commented 4 years ago

@hairyone I've just pushed https://github.com/alphacep/kaldi-websocket-python/tree/gpu which should make it work; please test. Requires Python 3.7, by the way.
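
To try it, checking out that branch should be enough (branch name taken from the URL above):

git clone -b gpu https://github.com/alphacep/kaldi-websocket-python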

hairyone commented 4 years ago

@nshmyrev Thanks!!

I will try it out and let you know how I get on.

nshmyrev commented 4 years ago

One day we need to support full GPU decoding, but not now.

adamreed90 commented 3 years ago

Is this still something that is planned to be brought into master?

nshmyrev commented 3 years ago

Is this still something that is planned to be brought into master?

We have it in our plans, of course, but there are no immediate plans; we will be busy with other things.

dgxlsir commented 3 years ago

(quotes the original issue description and the GPU error log / stack trace from the top of this thread)

Hello, I am doing the same thing as you: using the GPU to decode my model (nnet3-chain). Have you managed to get it working? I think that if we use the GPU for the Viterbi decoding it should be faster than the CPU. Could you give me some advice? Thank you very much!

nshmyrev commented 3 years ago

@dgxlsir looks the same as https://groups.google.com/forum/embed/#!topic/kaldi-help/zSNKoD5OHeU, might be an issue with the CUDA version (too old or too new).

basicasicmatrix commented 3 years ago

I wonder if a good fit for GPU support might be a Vosk custom backend for Nvidia's Triton Inference Server?

Triton is BSD-3 licensed, with some good groundwork already done by Nvidia's team for Kaldi inference. A Vosk twist on this could bring a lot of value! :-)

nshmyrev commented 3 years ago

@basicasicmatrix thanks, very useful link

sskorol commented 3 years ago

FYI, I've built 2 Docker images with GPU support for Jetson Xavier and Nano, based on Vosk 0.3.17. Works well so far.

GaetanLepage commented 3 years ago

Hello! I am currently using Vosk for a research project and was wondering whether GPU support will be available anytime soon.

Thank you guys anyway for your nice work :)

sskorol commented 3 years ago

@GaetanLepage hi, GPU itself is supported. You just need to build Kaldi/Vosk with a special flag. You can check this merged PR for details: https://github.com/alphacep/vosk-api/pull/436/files
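
Once you rebuild with CUDA, the basic usage barely changes. Roughly something like this (just a sketch; the model directory and WAV file are placeholders, and GpuInit() has to run before the model is created):

```python
import wave
from vosk import Model, KaldiRecognizer, GpuInit

GpuInit()  # initialize CUDA once, before loading any model

model = Model("model")            # path to an unpacked Vosk model (placeholder)
wf = wave.open("test.wav", "rb")  # 16 kHz mono 16-bit PCM audio (placeholder)
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
print(rec.FinalResult())
```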

GaetanLepage commented 3 years ago

Ok, thank you very much for the quick answer. I will give it a try :)

vonguyen1982 commented 3 years ago

How should I test GPU support with the NuGet package for C#?

sskorol commented 3 years ago

@vonguyen1982 I don't see corresponding methods that activate the GPU in the NuGet package. But you can create a PR with appropriate updates. Basically, you need to use this Python code as a reference and create the same API here and here. Then rebuild everything with the HAVE_CUDA flag and call GpuInit in the main thread of your app right after it starts. For multithreaded code, you also have to use GpuThreadInit. You can check the difference between these two here.

GaetanLepage commented 3 years ago

@sskorol Thanks to the Dockerfile, the PR, and the Vosk instructions I was able to make it work on my GPU!

I have two remaining questions/issues:

Thanks once again for your help!

nshmyrev commented 3 years ago

I called GpuInit() in my main application (before creating the processes) and GpuThreadInit() in each thread.

You should be using threads then, not processes. Or, if you still want to use processes, you can call GpuInit inside every process after the fork.
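
With threads it would look roughly like this (just a sketch, assuming a CUDA-enabled build; the model path and the WAV files passed as arguments are placeholders):

```python
import sys
import threading
import wave
from vosk import Model, KaldiRecognizer, GpuInit, GpuThreadInit

GpuInit()               # once, in the main thread, before any decoding starts
model = Model("model")  # shared model, path is a placeholder

def decode(path):
    GpuThreadInit()     # once at the start of every decoding thread
    wf = wave.open(path, "rb")
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    print(path, rec.FinalResult())

threads = [threading.Thread(target=decode, args=(p,)) for p in sys.argv[1:]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```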

GaetanLepage commented 3 years ago

@nshmyrev: I indeed tested both:

Performance is similar with both. Is there a way I could use several GPUs at the same time?

nshmyrev commented 3 years ago

Performance is similar with both.

Ok, and what is the problem then?

Is there a way I could use several GPUs at the same time ?

If you really want to use the GPU at full speed you need to use the Kaldi CUDA decoders, not Vosk. I wrote that above.

GaetanLepage commented 3 years ago

Ok, and what is the problem then?

Oh, nothing, I just wanted to remark that, on my system, running 16 parallel jobs is more or less equivalent to running the system in GPU mode. It was only an observation :)

If you really want to use the GPU at full speed you need to use the Kaldi CUDA decoders, not Vosk. I wrote that above.

All right! I will keep this in mind if I happen to need more of a speed-up.

For now, the convenience and usability of Vosk are really helping me! Thank you very much for developing this great tool!

vonguyen1982 commented 3 years ago

@nshmyrev Is it possible to add a method to enable GPU use in the C# NuGet package?

nshmyrev commented 3 years ago

@nshmyrev Is it possible to add a method to enable GPU use in the C# NuGet package?

Yes, sure, it is 2 lines ;)

sskorol commented 3 years ago

@vonguyen1982 a fix is already in master. You can check this PR for details: https://github.com/alphacep/vosk-api/pull/514

vonguyen1982 commented 3 years ago

Got it, thanks. I am using the latest NuGet package with .NET 5.0. I call Vosk.Vosk.GpuInit(); and Vosk.Vosk.GpuThreadInit(); but I do not see any GPU usage. Do I need a specific model, or which version of CUDA should I install? Thanks

sskorol commented 3 years ago

@vonguyen1982 you should build Vosk with GPU support on your own. Published versions don't use CUDA. But if you use Docker, you can check some prebuilt images for arm64/amd64.

vonguyen1982 commented 3 years ago

@nshmyrev At the moment the way I use the NuGet package with C# is very simple because I don't need to manage a Vosk server. I wonder if it can stay that simple when working with the GPU, rather than having to build Vosk on my own or use a Docker image, etc. Is that possible?

nshmyrev commented 2 years ago

Now we have it working. We might need to consider two Docker images: one for simple streaming, another for batch processing.