alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

Raspberry Pi 4 compatible? Streaming interface? #82

Open fquirin opened 3 years ago

fquirin commented 3 years ago

Hi Nickolay,

it's good to see you active with a new ASR project :-) As far as I remember we talked a bit back in the Sphinx4 days when I was working on ILA ;-) The follow-up project of ILA is SEPIA and I'm currently trying to figure out if Vosk fits into the picture (spoiler: it looks like it does ^^). Could you answer two quick questions for me, please:

1. Does the Docker container work with the Raspberry Pi 4?
2. Does the Vosk server require a full wav file before it can start transcribing, or can it transcribe a stream?

Thanks for any info and keep up the great work! Florian

nshmyrev commented 3 years ago

Hi. Nice to get in touch again.

> Does the Docker container work with Raspberry Pi 4? I've recently compiled Kaldi on the Pi 4 to start some new experiments, but it took me three tries of 10 hours each, and I still need to clean up the Kaldi installation because it is now an awfully large ~10 GB Docker container :-/. It would be great if I could switch to the Vosk container :-)

You can cross-compile for the RPi; it is much faster. We are doing it here: https://github.com/alphacep/vosk-api/blob/master/travis/Dockerfile.dockcross

As for Docker, it doesn't work on ARM. Instead, you can install Vosk with pip, then clone and run the server.

We have a pull request though: https://github.com/alphacep/vosk-server/pull/55

> Does the Vosk server require a full wav file before it can start transcribing? Ideally I'd like to stream and transcribe the audio while the user is still speaking.

You can use streaming, yes.
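A minimal client sketch with the websockets package (the port, file name, and chunk size are just placeholders, adjust them to your setup):

import asyncio
import json
import wave

import websockets  # pip3 install websockets

async def transcribe(path, uri="ws://localhost:2700"):
    # stream a 16 kHz mono WAV file to the server and print its replies
    wf = wave.open(path, "rb")
    async with websockets.connect(uri) as ws:
        # optional configuration message understood by asr_server.py
        await ws.send(json.dumps({"config": {"sample_rate": wf.getframerate()}}))
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            await ws.send(data)           # raw PCM chunk
            print(await ws.recv())        # interim "partial" or final "result"
        await ws.send('{"eof" : 1}')      # tell the server the stream is finished
        print(await ws.recv())            # last final result

asyncio.run(transcribe("test.wav"))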

fquirin commented 3 years ago

Thanks for the info! I'll check out the links.

Is there any server example for streaming yet? I'm currently putting together a new JavaScript Web Audio library for SEPIA (coming soon) that will be a major improvement over the old one, including VAD and moving most of the processing into AudioWorklets and workers, and maybe I can already take some Vosk-specific requirements into account. The current SEPIA STT Server has a duplex WebSocket interface that can receive the audio stream, pass it to Kaldi, and send intermediate results, but it is based on a very old Microsoft ASR demo, and it would be great if we could find some common ground for the API :-)

To be more precise, I'm thinking about the implementation and message format for Web Speech API style events such as speech start, speech end, and interim/final results.

sskorol commented 3 years ago

@fquirin I've built kaldi/vosk on RPi4 in Docker before. See the above PR.

Regarding a streaming example, you may want to take a look at https://github.com/alphacep/vosk-server/blob/master/websocket/asr_server.py

Btw, which version of the RPi4 do you have? I mean the hardware. I tried the 4 GB RAM version, and the Vosk container consumed ~3 GB+ of RAM with the big model.

In general, it works well. But if you want top performance with GPU support, I'd recommend using an NVIDIA Jetson Nano or Xavier NX instead of the RPi4. There is a drawback though: on Jetson boards you can't yet build a Docker image with GPU support, as some libs are not supported yet.

fquirin commented 3 years ago

Hi @sskorol

I've had a look at 'asr_server.py' earlier, but I probably misinterpreted it a bit. I see now that it processes chunks of audio until it receives the "eof" message, correct? Besides that, I see a configuration message (config.phrase_list and config.sample_rate) and a result in an unknown format (I think I've seen some info about the result format but can't find it anymore :/). What I'd like to do is build an interface that is very similar to the Web Speech API. Part of it will be client-side (events: onaudiostart, onspeechstart, ...end), but for this to work the server should emit the events (or similar ones) mentioned in my last post. Are you planning to extend this?
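So I guess the initial configuration message would look something like this (the values are just placeholders):

{
  "config" : {
    "sample_rate" : 16000,
    "phrase_list" : ["turn on the light", "what time is it"]
  }
}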

About the RPi4: I have the 4 GB model. Your results are interesting; how big exactly is the "big model"? Maybe one can improve things a bit more by using pip, as Nickolay mentioned above? When you say "it works well", do you remember the approximate real-time factor? ^^

The Jetson Nano has only 2 GB of RAM as far as I know, so I guess the Xavier NX would be the best choice. [EDIT] I just realized that the Xavier NX is almost 400 € in Germany ... that basically kills it as an option :-/

sskorol commented 3 years ago

@fquirin yes, this particular example runs until an eof message is received (or you manually stop the server). Regarding the output format, you can check it by executing that Python script locally. But in general, it behaves similarly to Google ASR: it gives you interim and final transcripts in JSON format:

{
  "partial" : "zero one two"
}
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 0.750000,
      "start" : 0.570000,
      "word" : "zero"
    }, {
      "conf" : 1.000000,
      "end" : 1.350000,
      "start" : 1.170000,
      "word" : "one"
    }, {
      "conf" : 0.508337,
      "end" : 2.104042,
      "start" : 1.950000,
      "word" : "nine"
    }, {
      "conf" : 0.452666,
      "end" : 2.160000,
      "start" : 2.104042,
      "word" : "oh"
    }, {
      "conf" : 0.963673,
      "end" : 2.382485,
      "start" : 2.160000,
      "word" : "three"
    }, {
      "conf" : 0.701189,
      "end" : 2.520000,
      "start" : 2.382485,
      "word" : "four"
    }, {
      "conf" : 0.508553,
      "end" : 3.450000,
      "start" : 3.120000,
      "word" : "seven"
    }, {
      "conf" : 0.508553,
      "end" : 3.600000,
      "start" : 3.480000,
      "word" : "eight"
    }, {
      "conf" : 0.649236,
      "end" : 3.720000,
      "start" : 3.600000,
      "word" : "nine"
    }, {
      "conf" : 1.000000,
      "end" : 4.020000,
      "start" : 3.840000,
      "word" : "oh"
    }],
  "text" : "zero one nine oh three four seven eight nine oh"
}

So the final payload will always contain a result node.

In terms of broadcasting ASR events: you just need to check the official Python websockets docs to add the corresponding support. This repo contains a generic example, and I don't believe the repo owners will make it narrower.
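On your side, something roughly like this could map the two reply types to events (the event names are just placeholders):

import json

def classify(message):
    # map a vosk-server JSON reply to a Web-Speech-style event name
    payload = json.loads(message)
    if "partial" in payload:
        return "interim_result"   # hypothesis may still change
    if "result" in payload or "text" in payload:
        return "final_result"     # utterance is finalized
    return "unknown"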

All the models are listed on the official website with their sizes.

Regarding the Jetson Nano: I have a 4 GB version like this. Depending on your setup (if you need M.2 Wi-Fi, a metal case, a fan, etc.), the total price might differ, but in general it's ~$120-170. Not a big deal for a GPU board.

Also note that on the Nano the big Vosk model also consumes ~3.5 GB of RAM without Docker. However, it's incredibly fast. Plus, keep in mind that you'll need to build Kaldi/Vosk with GPU support manually for this platform.

fquirin commented 3 years ago

Great, thanks! I'll finish my work on the audio library and try to install Vosk on my RPi4 via pip to play around a bit. If this works out, I'll probably write a new Python server combining my old SEPIA STT server and the Vosk server :+1:

sskorol commented 3 years ago

@fquirin btw, here's a sample video based on Vosk / Spacy software + Jetson Xavier NX / Respeaker Core v2 / Matrix Voice hardware: https://youtu.be/IAASoRu2ANU

This example uses the RU model, but you might be interested to see the response speed while working in GPU mode. A similar response is expected on the Nano board, but on the RPi4 it'll be slower without a GPU.

fquirin commented 3 years ago

That's super responsive, awesome! Is it a smaller RU model?

sskorol commented 3 years ago

No, it's the big model (2.5 GB). Jetson boards have enough resources to handle big models (which are more accurate). On the RPi4 I tried both models; the small model consumes fewer resources, but it's less accurate.

In general, the GPU version of Kaldi/Vosk plus the big model is the optimal choice for Jetson boards. On the RPi4 you can only build the CPU version, which is slower in terms of response time.

fquirin commented 3 years ago

Ok, so this must have been the quickest installation of an open-source ASR system I have EVER seen!!! :astonished: :sunglasses: :scream: :grin:

sudo apt-get install python3-pip git libgfortran3
pip3 install vosk

And then I could already go ahead with the simple Python example.
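For reference, the simple example boils down to roughly this (the model directory and WAV path are just placeholders):

import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")  # 16 kHz mono 16-bit WAV
rec = KaldiRecognizer(Model("model"), wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())         # finalized segment
    else:
        print(rec.PartialResult())  # interim hypothesis
print(rec.FinalResult())            # flush the last segment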

How on earth can you keep the installation so small? :star_struck: Great work!!! I'm excited to experiment with Vosk!

fquirin commented 3 years ago

I think we can close this issue now ^^. The new WebSocket-based SEPIA STT-Server using Vosk is ready for action: https://github.com/SEPIA-Framework/sepia-stt-server and I'm successfully running it on a Raspberry Pi 4, even with 2 GB :smiley: :sunglasses:

sskorol commented 3 years ago

@fquirin I guess you tried a small model with a dynamic graph on the RPi4? I doubt 2 GB of RAM would be enough for large models.

fquirin commented 3 years ago

True, I'm using the Vosk RPi-optimized models (~40 MB) and the classic ZAMIA speech model with a custom LM (~40 MB as well). I have tried large models too, and the server didn't crash right away :laughing:, but as you said it's probably not very stable, especially when two users try to use it at the same time :see_no_evil:.

Btw, there is something on my to-do list I wanted to ask about. I haven't activated GPU support yet and I can't seem to find the example anymore :thinking:. Are there any instructions on how to use the GPU?

sskorol commented 3 years ago

@fquirin I created a repo for building Vosk on Jetson boards + PC with GPU in Docker.

Note that the PC scripts still have hardcoded versions; I need to polish them, but it's not a blocker. The actual instructions are accurate.

fquirin commented 3 years ago

Thanks :+1:. I quickly checked the code (asr_server.py) and I guess I need to translate this somehow into my code:

import concurrent.futures

from vosk import GpuInit, GpuInstantiate

GpuInit()  # global GPU initialization, once per process

def thread_init():
    GpuInstantiate()  # per-thread GPU setup for the worker pool

pool = concurrent.futures.ThreadPoolExecutor(initializer=thread_init)

Btw you have commented out GpuInit() in line 30 but use it in line 35. Is this by accident? ^^.

Is there any more info on GpuInit() and GpuInstantiate()? Assuming I'm already inside a thread for a specific user and don't use another thread pool, should I call both Gpu methods right away?

sskorol commented 3 years ago

@fquirin this example (except the GPU part) was copy-pasted from the Vosk repo. The commented block is just an example of the existing GPU-related API. However, in Python you don't need to call GpuInstantiate, as it's only for multithreaded environments.

BTW, it was changed in the recent version. And here are the cpp docs for these two.

Anyway, GpuInit is enough. Note that if you didn't build the sources with the HAVE_CUDA flag, these two methods are just stubs.

Also, you are not forced to use this example code; it's for demo purposes only. The Docker image bundles the Vosk API as if you had installed it locally, but with GPU support. So the only thing you need to do is call GpuInit on your app's start.
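Roughly like this (the model path and sample rate are placeholders):

from vosk import GpuInit, Model, KaldiRecognizer

GpuInit()  # once per process, before any model is created;
           # a no-op stub if Vosk was built without HAVE_CUDA

model = Model("model")
rec = KaldiRecognizer(model, 16000.0)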

fquirin commented 3 years ago

Thanks for the info and links!

> Note that if you didn't build the sources with the HAVE_CUDA flag, these two methods are just stubs

I'm using the 0.3.30 wheel files from the release page ... I assume they don't HAVE_CUDA? :thinking:

sskorol commented 3 years ago

None of the releases have CUDA support by default. Just build it on your own using the provided scripts / Dockerfiles, or create images based on the existing ones with CUDA support.

fquirin commented 3 years ago

Ok, ty. I'll put it on the to-do list, but I'll probably wait for some "official" CUDA wheels. I've just spent too much of my life building ASR systems, fighting build errors, fighting platform errors, trying to reduce size, etc. :sweat_smile:. The Vosk wheel files are such a welcome relief from all this :innocent:. Man, I still can't believe that my whole Docker image, including the small EN and DE models, is not even 300 MB :star_struck:.