alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0
7.76k stars 1.09k forks

Running on Google Cloud #513

Closed jamesoliver1981 closed 3 years ago

jamesoliver1981 commented 3 years ago

Hello, I am using VOSK for my development and it works wonderfully on my local machine. I need to start building my pipeline, and I was looking at using GCS to host my data, workflow and app.

I know that when I call this locally I need to have a folder called model in my working directory to load the model. How can I do this on a cloud-based system? Additionally, I can import a pre-trained model, but I need to know what TensorFlow framework version it is built on, and the model artifacts need to be loaded too. Alternatively, I can load a Docker container, but I'm not sure where to start here. Basing this on this link:

I am new to cloud based computing so if my question is not applicable because of the way the structures work, please let me know, and how this works in a cloud environment.

I have also seen that there is a demo of running this in an app. I guess one alternative, could be for the app to process and send the data to the cloud for processing with the other data?

Thank you for your guidance, J

sskorol commented 3 years ago

Hi,

Well, it depends on many factors. You may want to start with the following questions:

No one can give you guidance w/o requirements. But I can say for sure that a GPU-powered Vosk server hosted on GCP might be quite expensive. A common Compute Engine instance with 4 CPUs / 8GB RAM / 1 GPU + an average SSD will cost ~$150-350/mo, depending on the machine class (preemptible vs regular). There's also the option of making a 1- or 3-year commitment, in which case there will be a discount. Depending on the model size you might also need a bucket. If you plan to use Docker with GPU, you have to build a corresponding image on your own with CUDA support. On the other hand, if a CPU version of Vosk is enough for you, it shouldn't be too expensive. Moreover, if you package a model into a Docker container, you won't even need a bucket.

In terms of the model folder, it's up to you where to put it, as it can be easily changed in code. Or you can map it to any host folder if you use Docker. But be aware that if you host a model separately, you can lose binaries w/o persistent disk usage (or if you don't configure model auto-deployment when your VM is restarted). BTW, a simple dockerized Vosk server can be found in the following repo.

P.S. You don't need TF to run Vosk.

jamesoliver1981 commented 3 years ago

Thank you for the extensive reply. To give more detail on my use case: I am planning to build a tennis app which will use accelerometer data together with audio to record the scores, so I'll use Dataflow and buckets for data. Whilst this will use models, the models will only be run on the data and are quite lightweight. I have used Vosk to extract keywords to determine the score. I have thus far used the German small model (vosk-model-small-de-0.15) with a specific grammar, with excellent results. Whilst my machine has a GPU, I did not enable CUDA, so unless it's a default option in the package I don't need a GPU.

This suggests to me that performance would be OK and accuracy should be OK. Budget does matter. I hope to get to 60-100k calls on the app per month, which should bring the average cost down.

I sadly don't have any experience with Docker or any devops skills. But I'm hearing it can work, which is great. I just need to learn some skills to get there. Any suggestions to get me there would be welcome, especially hands-on guidance on how to deploy the Vosk server on GCP.

solyarisoftware commented 3 years ago

Hi, my modest suggestion is to keep things simple :)

BTW

Note: using an offline ASR engine like Vosk in the cloud is feasible, but maybe a bit of a paradox :) You usually set up a speech recognition engine running offline (like the open-source Vosk) precisely because you don't want to use any cloud service (for data privacy reasons, for example). Anyway, your need for cloud virtualization makes sense, I admit.

My two cents, giorgio

jamesoliver1981 commented 3 years ago

Apologies for the lack of clarity around my architecture. The challenge is that I have not built anything on a server before, and I do not know what I do not know.

Initially I was going for Google as the end product is a mobile app and I saw there are tools there (Firebase) to create apps, and initially I saw they had a speech-to-text function. However, I found the results from the speech-to-text functionality were poor. Looking around I found Vosk, which works great, and so whilst yes it's offline, it's perfectly sufficient for my needs - or at least appears so currently.

I also read previously that AWS is always on and therefore more expensive. I had understood that GCP is live when called (except where data is stored in buckets, which will persist - I guess - again, new to this stuff).

To try to make my architecture needs clearer:

  • The intention is that the player plays his game and the movement and audio data syncs to the app and is uploaded to GCP (/ another cloud provider)

  • The data is manipulated, including the audio files being sliced to align with points (in GCP this would be Dataflow)

  • The audio files are then run through Vosk, which returns the text per line

  • The text is interpreted for scores

  • Analytics are built & fed into the app (for me currently a black box of how)

Thus far on my own machine I have used a variant of this code. However, as suggested, I am looking into Docker and using the vosk-server models under the docker folder. It appears to call the 0.6 model, which didn't give me such great results locally, as the grammar didn't work. I hope once I have it running on GCP (or wherever) I can test to see if the smaller version works for me.

I initially wrote TensorFlow as that is the first option that came up on Google and I assumed the model was based on it. As mentioned, I think locally I am running without engaging my GPU so I should be OK.

Having looked around at docker information, I believe what I need is

That should allow me to create the docker image to load to GCP... then I just need to figure out how to run models on GCP.

Does that sound reasonable and like starting simple? I have built a bunch of code for the analysis but I will extend that later.

Very much appreciate your feedbacks and support

PS this suggests GPU is an option.

solyarisoftware commented 3 years ago

Thanks for the clarifications. OK, so you are using the Vosk server as a Docker container.

initially I saw they had a speech-to-text function. However, I found the results from the speech-to-text functionality were poor.

This statement is weird. Are you referring to the GCP speech-to-text API service? In which language? German? In my experience Google ASR is honestly pretty good (in English / Italian) in accuracy and latency. BTW, the service is also pretty tunable for specific input sentences / specific application needs.

Vosk is great too, but maybe in your case you can avoid the devops scaling-configuration headaches (deploying Docker containers...) by instead just using the Google ASR APIs, with a reasonable cost trade-off.

That should allow me to create the docker image to load to GCP... then I just need to figure out how to run models on GCP. Does that sound reasonable and like starting simple?

Not really (for me :))

Having looked around at docker information, I believe what I need is

Sorry I can't help with docker config stuff.

BTW, you may not need expensive Google Cloud services. Maybe you can save money with a competing/cheaper cloud provider. Wouldn't a well-sized simple VPS be enough for your application?

sskorol commented 3 years ago

Apologies for the lack of clarity around my architecture. The challenge is that I have not built anything on a server before, and I do not know what I do not know.

Initially I was going for Google as the end product is a mobile app and I saw there are tools there (Firebase) to create apps, and initially I saw they had a speech-to-text function. However, I found the results from the speech-to-text functionality were poor. Looking around I found Vosk, which works great, and so whilst yes it's offline, it's perfectly sufficient for my needs - or at least appears so currently.

I also read previously that AWS is always on and therefore more expensive. I had understood that GCP is live when called (except where data is stored in buckets which will persist - I guess - again new to this stuff).

In terms of trying to make my architecture needs clearer.

  • The intention is that the player plays his game and the movement and audio data syncs to the app and is uploaded to GCP (/ another cloud provider)

  • The data is manipulated, including the audio files being sliced to align with points (in GCP this would be Dataflow)

  • The audio files are then run through Vosk, which returns the text per line

  • The text is interpreted for scores

  • Analytics are built & fed into the app (for me currently a black box of how)

Thus far on my own machine I have used a variant of this code

However, as suggested, I am looking into Docker and using the vosk-server models under the docker folder. It appears to call the 0.6 model, which didn't give me such great results locally, as the grammar didn't work. I hope once I have it running on GCP (or wherever) I can test to see if the smaller version works for me.

I initially wrote TensorFlow as that is the first option that came up on Google and I assumed the model was based on it. As mentioned, I think locally I am running without engaging my GPU so I should be OK.

Having looked around at docker information, I believe what I need is

  • my adjusted version of this as my_app

  • This as the dockerfile

  • Then the requirements which are partially on the same repo

That should allow me to create the docker image to load to GCP... then I just need to figure out how to run models on GCP.

Does that sound reasonable and like starting simple? I have built a bunch of code for the analysis but I will extend that later.

Very much appreciate your feedbacks and support

PS this suggests GPU is an option.


The Dockerfile you mentioned already has a bundled model, so you shouldn't be bothered about where to store it in the cloud. You'll just upload an image and that's pretty much it.

In regards to the referenced code, it reads a file passed via a CLI arg, but the model path is hardcoded. Even if you apply the same argument-based approach for the model, it won't be a good cloud strategy, even though it works great locally. In the cloud you usually use environment variables for such constants. Moreover, you have to decide whether you plan to handle wav files or an audio stream. And again, a cloud-based environment assumes you have a web server.
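For illustration, the environment-variable approach could look like this (the variable name `VOSK_MODEL_PATH` is my own choice here, not an official Vosk convention):

```python
import os

def resolve_model_path(default="model"):
    """Read the model directory from an environment variable,
    falling back to a local ./model folder for development."""
    return os.environ.get("VOSK_MODEL_PATH", default)

# The recognizer code would then use, e.g.:
#   model = Model(resolve_model_path())
# and the container would be started with something like:
#   docker run -e VOSK_MODEL_PATH=/opt/models/de ...
```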

BTW, vosk-server images already have a bundled websocket server, which accepts an audio stream and returns a transcript. So when you build and run it, it's ready to receive client connections and audio chunks.
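A rough sketch of what such a client could look like (the port and the `{"eof": 1}` terminator follow the vosk-server examples; treat the details as assumptions and check the vosk-server repo for the real client code):

```python
import asyncio
import json
import wave

def wav_chunks(path, frames=4000):
    """Yield raw PCM chunks from a wav file, one readframes() call at a time."""
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(frames)
            if not data:
                break
            yield data

async def transcribe(uri, wav_path):
    """Stream a wav file to a running vosk-server websocket endpoint."""
    import websockets  # third-party: pip install websockets
    async with websockets.connect(uri) as ws:
        for chunk in wav_chunks(wav_path):
            await ws.send(chunk)
            print(await ws.recv())      # partial/final results as JSON
        await ws.send(json.dumps({"eof": 1}))
        print(await ws.recv())          # final result

# With a vosk-server container running locally:
# asyncio.run(transcribe("ws://localhost:2700", "point.wav"))
```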

jamesoliver1981 commented 3 years ago

Thank you for the reply @sskorol. I have follow-up questions... I know my request is generic, but it comes from not knowing much about cloud computing.

The Dockerfile you mentioned already has a bundled model, so you shouldn't be bothered about where to store it in the cloud. You'll just upload an image and that's pretty much it.

I understand this is the dockerfile. I understood that within that container I need to create an app.py which is how the model would be used. I took that from this page. The code that follows was along the lines of what I planned to put in app.py

In regards to the referenced code, it reads a file passed via a CLI arg, but the model path is hardcoded. Even if you apply the same argument-based approach for the model, it won't be a good cloud strategy, even though it works great locally. In the cloud you usually use environment variables for such constants.

Could you elaborate as to why that won't be a good cloud strategy? I linked to several pieces of code. For clarity, here is how I am using it so far.

import os
import wave
import json
from vosk import Model, KaldiRecognizer

def text_from_audio_v5(path, file, lang, location):
    # Switch to the folder that contains the "model" directory
    os.chdir("D:/OneDrive/DataSci/Tennis/02_Preprocessing/Voice/VSOK/" + location)
    wf = wave.open(path + file, "rb")
    model = Model("model")

    # Restrict recognition to a fixed grammar of score-related words
    if lang == "Deutsch":
        rec = KaldiRecognizer(model, wf.getframerate(), '["gewinner","beide","spiel","eins", "null","fehler","fünfzehn","dreizig", "vierzig","einstand", "zwei", "drei","vier" ,"fünf", "sechs","sieben","acht", "neun", "voteil ruck", "vorteil auf", "zweite", "erste", "aufschlag"]')
    elif lang == "English":
        rec = KaldiRecognizer(model, wf.getframerate(),
                '["love", "fifteen","thirty","forty","deuce", "mistake","winner","double fault", "second serve", "let"," my advantage","your advantage","all","game", "fifteen all","fifteen love", "love fifteen","fifteen thirty","thirty fifteen", "thirty love", "love thirty", "love forty", "fifteen forty","thirty forty", "forty thirty", "forty fifteen", "forty love" ]')
    else:
        raise ValueError("Unsupported language: " + lang)

    # Feed the wav file to the recognizer in chunks and collect the text
    results = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            results.append(json.loads(rec.Result())['text'])
    results.append(json.loads(rec.FinalResult())['text'])

    return str(results)

Can I not do something similar / something that would achieve the same result? Why is it not a good idea?

Moreover, you have to decide whether you plan to handle wav files or an audio stream. And again, a cloud-based environment assumes you have a web server.

Here I plan to pass chunks of audio, which will be wav files. These files will be loaded to GCP buckets, which fulfills the web server requirement, no?

BTW, vosk-server images already have a bundled websocket server, which accepts an audio stream and returns a transcript. So when you build and run it, it's ready to receive client connections and audio chunks.

I'm a little confused here. This suggests to me that you are saying this will work, but that my proposed solution won't be good. Could you elaborate on what you mean, please?

My current plan is to dockerise the vosk-server file in the repo, with the code above, which would then generate output into another file to interpret the text. Do you think this wouldn't work, and is there another solution you would recommend?
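As a toy illustration of the "interpret the text" step, the transcript could be mapped to points with a small keyword table (the word list and logic below are a hypothetical, reduced version of the grammar above):

```python
# Map recognized German score words to numeric tennis points
# (hypothetical subset of the grammar used above; "dreizig" spelled
# as in the grammar list).
SCORE_WORDS = {
    "null": 0,
    "fünfzehn": 15,
    "dreizig": 30,
    "vierzig": 40,
}

def parse_score(transcript):
    """Extract the first two score words, e.g. 'fünfzehn null fehler' -> (15, 0)."""
    points = [SCORE_WORDS[w] for w in transcript.split() if w in SCORE_WORDS]
    return tuple(points[:2]) if len(points) >= 2 else None
```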

sskorol commented 3 years ago

You are mixing up 2 different approaches. Reading wav files from the local file system is not the same as streaming audio chunks via websockets. The code you've shared just reads local files. The official vosk-server doesn't use files; it consumes an audio stream from the network. If you use Docker, working with the file system won't be so easy and straightforward (especially in a cloud), as containers have an isolated environment, disks are not always persistent, there are different storage types, location always matters, etc.

The other link you've shared (the Flask code) uses REST, which is not a media streaming protocol. Of course you can send media data via REST, but it's not intended for real-time streaming.

I re-read your app's requirements, and it seems to me you're designing it the wrong way. If you plan to collect audio data in real time directly from the player, you should use a streaming protocol. It won't be efficient to save files first and then try to read and transcribe them. It's better to stream the audio directly to the Vosk server, apply your business logic, and only after that save the file asynchronously to cloud storage (if you really need files).

On the other hand, lots of people forget one important thing while designing voice apps: the microphone and the environment. A bad microphone in a noisy environment may sink the whole app's idea. Did you test your code in a real environment, with the real device?

solyarisoftware commented 3 years ago

Can I not do something similar / something that would achieve the same result?

Reading your code, @jamesoliver1981: you are simply not using the Vosk server (in the Docker container); instead you are using the Vosk API in a Python program, transcribing a wav file.

Why is it not a good idea? For me that (your code) is not necessarily a bad approach, but see below:

My current plan is to dockerise the vosk-server file in the repo, with the code above, which would then generate output into another file to interpret the text.

This goal seems unrelated to the code you posted above. Let's recap:

  1. Vosk-API (create your own server): the basic idea is that you can deploy your program above on a cloud or local server, but you have to create a server "on top" of it. You can get inspiration from reading Vosk-Server.

    BTW, I'm implementing an experimental demo server in Node.js. See: https://github.com/solyarisoftware/voskJs/tree/master/examples#transcript-http-server

    You are not obliged to put your server solution in a Docker container (or please tell me why you want to do this :-)). See https://github.com/alphacep/vosk-api/ (this repo!) for details.

  2. Vosk-Server (just interface it): you can deploy the Vosk Docker image (again, on a cloud or a local private server), and in this case you have to use one of the provided interfaces (websocket, gRPC, etc.). So, if your speech "is" a WAV file, you have to convert it to a buffer stream and submit the stream to the Vosk-Server. See https://github.com/alphacep/vosk-server for details.

jamesoliver1981 commented 3 years ago

@sskorol In terms of a few of your comments:

On the other hand, lots of people forget one important thing while designing voice apps: the microphone and the environment. A bad microphone in a noisy environment may sink the whole app's idea. Did you test your code in a real environment, with the real device?

Absolutely correct. I have tried this out live. This is the reason I'm determined to make Vosk work. I originally tried Google Speech-to-Text with the recordings from the tennis court, wind and background noise included, and it was awful. When I use Vosk with a grammar, it works pretty much 100%. I don't mind paying for a service, but like I said, Google didn't work, and AssemblyAI is great for English but doesn't have German. Speechmatics is much more expensive.

I re-read your app's requirements, and it seems to me you're designing it the wrong way. If you plan to collect audio data in real time directly from the player, you should use a streaming protocol.

This might just be semantics, but this won't be live streaming. I will upload one large wav file, so I would call this batch processing. It needs to be cut up to match where the points take place so the score can be applied at the right time. The files wouldn't be needed afterwards; I thought I would save them for processing and then delete them once they had been used, therefore incurring only a small cost (this is an assumption on my part). Of course, if you have a better solution, I am more than happy to hear it.

As an example of how the output looks right now: here you can see a point, and the text is the output of Vosk (screenshot). I then extract each word and run a bunch of logic on this.

You are mixing up 2 different approaches. Reading wav files from the local file system is not the same as streaming audio chunks via websockets. The code you've shared just reads local files. The official vosk-server doesn't use files; it consumes an audio stream from the network. If you use Docker, working with the file system won't be so easy and straightforward (especially in a cloud), as containers have an isolated environment, disks are not always persistent, there are different storage types, location always matters, etc.

I need to use Docker, as that is the only way to load this model to GCP. I think the crux of the issue is that I have no experience with Docker or GCP thus far, and I resolved yesterday to try out some simple things first to actually see how GCP works with models I build in GCP, before I move on to trying to make Vosk work (smaller steps rather than one big one). If there is an example of how an official Vosk server works, that would help me tremendously as a guiding principle.

solyarisoftware commented 3 years ago

I tend to agree with your points @sskorol, but...

You are mixing up 2 different approaches. Reading wav files from the local file system is not the same as streaming audio chunks via websockets. The code you've shared just reads local files. The official vosk-server doesn't use files; it consumes an audio stream from the network.

Right. But let's think about the global architecture and performance. In general, we probably all agree that, to minimize latency, websocket networking is a better approach than reading files from disk, but we have to consider the whole chain, from the devices to the servers. An architecture where the ASR is, for example, a websocket-interfaced Vosk Server on a remote cloud system does not necessarily give you better latency than a server system running, for example, on your local (on-premise) machines with a fast disk.

So my question to @jamesoliver1981 is: can you explain the speech data flow you foresee, from the device (a mobile phone) to the ASR?

If you use Docker, working with the file system won't be so easy and straightforward (especially in a cloud), as containers have an isolated environment, disks are not always persistent, there are different storage types, location always matters, etc.

Absolutely.

The other link you've shared (the Flask code) uses REST, which is not a media streaming protocol. Of course you can send media data via REST, but it's not intended for real-time streaming.

Well, that's true but debatable. Sending binary data in the body of an HTTP request is not exactly smart, but not necessarily evil. It depends on the speech size in bytes, etc. Also, as far as I understand, the application does not require "real-time streaming".

I re-read your app's requirements, and it seems to me you're designing it the wrong way. If you plan to collect audio data in real time directly from the player, you should use a streaming protocol. It won't be efficient to save files first and then try to read and transcribe them. It's better to stream the audio directly to the Vosk server, apply your business logic, and only after that save the file asynchronously to cloud storage (if you really need files).

I partially agree. Application requirements/behaviours are not clear.

On the other hand, lots of people forget one important thing while designing voice apps: the microphone and the environment. A bad microphone in a noisy environment may sink the whole app's idea. Did you test your code in a real environment, with the real device?

Audio I/O is a constant of the problem :-)

solyarisoftware commented 3 years ago

@jamesoliver1981

1/ Accuracy:

This is the reason I'm determined to make Vosk work. I originally tried Google Speech-to-Text with the recordings from the tennis court, wind and background noise included, and it was awful.

While I'm a supporter of the great Vosk, and in general a bit anti-Google... I have to be fair as an engineer and, as I said, I'm really perplexed by the general claims of bad accuracy using the Google Speech-to-Text ASR (I didn't try German, I admit). I'd be happy to read your test/benchmark results, and to know how you tested and concluded that it is "awful". BTW, "wind and background noise" would be a constant in your tests across different ASRs.

When I use Vosk with a grammar, it works pretty much 100%. I don't mind paying for a service, but like I said, Google didn't work, and AssemblyAI is great for English but doesn't have German. Speechmatics is much more expensive.

What do you mean by "Vosk with grammar"? The large model "vosk-model-de-0.6"?

2/ To docker or not to docker:

I need to use Docker, as that is the only way to load this model to GCP. I think the crux of the issue is that I have no experience with Docker or GCP thus far, and I resolved yesterday to try out some simple things first to actually see how GCP works with models I build in GCP, before I move on to trying to make Vosk work (smaller steps rather than one big one). If there is an example of how an official Vosk server works, that would help me tremendously as a guiding principle.

As far as I know you are not obliged to deploy Docker containers on GCP. For example, if you use virtual private servers it's just an option. Quoting from https://cloud.google.com/compute/docs/instances#introduction:

Compute Engine instances can run the public images for Linux and Windows Server that Google provides as well as private custom images that you can create or import from your existing systems. You can also deploy Docker containers, which are automatically launched on instances running the Container-Optimized OS public image.

3/ To GCP or not to GCP: last but not least, for your needs you can probably use any cloud provider on planet Earth (preferably in a nearby location). Thank God, Google is not the only option ;-)

jamesoliver1981 commented 3 years ago

Thanks for the replies @solyarisoftware .

You are not obliged to put your server solution in a Docker container (or please tell me why you want to do this :-)).

When loading an external model to GCP, it needs to be in a container - detailed here

Also, as far as I understand, the application does not require "real-time streaming". Can you explain the speech data flow you foresee, from the device (a mobile phone) to the ASR?

Correct, I do not need real-time streaming.
The flow is:

  • The player plays the match
  • Sensor data and audio data are uploaded at the end of the match
  • Sensor data identifies the shots and points
  • Audio is cut to align with those points so that the scores and outcomes can be aligned
  • Audio is converted to text and aligned to assign the outcome to the behaviour of the point

Vosk-Server (just interface it): you can deploy the Vosk Docker image (again, on a cloud or a local private server), and in this case you have to use one of the provided interfaces (websocket, gRPC, etc.). So, if your speech "is" a WAV file, you have to convert it to a buffer stream and submit the stream to the Vosk-Server. See https://github.com/alphacep/vosk-server for details.

I looked into a VPS having seen your previous reply, and having done a little research I think yes, it's possible, but not something I know how to start with; plus GCP seems to be the more scalable solution. My app will likely be created using Firebase and will send the data to GCP for processing.
However, I think what you are suggesting is that I, at a minimum, do the audio processing (using Vosk to extract the text) on a private server so that I don't need to put it into a container. I would then need to feed that output back into GCP for finalising the data processing and returning information to the mobile app. Correct? Thank you for the inspiration. Let me know if I have understood your recommended architecture correctly.
I am still not clear on how the VPS solution would work, but that comes from my lack of knowledge in working with remote servers. I guess I can ask more once you confirm that I understand your architecture correctly.

jamesoliver1981 commented 3 years ago

What do you mean by "Vosk with grammar"? The large model "vosk-model-de-0.6"?

No. I was pointed to 0.15 by a fellow collaborator

I'd be happy to read your test/benchmark results, and to know how you tested and concluded that it is "awful". BTW, "wind and background noise" would be a constant in your tests across different ASRs.

I used this to run some tests. Both English and German.

As an example: the returned value should be "fünfzehn null fehler" (15-0, mistake). Google returns something quite different (screenshot), while the Vosk API returned it perfectly (the second line in the screenshot).

This happens consistently. I get better results using Vosk, which is open source, than paying Google for nonsense.

Docker: if I want to load an external model to GCP, then it needs to be in a Docker container.

3/ To GCP or not to GCP:

Who would you recommend? As I guess is clear, this is my first foray into cloud computing and I need support to get it up and running. There is lots of documentation on Google, but it's not entirely clear to a noob like me. I thought of using GCP over AWS as I understood that Google can be active when required, rather than always on. Happy to be corrected.

solyarisoftware commented 3 years ago

When loading an external model to GCP, it needs to be in a container - detailed here

But wait, this is a solution for particular requirements. IMHO you don't need all this to deploy a Vosk server solution on the above cloud service. What you need, at least in a first phase, is maybe just a (single) VPS (virtual private server), well sized (cores, RAM, etc.) for your app. Let's start simple!

The flow is

  • The player plays the match
  • Sensor data and audio data are uploaded at the end of the match
  • Sensor data identifies the shots and points
  • Audio is cut to align with those points so that the scores and outcomes can be aligned
  • Audio is converted to text and aligned to assign the outcome to the behaviour of the point

Well, this is the application workflow. What is not clear to me is how you capture the speech on your device. The end user presses a button, and afterward what happens? Is the speech recorded on the phone as a file and then submitted to a server? How is the file sent?

However, I think what you are suggesting is that I, at a minimum, do the audio processing (using Vosk to extract the text) on a private server so that I don't need to put it into a container. I would then need to feed that output back into GCP for finalising the data processing and returning information to the mobile app. Correct?

Correct. This is the no-brainer solution I'd propose to you. It's what I meant by "start simple" in my first comment :-)

I am still not clear on how the VPS solution would work, but that comes from my lack of knowledge in working with remote servers. I guess I can ask more once you confirm that I understand your architecture correctly.

It's trivial:

  1. Test your solution in a "dev environment": your local server (it could even be a desktop computer).
  2. Create a well-sized VPS anywhere on the cloud, if you don't want your own hardware server for any good reason.
  3. "Copy" your application code to the VPS.

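For step 3, to keep the copied application running across reboots, the process is typically wrapped in a service. A minimal sketch of a systemd unit, assuming a hypothetical /opt/tennis-app layout with an app.py entry point:

```ini
# /etc/systemd/system/vosk-transcriber.service  (hypothetical name and paths)
[Unit]
Description=Vosk transcription service
After=network.target

[Service]
WorkingDirectory=/opt/tennis-app
ExecStart=/usr/bin/python3 /opt/tennis-app/app.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

It would then be enabled with `systemctl enable --now vosk-transcriber`.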
solyarisoftware commented 3 years ago

What do you mean by "Vosk with grammar"? The large model "vosk-model-de-0.6"?

No. I was pointed to 0.15 by a fellow collaborator

Interesting, thanks for the link.

I used this to run some tests. Both English and German.

Well, this is not a good test, sorry, for many reasons:

  • It depends on the audio (mic, etc.) of your PC.
  • It depends on (minor) demo web interface issues; it's a demo, probably, not the cloud service you would pay for.
  • Most important: you are not using the same source speech (files) as your real application.

A better test to do, in my modest opinion, is this:

  1. Collect real speech recordings from your APPLICATION. Use these files as a FIXED test set of N files, chosen smartly in terms of acoustic diversity... (N >> 10, bigger is better).
  2. Create a simple test program to submit your test-set files to the GCP Speech-to-Text API
  3. Create a simple test program to submit your test-set files to the Vosk API (you already wrote the program)
  4. Create a simple test program to submit your test-set files to ANY other ASR you want to compare

Collect the results, maybe using WER (word error rate) or any more complex metric you prefer.
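WER itself needs no external tooling; a minimal stdlib sketch (word-level edit distance divided by reference length) would be:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)
```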

jamesoliver1981 commented 3 years ago

What is not clear to me is how you capture the speech on your device. The end user presses a button and... ?

Currently I am using a small microphone which the player wears whilst playing; I simply upload the data and merge it afterwards. I need to have the hardware produced. Or do you mean something else?

Thank you for the comments on it being unnecessary to load via Docker. I will look into setting up my own server. Great that it seems so simple to you; I need to learn about setting that up, but given your recommendation it seems a good approach.

It depends on the audio (mic, etc.) of your PC... It depends on (minor) demo web interface issues; it's a demo, probably, not the cloud service you would pay for. Most important: you are not using the same source speech (files) as your real application.

No, this is the same file that is being tested, and that file is from the court. Perhaps the demo differs from the cloud service, but should it not also give me confidence in its ability to perform the task? Your testing approach is much more robust. However, unless I cannot make Vosk work, there is no need for me to do this comparison.

solyarisoftware commented 3 years ago

No, this is the same file that is being tested, and that file is from the court.

I see. Better. But I would insist to use the GCP ASR API, not the web interface.

However unless I cannot make VOSK work, there is no need for me to do this comparison.

But you already did the Vosk API program :)

To measure the accuracy of different systems, you don't need to set up any server architecture.

jamesoliver1981 commented 3 years ago

Once I've got VOSK running on a server, I will run some tests to generate those outputs. VOSK really improved once I used the grammar function, so maybe that is a requirement for GCP ASR too. But I want to get this VOSK speech element nailed first.

Do you have a recommendation for who to use rather than Google? I chose them for being on demand and the easy link to Firebase (assuming that this is a good way to create an app).

sskorol commented 3 years ago

As @jamesoliver1981 uses a custom vocabulary, it'll always be more accurate than any existing ASR trained on a general domain. Model adaptation works only on small models with a dynamic graph though.
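The "grammar function" mentioned above is Vosk's phrase-list constraint: `KaldiRecognizer` accepts a JSON array of allowed phrases as an optional third argument (small models with dynamic graphs only, as noted). Building that grammar string is plain JSON; the phrase list here is purely illustrative:

```python
import json

# Illustrative vocabulary for a racket-sport scoring app; "[unk]" lets the
# recognizer reject anything outside the grammar instead of forcing a match.
phrases = ["serve", "forehand winner", "backhand error", "[unk]"]
grammar = json.dumps(phrases)
print(grammar)

# Sketch of how it would be passed (requires the vosk package and a small model):
# from vosk import Model, KaldiRecognizer
# model = Model("model")
# rec = KaldiRecognizer(model, 16000, grammar)
```

Constraining the search space this way is why accuracy jumps on a narrow domain, and it is conceptually the same trick as Google's "model adaptation" discussed below in the thread.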

Of course, no one forces you to use Docker. But dockerized apps are much easier to maintain in the cloud, as you don't pollute your host OS with tons of dependencies and configuration. Moreover, without Docker you also need to either make disk snapshots with auto-recovery scripts (as common virtual disks don't persist and can be replaced at any time), or buy physical storage, which also requires backups.

With Docker you get such snapshots out of the box for free. You can't break anything or introduce conflicts with other system components. An isolated environment is always more robust: you can easily scale and extend it, easily integrate your service with any other dockerized components, and create a Kubernetes cluster.
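For reference, the dockerized path can be sketched with the stock vosk-server image. The image name comes from the vosk-server project; the in-container model path is an assumption about the image layout, so verify both against the current README before relying on them:

```shell
# Run the prebuilt CPU English server; it exposes a websocket on port 2700.
docker run -d -p 2700:2700 alphacep/kaldi-en:latest

# To serve your own model instead, mount it over the bundled one
# (/opt/vosk-model-en/model is assumed; check the image's documented layout).
docker run -d -p 2700:2700 \
    -v /path/to/your/model:/opt/vosk-model-en/model \
    alphacep/kaldi-en:latest
```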

Also, if you plan on using a GPU at some point in the future, I don't believe you want to build Kaldi and Vosk manually on the host OS. 😉

solyarisoftware commented 3 years ago

Do you have a recommendation for who to use rather than Google? I chose them for being on demand and the easy link to Firebase (assuming that this is a good way to create an app).

I'd go with anyone. In Europe, I use OVH, which has a datacenter in Germany, for example. But warning! Sometimes OVH's datacenters are set on fire ;-)

solyarisoftware commented 3 years ago

As @jamesoliver1981 uses a custom vocabulary, it'll always be more accurate than any existing ASR trained on a general domain. Model adaptation works only on small models with a dynamic graph though.

For sure!

But please note that since March 2021, Google ASR optionally allows you to personalize the pretrained language models using custom vocabularies, which they call "model adaptation". See:

With Docker you get such snapshots out of the box for free. You can't break anything or introduce conflicts with other system components. An isolated environment is always more robust: you can easily scale and extend it, easily integrate your service with any other dockerized components, and create a Kubernetes cluster.

Maybe the final choice. But first of all, evaluate your business traffic estimates and the technical trade-offs of scaling up with Docker clusters and Kubernetes. ;-)

I'd suggest @jamesoliver1981 choose the target ASR technology, going one step at a time:

Option 1: Use Vosk opensource!

Option 2: Use Google ASR API (tuning it to improve accuracy for your custom context).

Both options have dev/ops cost and performance pros and cons ;-) The main reason you would opt for Vosk is PRIVACY, which is maybe not your main requirement for this kind of application.

jamesoliver1981 commented 3 years ago

Hi @solyarisoftware, Following your prompt, I've done a little more research on starting with a VPS, and I see it is recommended by multiple people to start small. The main thing I want to understand before I start down a road is how the app communicates with the VPS. As I've written above, there is quite some data being sent back and forth, and I've not been able to find anything about how to do this after a couple of hours of web search. If I were to go with your simple solution, my steps would be:

  1. Set up my own server for testing (I can find tutorials on doing this).
  2. Run my code above, and all the other code around it, on that server to test it works. (Whilst I see I can put my functions into git and pull them on the server, I want the final function to return a summary output table that I/the app can do analysis on. Here I don't get how I would access that output table.)
  3. Develop an initial app that sends data to the server and returns one chart (I can find tutorials on creating an app, but not on how to link it to a VPS/server).
  4. Move to a VPS for the initial part of the business.
  5. Once it is clear this app will take over the world, move to Docker and scale.
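The "access that output table" step above is just an HTTP round trip: the app POSTs audio, the server responds with the table as JSON. A minimal stdlib sketch (endpoint, field names, and table contents are all made up for illustration; a real service would run Vosk on the upload before building the table):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class SummaryHandler(BaseHTTPRequestHandler):
    """Accepts a POSTed audio upload and returns a JSON summary table."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)  # a real service would transcribe this with Vosk
        table = {"bytes_received": len(audio),
                 "rows": [{"word": "example", "count": 1}]}  # placeholder analysis output
        body = json.dumps(table).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# To serve: HTTPServer(("0.0.0.0", 8080), SummaryHandler).serve_forever()
```

The app then parses that JSON and renders its chart; no shared filesystem or special VPS linkage is needed beyond a reachable port.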

Do you have any recommendations or links for those areas which are missing above? I want to start with the end in mind, so I want to know how I can build the app linked to this before I start.

I really appreciate the effort both of you are putting into helping me get started here. This is so incredibly helpful. I know I keep asking questions, but I am really at the beginning, and your guidance is very helpful in understanding options and next steps.

solyarisoftware commented 3 years ago

your "how to deploy" sequence makes sense for me.

You finally asked the main question from the design perspective:

The main thing I want to understand before I start down a road is how the app communicates with the VPS.

Some solution paths were already suggested in my previous comment, and by @sskorol too.

Your app has to submit the audio to the server, so you have to decide how. There are some options:

  1. Your app sends (uncompressed WAV?) audio buffers via websockets (in this case you will probably be happy using Vosk-Server on the server, as is).
  2. Your app sends (compressed) audio streams to the server. How? (websockets / socket.io / HTTP POST body...). Decompress the received files to WAV (use ffmpeg or whatever you want), and in this case you can use, server side, almost the same program script you already wrote.

In theory the first option is better, but please pay attention to the audio format in which the device sends data to the server: you probably want/have to send compressed data (e.g. OPUS, NOT WAV) to minimize network bandwidth, but you have to submit an uncompressed (WAV) binary stream to the Vosk Server (to be verified). If so, you have to decompress the received data (e.g. from OPUS to WAV) and submit the final WAV to the Vosk ASR.

Take inspiration from: https://github.com/alphacep/vosk-server/tree/master/websocket
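The clients in those examples stream fixed-size PCM chunks over the websocket. A sketch of just the chunking side, using the stdlib `wave` module (the 4000-frame chunk size follows the common example value; adjust as needed):

```python
import wave

def pcm_chunks(path: str, frames_per_chunk: int = 4000):
    """Yield raw PCM chunks from a WAV file, ready to send over a websocket."""
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data  # each chunk would be one websocket send
```

Pairing this generator with a websocket client (as in the linked vosk-server examples) gives you incremental results while the file is still uploading, instead of waiting for the whole recording.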

solyarisoftware commented 3 years ago

@jamesoliver1981 could you close the issue? :-)

jamesoliver1981 commented 3 years ago

I am still working on getting this to work. Once I have, I want to document a how-to for those who might want to do this in the future, and I would close with that. I hope I can do that in the next 2 weeks.

jamesoliver1981 commented 3 years ago

Sorry for the delay. I'm closing this now. I never got this to run on GCP; uploading the API to interpret the inputs didn't work for me. A VPS is my current solution, and it works seamlessly.

Caet-pip commented 1 year ago

Hello @sskorol, could you expand on the Docker solution? I am trying to run VOSK as an ASR service in Azure for telephony purposes but am not sure where to start with this approach, or whether it is a good implementation for the business case mentioned (telephony analytics in call centers). Could you please guide me or point me to any online guides for doing the same? I am aware of deploying ML models locally and am just starting to research deploying one as an endpoint for a client. Thanks!

sskorol commented 1 year ago

Hi @Caet-pip. I don't have an example for Azure, but you can check my repo, which covers GCP deployment via Terraform. It might be a good starting point for you.