alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0
7.43k stars 1.04k forks

Clearing input buffer before capturing new data #1399

Open prevoste opened 1 year ago

prevoste commented 1 year ago

Hi,

I am working on a project that uses ALSA to capture words and phrases from a microphone, which are then passed to the Vosk API (in C) to convert into text. This is working very well and I am impressed by the accuracy of the conversion. I then take the result of the conversion and output an appropriate message via a speech synthesiser.

I assumed that any new spoken phrases would not be captured until I issued the next snd_pcm_readi(handle, buffer, frames); call. But what seems to be happening is that while the first phrase is being processed and the message spoken, ALSA is still capturing audio from the microphone. This means that some of the synthesiser's speech is being captured by the mic, causing a feedback loop!

Is there a way of flushing/deleting unprocessed data in the input buffer before starting a new phrase capture?

Thank you for any help you can give.

Ernie

nshmyrev commented 1 year ago

recognizer.Reset() should reset the recognizer buffer if you need it.

https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L196

In general it is better to have a microphone setup with acoustic echo cancellation (AEC), so the playback is cancelled out by specialised hardware or algorithms. ReSpeaker boards have it, for example.

prevoste commented 1 year ago

Hi,

Thanks so much for the response. I assume when you say use recognizer.Reset() that corresponds to snd_pcm_reset(handle) in a C program.

I have tried an experiment. I first captured an utterance and printed it out, then put the program to sleep for 10 seconds using sleep(10), and then did a snd_pcm_reset(handle). During the 10-second pause I spoke a second utterance, before the reset was executed. When the pause finished, the second utterance was printed out! I would be grateful if you have any idea what is going on.

Can you suggest any suitable AEC microphones, perhaps from Amazon, or are these expensive studio-type devices?

Thanks again

Ernie

nshmyrev commented 1 year ago

then put the program to sleep for 10 seconds using sleep(10) and then did a snd_pcm_reset(handle)

I would rather simply ignore the recorded audio than try to deal with ALSA buffers.

Can you suggest any suitable AEC microphones, perhaps from Amazon or are these expensive studio type devices?

https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0

It is not very expensive. The software is awful, though.

You can also use software AEC:

https://github.com/orgs/SEPIA-Framework/discussions/152

prevoste commented 1 year ago

Hi Nickolay,

Thanks again for your reply.

I am not sure I fully understand what you are suggesting with "I would rather simply ignore the recorded audio than try to deal with alsa buffers". Are you suggesting ignoring any audio for a period of time after processing the first utterance? If so, are the separate utterances timestamped in any way?

Ernie

nshmyrev commented 1 year ago

There are timestamps, of course, but you don't need them: you can keep receiving recognizer results while playing the response and simply drop them. You drop the results, not the audio buffer.

prevoste commented 1 year ago

Hi Nickolay,

Thanks again for your help. It is very much appreciated.

I was not sure what function I should use to "drop the results". I tried various options; the only one that seemed to work was snd_pcm_prepare(handle); called after playing the response. Does this seem right?

At the end of this process who is responsible for closing the issue?

Ernie

prevoste commented 1 year ago

Hi,

I am still having problems getting my program working.

The program is intended to work like any other voice assistant: listening for voice input and then using TTS to speak the response.

VOSK works great at capturing the speech input and I am using Python PYTTSX3 to generate the response. What I need is a way of stopping VOSK listening while PYTTSX3 speaks the response, otherwise it takes the spoken output as input.

What I have done to try to make this work is the following:

Created the PYTTSX3 program as a Raspberry Pi system service; it creates a socket to listen on. When it receives a connection with text, it generates the speech.

A second C program uses VOSK to listen for input; when it gets a complete spoken phrase it makes a connection (using sockets) to the service program and sends the text to be spoken. This all seems to work well, except that I am still getting the generated output captured as input from time to time.

I have tried several things to correct this. The idea I had was to get the PYTTSX3 program to send an ‘ACK’ reply to the VOSK program, and have the VOSK program wait until it receives the ‘ACK’ message before continuing to capture new sounds. The problem seems to be that VOSK is still listening even when I think it is not.

You said in a previous reply ‘you can just receive recognizer results while playing the response but simply drop them’. I am not sure I understand how I should do that. I guess I don’t have a clear understanding of how VOSK works in the background. I would appreciate any help you could give me.

Thanks

Ernie

nshmyrev commented 1 year ago

The problem seems to be that VOSK is still listening even when I think is not.

Vosk is a library, what is the code of your application? What makes you think it is listening?

prevoste commented 1 year ago

Hi,

Thanks for the quick reply.

The function below is called from within a while loop until I ask the program to finish with an ‘exit program’ command.

What makes me think it is still recording is that I see the partial phrase being displayed even when I say nothing else.

If I put an artificial time delay in after receiving the ‘ACK’ message I get an ‘overrun occurred’ message, and it seems to work then!

How should I process the input to get the effect I want?

Thanks again

Ernie

int processLoop(void) {
    char recBuff[10];
    const char *text;

    rc = snd_pcm_readi(handle, buffer, frames);
    if (rc == -EPIPE) {
        /* EPIPE means overrun: the capture buffer filled while we were busy */
        fprintf(stderr, "overrun occurred\n");
        snd_pcm_prepare(handle);
        return(0);              /* buffer contents are not valid after a failed read */
    } else if (rc < 0) {
        fprintf(stderr, "error from read: %s\n", snd_strerror(rc));
        return(0);              /* nothing valid to feed the recognizer */
    } else if (rc != (int)frames) {
        fprintf(stderr, "short read, read %d frames\n", rc);
    }

    /* 'size' must be the number of bytes actually read (rc frames * 2 for S16_LE mono) */
    final = vosk_recognizer_accept_waveform(recognizer, buffer, size);
    if (final) {
        /* Parse the final result into a JSON object */
        jobj = json_tokener_parse(vosk_recognizer_result(recognizer));

        if (json_object_object_get_ex(jobj, "text", &jtextdata)) {
            text = json_object_get_string(jtextdata);
            printf("%d  Text1: %s\n", cnt, text);
            if (strlen(text) != 0) {
                if (strstr(text, "exit program") != NULL) {
                    printf("Exiting program\n");
                    json_object_put(jobj);   /* release the JSON object before leaving */
                    return(-1);
                } else if (strstr(text, "huh") != NULL) {
                    printf("Ignore\n");
                } else {
                    write(sockfd, text, strlen(text));
                    rc = read(sockfd, recBuff, 3);
                    recBuff[rc > 0 ? rc : 0] = '\0';   /* read() does not NUL-terminate */
                    printf("Rec Char %s\n", recBuff);
                    if (strstr(recBuff, "ACK") != NULL) {
                        printf("ACK received\n");
                    }
                }
            }
        }
        json_object_put(jobj);   /* json_tokener_parse returns a refcounted object; release it */
    } else {
        jobj = json_tokener_parse(vosk_recognizer_partial_result(recognizer));
        if (json_object_object_get_ex(jobj, "partial", &jtextdata)) {
            text = json_object_get_string(jtextdata);
            if (strlen(text) != 0) {
                printf("%d   Text2: %s\n", cnt, text);
            }
        }
        json_object_put(jobj);   /* partial results leak too if never released */
    }
    cnt++;
    return(0);
}
prevoste commented 10 months ago

Hi,

I am still having problems capturing speech directly from a microphone to be processed by VOSK. I am using the alsa/asoundlib library to capture the audio. After opening the microphone I read the audio using snd_pcm_readi(handle, buffer, frames);. I then pass the buffer to final = vosk_recognizer_accept_waveform(recognizer, buffer, size);. If this returns true I then continue to process and extract the spoken text. This all seems to work as expected.

The problem I am still having is that if I leave the microphone open, listening for anything to be spoken, but don’t speak, the program’s memory usage continues to increase to around 82%, at which point the program hangs. You can see from the two attached htop screenshots that when the Vosk model is first loaded the program is using about 7.9%, but left to run in silence it increases to about 82% over 30-plus minutes, and then hangs.

I can kill the program with Ctrl-C, which frees the memory.

Can anyone suggest what I am doing wrong? How and when should any memory buffers be freed?

Ernie

(Attachments: Image1, Image2 — htop screenshots)