GoogleCloudPlatform / python-docs-samples

Code samples used on cloud.google.com
Apache License 2.0

Intermittent gRPC error while streaming audio from microphone #1110

Closed: akmalmzamri closed this issue 5 years ago

akmalmzamri commented 7 years ago

In which file did you encounter the issue?

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_mic.py

Did you change the file? If so, how?

I added a few things and changed the recording input to arecord (I couldn't get pyaudio to work on the device I'm using). I also added a 10-second timeout: if the user doesn't speak for 10 seconds, the recording ends (it also ends when the user finishes talking).

Describe the issue

This happens randomly and I can't reproduce it. Sometimes the audio streaming stops abruptly and returns this message:

File "/home/nao/.local/lib/python2.7/site-packages/grpc/_channel.py", line 351, in next
    return self._next()
File "/home/nao/.local/lib/python2.7/site-packages/grpc/_channel.py", line 342, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.INVALID_ARGUMENT, Invalid audio content: too long.)>

I believe I did something wrong, but I can't find the root cause and hence can't reproduce it. I've attached the full modified code. The main program creates a SpeechToText object and calls listenSpeech() from time to time. The error occurs most frequently when the user doesn't speak for 10 seconds (i.e. during the timeout).

The weird thing is that it didn't happen for the first few days after I wrote this code. I started seeing it this week, and it has become more frequent today, to the point that the program is unusable. One thing that has changed since I first wrote the code is that I shared my VM with colleagues so they could test the program. Could that be the cause? Is there any issue with sharing a Google Application Credentials JSON file? sample.code.txt

ael-computas commented 7 years ago

I think this is due to the fact that they updated the API.
For me it was solved by updating the Python client library.

theacodes commented 7 years ago

@gguuss can you take a look?

gguuss commented 7 years ago

@jerjou /fyi

gguuss commented 7 years ago

Currently, speech streaming is limited to 60 seconds; I believe you're somehow reaching that limit while running the sample. I'll see if I can repro your issue. Also, of particular note, you may be capturing audio at a rate higher than 16000 Hz, or on more than one channel. If that happens, the API tends to "think" that the audio is longer than it actually is. To debug, you might want to try recording the audio to a file and inspecting the file to make sure it's mono / 16 kHz.
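(A quick way to do the inspection suggested above, using Python's standard wave module; the file path is a placeholder for wherever you saved the capture:)

```python
import wave

def check_wav(path, want_rate=16000, want_channels=1):
    """Print a WAV file's sample rate and channel count, and verify them."""
    with wave.open(path, 'rb') as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        print('rate=%d Hz, channels=%d' % (rate, channels))
        return rate == want_rate and channels == want_channels
```

If this returns False, the API is receiving audio at a different rate or channel count than the request claims, which matches the "too long" symptom.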

akmalmzamri commented 7 years ago

Thanks for the reply @gguuss. The code is written so that a streaming session never exceeds 10 seconds. Could it be that the streaming recognize process is not being ended properly? So far I haven't been able to reproduce this issue. It only happens randomly, and I've noticed it happens most often when I don't speak at all.

gguuss commented 7 years ago

@akmalhakimi1991 I think what's happening is that you're recording at a rate higher than 16 kHz mono (e.g. 44.1 kHz stereo) and the API is misreading the request data. Can you verify that you're actually recording at 16 kHz by writing the audio to a file and then inspecting the file?

jerjou commented 7 years ago

FYI, this error currently also occurs when a single audio chunk / request is longer than 6 seconds. Examining sample.code.txt, I don't see anything obvious that would cause it to buffer that much audio (the chunk parameter is pretty much ignored), and I can't seem to reproduce this myself by running sample.code.txt. Here are some guesses as to what might cause 6 seconds of audio to buffer:

akmalmzamri commented 7 years ago

@gguuss Did what you suggested, and the audio is recorded at 16000 Hz. So I guess that's not the problem, unless it can randomly change over time.

@jerjou I noticed that this happens a lot (but not all the time) when I'm not speaking during the 10-second window. Does that mean anything to you? It hasn't happened for quite some time now, though.

akmalmzamri commented 7 years ago

@jerjou One more question: since I'm not using pyaudio, do I still need the chunk parameter? If so, do you know how I can implement that?

jerjou commented 7 years ago

Re: chunk - basically, you're just using 1024 as the chunk size. It refers to how many bytes of audio you send with each request.

(and I can't think of anything obvious that might cause the error to occur when not speaking for 10 seconds...)
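(To illustrate what the chunk size means without pyaudio: a minimal sketch of reading fixed-size chunks from any binary stream. With arecord you'd presumably pass the stdout pipe of the recording process; the generator itself only assumes a file-like object with read().)

```python
# At 16 kHz, 16-bit mono, 1024 bytes is roughly 32 ms of audio.
CHUNK = 1024

def read_chunks(stream, chunk_size=CHUNK):
    """Yield fixed-size byte chunks from a binary stream until EOF."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            return
        yield data
```

With arecord, the stream would be something like the stdout of subprocess.Popen(['arecord', '-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw'], stdout=subprocess.PIPE) - an assumption about the setup, not part of the sample.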

akmalmzamri commented 7 years ago

Hi @jerjou and @gguuss. After not seeing the error for quite some time, it hit me again today. This time, however, I managed to save the recording that produced the error (I saved the raw audio bytes and turned them into a WAV file). I tried to transcribe the audio using this sample code and it produced the same error. Here's the audio.

The difference between this recording and the others is that this one didn't stop after I finished speaking. That was probably caused by background noise.
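(For reference, the saving step described above can be done with the stdlib wave module - a sketch assuming the captured bytes are raw 16 kHz, 16-bit mono PCM; the function name is made up here:)

```python
import wave

def save_pcm_as_wav(pcm_bytes, path, rate=16000, channels=1, sampwidth=2):
    """Wrap raw 16-bit mono PCM bytes in a WAV container for replay/inspection."""
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sampwidth)   # bytes per sample: 2 = 16-bit
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)
```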

jerjou commented 7 years ago

Thanks for the reproduction! Unfortunately, the error is expected with the sample code you're using to reproduce it. Note that line 41 of that sample says:

# In practice, stream should be a generator yielding chunks of audio data.

The sample code as-is takes the entire audio file you give it and sends it to the API as one big chunk. Since the audio file is longer than 6 seconds, the API notices that this is above the "realtime" threshold and returns that error. Your code is likely doing the same thing (i.e. sending a bunch of data all at once), but for a different reason - the reason we're trying to figure out.
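(A minimal sketch of what that comment asks for - a generator that feeds the file to the API in roughly 100 ms pieces instead of one big blob, so no single request approaches the 6-second limit. The constants assume 16 kHz, 16-bit mono PCM; the function name is made up:)

```python
RATE = 16000          # samples per second, mono
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM
CHUNK_SECONDS = 0.1   # ~100 ms per request, well under the ~6 s limit

def file_chunks(path):
    """Yield raw PCM from a file in ~100 ms chunks instead of one big request."""
    chunk_bytes = int(RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)  # 3200 bytes
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                return
            yield data
```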

Here's a hypothesis. Since Python infamously has a global interpreter lock that effectively makes it single-threaded, perhaps the stream_audio thread is monopolizing the CPU, such that the generator thread only gets to run once every n seconds (where sometimes n >= 6). So:

Let me know if any of that was unclear, and I can explain in more detail.
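(If that hypothesis holds - the queue silently accumulating several seconds of audio between sends - one possible mitigation, not from the sample itself, is to cap how many bytes a single yielded request may contain. A hedged sketch, with a made-up function name; the cap is soft and may overshoot by at most one chunk:)

```python
import queue

MAX_REQUEST_BYTES = 16000 * 2 * 5  # ~5 s of 16 kHz 16-bit mono per request

def capped_requests(buff):
    """Drain queued audio chunks, stopping each request near MAX_REQUEST_BYTES.

    A None in the queue signals end of stream.
    """
    while True:
        chunk = buff.get()
        if chunk is None:
            return
        data = [chunk]
        size = len(chunk)
        while size < MAX_REQUEST_BYTES:
            try:
                chunk = buff.get_nowait()
            except queue.Empty:
                break
            if chunk is None:
                yield b''.join(data)
                return
            data.append(chunk)
            size += len(chunk)
        yield b''.join(data)
```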

jerjou commented 7 years ago

(Aside: if you expect that your primary audience will speak Indian-English, you might try setting language_code to en-IN to get better accuracy.)
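(Assuming the google-cloud-speech client library used by the sample at the time, the change would presumably look something like this config fragment:)

```python
from google.cloud import speech

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-IN',  # Indian English, instead of the sample's 'en-US'
)
```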

akmalmzamri commented 7 years ago

@jerjou Thanks for the suggestion. I'll try it out and let you know if I find anything. If I understand you correctly, the audio should not be streamed as one whole recording but chunk by chunk, even if there's no pause in the speech (including background noise)? Say I speak non-stop for 8 seconds - this issue shouldn't happen, since the audio is sent chunk by chunk. Is that correct?

jerjou commented 7 years ago

No - I'm still not sure why it might happen more often when there's silence. The audio stream is sent over the network chunk by chunk no matter what - in the sample.code.txt code you shared up top, you were sending it in chunks of at least [1] 1024 bytes. The script doesn't actually do any sort of pause detection - so silence and speech (and background noise) should look the same to it. So, if you speak non-stop for 8 seconds, the behavior should be the same as if you don't speak at all for the same 8 seconds. At least, this should be the case on the python script side. Is there anything special happening with your audio device, or your operating system, that might be doing silence detection or anything?

[1] I say "at least" here because you're reading from the arecord output 1024 bytes at a time, but the bytes are then put into an intermediate self._buff queue. When the script is ready to send a chunk to the API, it reads the entire self._buff queue and sends it all as one chunk. So, depending on how often the script sends chunks to the API, it could send a single 1024-byte chunk, or several 1024-byte chunks that have accumulated since the last send.
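(The drain-the-queue behavior described in [1] follows the pattern used in transcribe_streaming_mic.py; a condensed sketch of it, with a made-up function name:)

```python
import queue

def buffered_chunks(buff):
    """Yield one request per call: a blocking get, plus whatever else has queued up."""
    while True:
        chunk = buff.get()        # block until at least one 1024-byte read arrives
        if chunk is None:         # sentinel: stream closed
            return
        data = [chunk]
        while True:               # drain anything that accumulated meanwhile
            try:
                chunk = buff.get_nowait()
            except queue.Empty:
                break
            if chunk is None:
                return
            data.append(chunk)
        yield b''.join(data)      # several 1024-byte reads may merge into one request
```

This is why a request can be much bigger than 1024 bytes: if the consumer stalls (for whatever reason) while the producer keeps filling the queue, the next yield merges everything that piled up.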

akmalmzamri commented 7 years ago

Is there anything special happening with your audio device, or your operating system, that might be doing silence detection or anything?

Maybe, but it happened on different devices (an Ubuntu VM on a laptop, and multiple NAO robots).

engelke commented 5 years ago

There has been no further activity since the discussion above, over a year ago, so closing this issue.