Closed akmalmzamri closed 5 years ago
I think this is due to the fact that they updated the API.
For me it was solved by updating the python client library.
@gguuss can you take a look?
@jerjou /fyi
Currently, speech streaming is limited to 60 seconds, I believe you're somehow reaching that limitation while running the sample. I'll see if I can repro your issue. Also, of particular note, you may be getting content at a higher rate than 16000hz on 1 channel. If this happens, the API tends to "think" that the audio is longer than it actually is. To debug, you might want to try just recording audio to a file and inspecting the file to make sure it's mono / 16khz.
Thanks for the reply @gguuss. The code was made so that a streaming session never exceeds 10 seconds. Could it be that the streaming recognize process is not properly ended? So far I haven't been able to reproduce this issue. It only happens randomly, and I noticed that it happened the most when I didn't speak anything.
@akmalhakimi1991 I think what's happening is that you're recording at a higher rate than 16khz mono (e.g. 44khz stereo) and the API is misreading the request data. Can you verify that you're actually recording data at 16khz by writing to a file and then inspecting the file?
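One way to do the check @gguuss suggests, sketched with Python's standard-library wave module. The filename and the arecord command are illustrative, not from the sample; here the file is generated with a second of silence just so the inspection step is runnable as-is:

```python
import wave

# Write one second of 16 kHz mono silence so the check below is runnable.
# In practice you would record test.wav yourself, e.g. with:
#   arecord -f S16_LE -r 16000 -c 1 -d 5 test.wav
with wave.open("test.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

# Inspect the file: these are the values the streaming config expects.
with wave.open("test.wav", "rb") as wf:
    channels = wf.getnchannels()
    rate = wf.getframerate()
    duration = wf.getnframes() / rate
    print(channels, rate, duration)  # expect: 1 16000 1.0
```

If channels is 2 or the rate is 44100, the request metadata and the actual bytes disagree, which matches the failure mode described above.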
FYI this error currently also occurs when a given audio chunk / request is greater than 6 seconds. Examining sample.code.txt, I don't see anything obvious that might cause it to buffer that much audio (the chunk parameter is pretty much ignored), and I can't seem to reproduce this myself by running sample.code.txt. Here's a guess as to what might cause 6 seconds of audio to buffer: arecord might be silently not supporting 16khz, and giving you, say, 44100hz instead. If the time between requests being sent is large enough (2 seconds, I guess? From estimating the calculation in my head), the server would interpret 2 seconds of 44100hz bytes as 6+ seconds of 16khz bytes.

@gguuss Did what you suggested and the audio is recorded at 16000hz. So I guess that's not the problem, unless it can randomly change over time.
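For reference, jerjou's back-of-envelope estimate above works out as follows (a sketch; the 16-bit mono assumption matches the S16_LE format used in this thread):

```python
# If arecord silently falls back to 44100 Hz, how long does 2 s of
# captured audio *look* to an API that assumes 16 kHz mono 16-bit?
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM

actual_bytes = 2 * 44100 * BYTES_PER_SAMPLE           # 2 s at 44.1 kHz mono
apparent_seconds = actual_bytes / (16000 * BYTES_PER_SAMPLE)
print(apparent_seconds)  # 5.5125 - right around the ~6 s threshold
```

Stereo capture would double the byte count again, pushing the apparent duration past 11 seconds.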
@jerjou I noticed that this happened a lot (but not all the time) when I wasn't speaking during the 10-second period. Does this mean anything to you? So far it hasn't happened for quite some time though.
@jerjou One more question. Since I'm not using pyaudio, is there any need to use the chunk parameter? If yes, do you know how I can implement that?
Re: chunk - you're just using 1024 as the chunk size. It basically just refers to how many bytes of audio you send with each request.
(and I can't think of anything obvious that might cause the error to occur when not speaking for 10 seconds...)
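A minimal sketch of what reading arecord output in fixed-size chunks might look like (the function name and arecord flags here are illustrative, not from the sample):

```python
import subprocess

def audio_chunks(chunk_size=1024):
    """Yield raw PCM chunks from arecord, chunk_size bytes per read.

    Assumes arecord is installed; the flags mirror the 16 kHz mono
    16-bit setup discussed above.
    """
    proc = subprocess.Popen(
        ["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1", "-t", "raw"],
        stdout=subprocess.PIPE,
    )
    try:
        while True:
            data = proc.stdout.read(chunk_size)
            if not data:  # arecord exited
                break
            yield data
    finally:
        proc.kill()
```

Each yielded value is what would go into one streaming request (possibly after further buffering, as discussed below in this thread).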
Hi @jerjou and @gguuss. So after not seeing the error for quite some time, it hit me again today. However, this time I managed to save the recording that produced the error (I basically saved the audio string and turned it into a wav file). I tried to transcribe the audio using this sample code and it produced the same error. Here's the audio.
The difference between this recording and the other recordings is that this recording didn't stop after I finished speaking. This was probably caused by the background noise.
Thanks for the reproduction! Unfortunately, using the sample code you're using to reproduce it, the error is expected. Note that on line 41, it says:
# In practice, stream should be a generator yielding chunks of audio data.
The sample code as-is just takes the entire audio file you give it and sends it to the API as one big chunk. Since the audio file is greater than 6 seconds long, the API notices that it's above the "realtime" threshold and returns that error. This is likely what your code is doing as well (i.e. sending a bunch of data all at once), but for a different reason - which reason we're trying to figure out.
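For illustration, a generator that feeds a file to the API in small, real-time-paced slices instead of one big chunk might look like this (a sketch; the helper name and pacing are assumptions, not the sample's actual code):

```python
import time

def file_chunks(path, rate=16000, sample_width=2, chunk_ms=100):
    """Yield a raw PCM file in ~chunk_ms slices, paced to real time,
    so no single request carries more than a fraction of a second
    of audio."""
    chunk_bytes = int(rate * sample_width * chunk_ms / 1000)  # 3200 for 100 ms
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                return
            yield data
            time.sleep(chunk_ms / 1000.0)  # don't outrun real time
```

With 16 kHz 16-bit mono input, each slice is 3200 bytes, i.e. 100 ms of audio, well under the ~6 s per-request limit mentioned above.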
Here's a hypothesis. Since python infamously has a global interpreter lock that effectively makes it single-threaded, perhaps the stream_audio thread is monopolizing the CPU, such that the generator thread only gets to run once every n seconds (where sometimes n >= 6). So:

Increase the number of bytes you read per p.stdout.read (I'd do something like 16000 * 2 * .1 = 3200 bytes, for 100ms of audio) - I think p.stdout.read should block the current thread, giving the generator thread time to send some of that data.

Alternatively, add a time.sleep with an appropriately-small value after each p.stdout.read, since that should definitely sleep the stream_audio thread. You'll still want to increase the number of bytes you read (1024 bytes is just 1024 / (16000 * 2) = 0.032s = 32 milliseconds), and sleep for maybe half the time those bytes represent.

Let me know if any of that was unclear, and I can explain in more detail.
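A rough sketch of the second suggestion above (larger reads plus a short sleep); the function and variable names are illustrative, not the sample's actual code:

```python
import time

BYTES_PER_100MS = 16000 * 2 // 10  # 3200 bytes = 100 ms of 16 kHz 16-bit mono

def stream_audio(pipe_stdout, buff):
    """Read larger slices from the recording pipe and briefly yield the
    GIL after each read, so the generator thread can drain the buffer
    before several seconds of audio pile up."""
    while True:
        data = pipe_stdout.read(BYTES_PER_100MS)
        if not data:  # recorder exited
            break
        buff.put(data)
        time.sleep(0.05)  # ~half the 100 ms the bytes represent
```

The sleep guarantees the other thread gets scheduled at least once per read, regardless of whether the pipe read itself blocks.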
(Aside: if you expect that your primary audience will speak Indian English, you might try setting language_code to en-IN to get better accuracy.)
@jerjou Thanks for the suggestion. I'll try it out and let you know if I find anything. If I understand you correctly, the audio stream should, by right, not send the whole audio at once but instead chunk by chunk, even if there's no pause in the speech (including from background noise)? Let's say I speak non-stop for 8 seconds; this issue should not happen, since the audio is sent chunk by chunk. Is that correct?
No - I'm still not sure why it might happen more often when there's silence. The audio stream is sent over the network chunk by chunk no matter what - in the sample.code.txt code you shared up top, you were sending it in chunks of at least [1] 1024 bytes. The script doesn't actually do any sort of pause detection - so silence and speech (and background noise) should look the same to it. So, if you speak non-stop for 8 seconds, the behavior should be the same as if you don't speak at all for the same 8 seconds. At least, this should be the case on the python script side. Is there anything special happening with your audio device, or your operating system, that might be doing silence detection or anything?
[1] I say "at least" here because you're reading it from the arecord output 1024 bytes at a time, but then the bytes are put into an intermediate self._buff queue. When the script is ready to send a chunk to the API, it reads the entire self._buff queue and sends it all in one chunk. So, depending on how often the script sends chunks to the API, it could send one 1024-byte chunk, or several of these 1024-byte chunks that have accumulated since the last time the script sent chunks to the API.
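The accumulate-and-drain behavior described in this footnote can be sketched like this (a simplified stand-in for the sample's actual buffering code; the function name is illustrative):

```python
import queue

def drain(buff):
    """Block until at least one 1024-byte chunk is available, then
    greedily collect everything else that has accumulated in the queue
    and return it as a single request payload."""
    data = [buff.get()]  # blocks until the recorder has produced something
    while True:
        try:
            data.append(buff.get_nowait())
        except queue.Empty:
            break
    return b"".join(data)
```

If the sending side stalls for n seconds while the recorder keeps filling the queue, the next payload contains n seconds of audio in one chunk - which is exactly how the ~6-second limit could be tripped.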
Is there anything special happening with your audio device, or your operating system, that might be doing silence detection or anything?
Maybe, but it happened on different devices (i.e. in an Ubuntu VM on a laptop and on multiple NAO robots).
There has been no further activity since the discussion above, over a year ago, so closing this issue.
In which file did you encounter the issue?
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_mic.py
Did you change the file? If so, how?
I added a few things and changed the recording input to arecord (as I didn't manage to get pyaudio to work on the device that I'm working on). I also added a timeout: if the user doesn't speak for 10 seconds, it will end the recording (it will also end the recording when the user finishes talking).
Describe the issue
This happens randomly and I can't reproduce it. But sometimes, the audio streaming would stop abruptly and return this message:
I believe I did something wrong but can't find the root cause and hence am unable to reproduce it. I attached the full modified code. The main program will create the object SpeechToText and will call listenSpeech() from time to time. The error occurs frequently if the user doesn't speak anything for 10 seconds (i.e. during the timeout).

The weird thing is that it didn't happen for the first few days after I wrote this code. Then I started seeing it this week, and it kept getting more frequent today, to the point that it is unusable. One thing that changed (since I first wrote the code) is that I shared my VM with my colleagues for them to test the program. Is this a possible cause? Is there any issue with sharing the Google Application Credential json file? sample.code.txt