eliihen / wsta

A CLI development tool for WebSocket APIs
GNU General Public License v3.0

Issue sending JSON and then audio Watson Speech to Text service #5

Closed nfriedly closed 8 years ago

nfriedly commented 8 years ago

Hey, this is a follow-up to the comments I left on Hacker News. I'm trying to send an opening JSON message and then audio data to the Watson STT service. It was suggested that something like this would work:

arecord -fdat | cat start.json - | wsta 'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...'

(Getting a token requires some fiddling around with bluemix for credentials and converting it to a token via either curl or your favorite SDK... or just go to the demo, open the dev console, and grab one - they're reusable for a short period of time. OR, if wsta supports custom headers, you can stick the credentials into a basic auth header and skip the token.)

And, start.json looks like this:

{"action":"start","content-type":"audio/wav","interim_results":"true"}

However, when I do that, I get a ton of "error: stream did not contain valid UTF-8" messages, then the normal {"state": "listening"} message that acknowledges my initial JSON, and then a "No JSON object could be decoded" error.

My best guess is that wsta is correctly marking the opening JSON as a UTF-8 message, and then incorrectly marking all of the audio data as UTF-8 messages also. Does this sound likely? Is that reasonably easy to fix?

FWIW, the service also expects a closing JSON message at the end.. but that's not nearly as important because it will automatically kill the connection after 30 seconds of silence.

eliihen commented 8 years ago

Hi! Many thanks for the detailed report, I managed to reproduce without much hassle.

Actually, it is a bit simpler than that. The problem occurs here. It seems that Rust's read_line assumes that the input is UTF-8 and returns an error if it is not. I guess this is because a Rust String must always hold valid UTF-8.
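
To illustrate the behavior described above, here is a minimal sketch using only the standard library. It is not taken from wsta's source; the helper names read_text_line and read_raw_line are made up for this example. It shows read_line rejecting non-UTF-8 input, while read_until into a byte buffer accepts the same bytes.

```rust
use std::io::{BufRead, Cursor, ErrorKind};

/// Hypothetical helper: read one line as a String, reporting the error kind on failure.
fn read_text_line(input: &[u8]) -> Result<String, ErrorKind> {
    let mut reader = Cursor::new(input);
    let mut line = String::new();
    // read_line appends to a String, so it has to reject bytes that are not UTF-8.
    reader.read_line(&mut line).map_err(|e| e.kind())?;
    Ok(line)
}

/// Hypothetical helper: read up to and including the first newline as raw bytes.
fn read_raw_line(input: &[u8]) -> Vec<u8> {
    let mut reader = Cursor::new(input);
    let mut buf = Vec::new();
    // read_until works on a Vec<u8>, so no UTF-8 validation takes place.
    reader
        .read_until(b'\n', &mut buf)
        .expect("reading from an in-memory buffer should not fail");
    buf
}

fn main() {
    let json = b"{\"action\":\"start\"}\n";
    // 0xff can never appear in valid UTF-8, much like arbitrary audio samples.
    let audio = [0xff, 0xfe, 0x00, 0x80, b'\n'];

    println!("{:?}", read_text_line(json));
    println!("{:?}", read_text_line(&audio));
    println!("{:?}", read_raw_line(&audio));
}
```

The second call fails with ErrorKind::InvalidData, which is the kind of error behind the "stream did not contain valid UTF-8" message quoted in the report.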

I'll research if there is some other way to handle this.

Also: Way cool idea you have here. Love it.

nfriedly commented 8 years ago

Aah, I see. Thanks for looking into it and thanks for the props :)

eliihen commented 8 years ago

So, I've looked into the matter. It is indeed possible, I just need to figure out a neat way to implement it.

Do you by any chance know how the Watson service determines how much data to put into each frame? It obviously does not make sense to use line breaks to separate frames in binary data. Have they simply decided to put 5.8KB in each chunk and send that in a frame? It looks like it, looking at the image below. In which case, I'm thinking of an API like --binary=5800 to request wsta to be in streaming binary mode with 5800 bytes in each frame.

[Screenshot: spectacle f24470]

That API has a problem, however, and that problem is the fact that you want to send a JSON message first and last. It would not make sense to have the same 5800-byte limitation there. Maybe a better option is to try to parse everything as UTF-8 and then fall back to binary if that fails. That has its own issues, implementation-wise, however.
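
A rough sketch of the "try UTF-8, fall back to binary" idea, assuming a hypothetical Outgoing message type and classify function (neither is taken from wsta's source):

```rust
/// Hypothetical message type for this sketch; not wsta's actual type.
#[derive(Debug, PartialEq)]
enum Outgoing {
    Text(String),
    Binary(Vec<u8>),
}

/// Treat a chunk as text when it is valid UTF-8, otherwise as binary.
fn classify(bytes: Vec<u8>) -> Outgoing {
    match String::from_utf8(bytes) {
        Ok(text) => Outgoing::Text(text),
        // from_utf8 hands the original bytes back inside the error.
        Err(err) => Outgoing::Binary(err.into_bytes()),
    }
}

fn main() {
    println!("{:?}", classify(b"{\"action\":\"start\"}".to_vec()));
    println!("{:?}", classify(vec![0x52, 0x49, 0xff, 0xfe]));
    // A chunk of digital silence is all zero bytes, which is valid UTF-8.
    println!("{:?}", classify(vec![0, 0, 0, 0]));
}
```

The last call hints at one possible issue with this approach: a chunk of audio that happens to be valid UTF-8, such as a run of zero bytes, would be classified as text.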

Suggestions or ideas are welcome. I'll get back to you when I know more.

nfriedly commented 8 years ago

I believe the Watson service can handle a pretty wide range of frame sizes, but I think the demo works on 8192-sample buffers of 16-bit mono audio, so 131,072 bits, which is 16,384 bytes, or 16 kilobytes per frame. (Sorry, I mixed up the bits/bytes at first.)
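
The arithmetic above can be checked with a few lines of Rust. The function name frame_bytes is only illustrative:

```rust
/// Size in bytes of one buffer of uncompressed PCM audio.
fn frame_bytes(samples: usize, bits_per_sample: usize, channels: usize) -> usize {
    samples * channels * bits_per_sample / 8
}

fn main() {
    // 8192 samples of 16-bit mono audio, as in the demo described above.
    let bytes = frame_bytes(8192, 16, 1);
    // prints "131072 bits = 16384 bytes = 16 KiB"
    println!("{} bits = {} bytes = {} KiB", bytes * 8, bytes, bytes / 1024);
}
```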

But, yea, switching between text and binary is the tricky part; I'm not sure what to suggest there.

Although looking closer at your screenshot, I do see the 5.8KB chunks.. not sure what decided that.

eliihen commented 8 years ago

I messed around a bit more with the binary feature, and now I have something I can show. wsta can now stream binary data just fine.

I did not succeed in getting the Watson service to respond, however. Are you pre-processing the audio in any way before it is sent over the wire?

If you want to try it, you can have a look at the binary-data branch and run the following example. These options are not documented yet, but -b, --binary is a switch which turns wsta into binary mode where it sends at most 256 bytes at a time. You can override that size with WSTA_BINARY_FRAME_SIZE.

 env WSTA_BINARY_FRAME_SIZE=16384 cargo run -- --binary 'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...' "$(cat start.json)" < audio-file.wav
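
For readers curious what "at most 256 bytes at a time, overridable with WSTA_BINARY_FRAME_SIZE" could look like, here is a minimal sketch. It is an assumption about the general shape, not the code on the binary-data branch: the helpers frame_size_from and read_frames are hypothetical, and the sketch collects frames in memory instead of sending each one over a socket.

```rust
use std::env;
use std::io::{self, Read};

/// Default taken from the description above: at most 256 bytes per frame.
const DEFAULT_FRAME_SIZE: usize = 256;

/// Hypothetical helper: parse the override, falling back to the default
/// when the value is missing, not a number, or zero.
fn frame_size_from(value: Option<&str>) -> usize {
    value
        .and_then(|v| v.parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_FRAME_SIZE)
}

/// Hypothetical helper: split a byte stream into frames of at most `frame_size` bytes.
fn read_frames<R: Read>(mut input: R, frame_size: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut frames = Vec::new();
    loop {
        let mut frame = Vec::with_capacity(frame_size);
        // take() caps how much read_to_end may pull from the underlying reader.
        let n = input
            .by_ref()
            .take(frame_size as u64)
            .read_to_end(&mut frame)?;
        if n == 0 {
            break; // end of input
        }
        frames.push(frame);
    }
    Ok(frames)
}

fn main() -> io::Result<()> {
    let size = frame_size_from(env::var("WSTA_BINARY_FRAME_SIZE").ok().as_deref());
    // Stand-in for audio read from stdin: 600 bytes of silence.
    let audio = vec![0u8; 600];
    let frames = read_frames(&audio[..], size)?;
    println!("frame size {}, {} frame(s)", size, frames.len());
    Ok(())
}
```

A real client would presumably send each frame as a binary WebSocket message as soon as it is read, so that live audio from arecord is streamed instead of buffered.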

nfriedly commented 8 years ago

Oh, instead of arecord -fdat, try arecord --format=S16_LE --rate=44100 --channels=1.

And then, I think WebSockets differentiate between text (UTF-8) and binary messages, and the opening JSON one may have to be sent as text.

I'm kind of slammed right now, so it might be a few days before I can test anything :/

eliihen commented 8 years ago

Oh wow, it works! This is pretty much the coolest thing I have seen all month! Thanks for the updated arecord, that fixed the issue right up.

I'll release a new version soon, so that you can enjoy this through your normal distribution channel when you feel like it.

$ arecord -D hw:3,0 --format=S16_LE --rate=44100 --channels=1 | env WSTA_BINARY_FRAME_SIZE=16384 target/release/wsta -b 'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...'  "$(cat start.json)" | jq .results[0].alternatives[0].transcript
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 44100 Hz, Mono
Connected to wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...
null
"hello "
"hello this is me "
"hello this is me talking to "
"hello this is me talking to people "
"hello this is me talking to people "
"%HESITATION "
"all my "
"all my "
"well my "
"well my go through "
"well my go this "
"hold my ground this action "
"hold my ground this actually worked "
"hold my ground this actually works "
"or more "
"or more ago "
"well I got on the "
"well I got on the details "
"or more you know to tell someone about this "
"or more you know to tell someone about this "

As you can see, my reaction was pretty much summed up in "hold my ground this actually worked".