alexylem / jarvis

Jarvis.sh is a simple configurable multi-lang assistant.
http://openjarvis.com
MIT License

New STT engine with Bing Speech API #49

Closed Smanar closed 8 years ago

Smanar commented 8 years ago

I haven't tested anything yet, but before I forget, here is the Bing/Microsoft engine: https://www.microsoft.com/cognitive-services/en-us/speech-api

5,000 transactions per month for free.

physicien commented 8 years ago

I don't know why, but seeing Bing in front of a product doesn't inspire much confidence in me... It's not as if the search engine of the same name didn't really work :laughing:

Smanar commented 8 years ago

Lol, in any case I haven't tested it yet, but it will be hard to beat Google's. We're starting to be swamped with engines; it's really the trendy thing. For now I'm sticking with Google's, and I'm waiting for Amazon's (Alexa), which should come out in French within a year, according to them.

physicien commented 8 years ago

For my part, Wit is perfectly adequate for now, and the Kaldi setup on a local server is progressing.

HoLengZai commented 8 years ago

Hi @alexylem,

I have just finished a Python script that uses the Bing Speech API, as I cannot get a Google Speech API key (I don't want to use the new Cloud Speech API, as I would need to create a billing account).

The Microsoft documentation is not really good, but I succeeded in getting my speech back as text with a "nice" result in JSON and XML format. I tried it in English, French, Chinese and Dutch (bad result with that last one, but I suppose it's still in beta for this language).

Here is a sample of my result from reading your GitHub Jarvis homepage aloud:

pi@raspberrypi:~/jarvis $ ./bingspeech.py

The body data: 
Oxford Access Token: {"access_token":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJhcGltLXVzZXItaWQiOiJmZmFmNGQxZWRjMWY0YzgzYjFmZDQwZTkyYWE0YjE2YyIsImFwaW0tc3Vic2NyaXB0aW9uLWlkIjoiOWM4NGEzNGE4ODUwNGJjYmExOTJkYzhiOWVjODdjMTUiLCJhcGltLXVzZXItZW1haWwiOiJ2aW5jaW1vdXNlQG91dGxvb2suY29tIiwiYXBpbS1rZXkiOiI2NGRjYTVmOWY5ZmM0NWE1OGVmYmM5MGU5ZDMwNzJiOCIsImNsaWVudC1pZCI6IjY0ZGNhNWY5ZjlmYzQ1YTU4ZWZiYzkwZTlkMzA3MmI4Iiwic2NvcGUiOiJodHRwczovL3NwZWVjaC5wbGF0Zm9ybS5iaW5nLmNvbSIsImlzcyI6InVybjptcy5veGZvcmQiLCJhdWQiOiJ1cm46bXMuc3BlZWNoIiwiZXhwIjoxNDY5Mzc2MDE5fQ.jsDbr9sfwXDpJynC9zYSp4gok4sf0Bn2cvyPnI4cw7g","token_type":"jwt","expires_in":"600","scope":"https://speech.platform.bing.com"}
200 OK
b'{"version":"3.0","header":{"status":"success","scenario":"ulm","name":"jarvis dot SH it\'s a lightweight if you\'re able utiline jarvis like about windows phone home automation running on slow computers example raspberry pi it install automatically speech recognition in sentences engine of your choices","lexical":"jarvis dot SH it\'s a lightweight if you\'re able utiline jarvis like about windows phone home automation running on slow computers example raspberry pi it install automatically speech recognition in sentences engine of your choices","properties":{"requestid":"818d8000-9c1f-48ab-80c2-99bd676ebeb5","HIGHCONF":"1"}},"results":[{"scenario":"ulm","name":"jarvis dot SH it\'s a lightweight if you\'re able utiline jarvis like about windows phone home automation running on slow computers example raspberry pi it install automatically speech recognition in sentences engine of your choices","lexical":"jarvis dot SH it\'s a lightweight if you\'re able utiline jarvis like about windows phone home automation running on slow computers example raspberry pi it install automatically speech recognition in sentences engine of your choices","confidence":"0.8036776","properties":{"HIGHCONF":"1"}}]}'
alexylem commented 8 years ago

@LengZai Wow, fantastic! The transcription is not perfect, but I don't know how good your mic / environment / accent are. Could you share it? I may integrate it into Jarvis if you agree.

HoLengZai commented 8 years ago

Done! I made it work! Well, the transcription is not perfect, maybe due to my accent :-P

It's a bit dirty because I didn't want to touch your main code, so I copied the google folder and modified it like this to create a substitute. The "modded" main.sh of the google stt_engines folder looks like this. I was too lazy to do it in a bash script, but I'm sure we can port the Python to bash if we don't want to keep Python (it's Python 3, but I think I found a way to make it work with Python 2):

#!/bin/bash
# Run the Python helper on the recorded audio and write its output to $forder.
_google_transcribe () {
    json=$(stt_engines/google/bingspeech.py "$audiofile")
    $verbose && printf 'DEBUG: %s\n' "$json"
    echo "$json" > "$forder"
}

google_STT () { # STT () {} Listens and transcribes the audio file, then writes the corresponding text to $forder
    LISTEN "$audiofile"
    _google_transcribe &
    spinner $!
}

How can I share the code with you?

HoLengZai commented 8 years ago

Wow?! I just got your update for the config folder... the update deleted my modified main.sh for Bing. I believed it would be OK to update because in one of your videos you mentioned that the setup would ask me whether to merge or not... Fortunately I posted it in this conversation just before getting your update, lol, and the update didn't delete the bingspeech.py file.

alexylem commented 8 years ago

Jarvis updates make sure the original system files are the correct ones, so your modded version of google_stt got "fixed" 😄 Your own custom files are preserved (as they would be if you had created a bing folder, until it is in the Jarvis repo). There is no merge anymore (at least not for the config); Jarvis has improved since the videos. Please attach bingspeech.py to this ticket. I will try to recode it in bash to limit dependencies.

HoLengZai commented 8 years ago

Ah OK, I didn't know we could do that on GitHub. Cool.

But... I tried to attach the zip (compressed with Windows, WinRAR and 7-Zip) and I always get the same error:

image

Edit: Ok i put on this link: http://dl.free.fr/fXYqy7BqO password: jarvis

Since the Bing Speech API gives us 5000 requests per month, if you want to save time I can provide my subscription key (by PM or email?), unless you already have one. In any case, it is very easy and fast to get one (unlike Google Cloud Platform). image

I don't know why, but sometimes (rarely) I get this message on the console:
Traceback (most recent call last):
  File "stt_engines/google/bingspeech.py", line 65, in <module>
    print(jvalue[0]['name'])
KeyError: _something_
?.

I think it happens when I exceed 20 queries within 1 minute or when Bing Speech responds with a bad result. So you will see in my code that I catch the case where I don't get JSON with "results". For sure we can improve that part. There are also some options on Bing to accept "bad words". I will try to improve it soon.
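As an illustration of guarding against the KeyError in the traceback above, here is a minimal Python sketch. It is not the code from bingspeech.py; `best_result` is a hypothetical helper, and the error-style sample is an assumed shape. It treats a missing or empty "results" list as "no transcription" and converts the confidence from a string, since the sample response above carries it as one.

```python
import json


def best_result(raw):
    """Return (text, confidence) for the top hypothesis, or None if there is none."""
    try:
        doc = json.loads(raw)
    except ValueError:
        return None  # the body was not JSON at all
    results = doc.get("results") or []
    if not results:
        return None  # e.g. a response that carries no "results" list
    # Confidence is a string in the sample response, so convert before comparing.
    top = max(results, key=lambda r: float(r.get("confidence", 0)))
    return top.get("name"), float(top.get("confidence", 0))


# Hypothetical, shortened responses used only for this demonstration.
ok = '{"header":{"status":"success"},"results":[{"name":"hello","confidence":"0.95"}]}'
err = '{"version":"3.0","header":{"status":"error","properties":{}}}'
print(best_result(ok))
print(best_result(err))
```

Returning None instead of raising lets the caller decide whether to retry, fall back to another engine, or ask the user to repeat.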

Thanks, hope it can help you

alexylem commented 8 years ago

I'm re-scoping this ticket to STT only for now.

alexylem commented 8 years ago

Very well documented code @LengZai, good job! I'll create my own key (because I need to experiment & document it for Jarvis users), test it, and start to re-code it in bash (if possible).

HoLengZai commented 8 years ago

OK, I think I understand why I sometimes get the error message on the console. I remember that I sometimes got a NOSPEECH value in the JSON response. So we can definitely handle the following error messages returned by the Bing Speech API to make Jarvis smarter based on the Bing response:

JSON-text: version header // result must be always and only present when status = "success"
  version: string // The API version "3.0", the value the client passed and consequently the API version that serviced the request.
  header: {status scenario properties} // name and lexical are always and only present in the case of "status" = "success"
    status: "success" / "error" / "false reco" // 'false reco' is returned only for 2.0 responses when NOSPEECH OR FALSERECO is 1. This is done to maintain backward compatibility.
    scenario: string // the scenario this recognition result came from.
    name: string // formatted recognition result. Profane terms are surrounded with tags.
    lexical: string // text of what was spoken. Profane terms are surrounded with tags.
    properties: {requestid }
      requestid: string // this is a uuid identifying the requestid to be sent with the associated logs. It should be used as the "server.requestid" parameter value in the subsequent logging API request.
      NOSPEECH: 1 // set when no speech was detected on the sent audio.
      FALSERECO: 1 // set when no matches were found for the sent audio.
      HIGHCONF: 1 // set when the header result is determined to be of high-confidence.
      MIDCONF: 1 // set when the header result is determined to be of medium-confidence.
      LOWCONF: 1 // set when the header result is determined to be of low-confidence.
      ERROR: 1 // set when there was an error generating a response.
  results: [{scenario name lexical confidence pronunciation tokens}*] // n-best list of results ordered by confidence.
    scenario: string // the scenario this recognition result came from.
    name: string // formatted recognition result. Profane terms are surrounded with tags.
    lexical: string // text of what was spoken. Profane terms are surrounded with tags.
    confidence: float // floating point number indicating the result confidence (from 0.0 to 1.0 with 1.0 being the maximum confidence level. Example: 0.876534)
    properties: {}
      HIGHCONF: 1 // set when the header result is determined to be of high-confidence.
      MIDCONF: 1 // set when the header result is determined to be of medium-confidence.
      LOWCONF: 1 // set when the header result is determined to be of low-confidence.

alexylem commented 8 years ago

Okay, maybe noisy room / mic quality, can happen. In this case, according to your extract, I'll test status = success, else ?
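One possible answer to the "else" branch, sketched in Python under assumptions: `interpret` is a hypothetical helper, the flag names come from the field list quoted above, and the non-success sample (status "error" together with NOSPEECH) is an assumed shape rather than a captured response.

```python
import json


def interpret(raw):
    """Map a response to ('ok', text) or (reason, None) using the header fields."""
    header = json.loads(raw).get("header", {})
    props = header.get("properties", {})
    if header.get("status") == "success":
        return "ok", header.get("name")
    # Flags are compared as strings because the sample responses carry "1", not 1.
    for flag in ("NOSPEECH", "FALSERECO", "ERROR"):
        if str(props.get(flag)) == "1":
            return flag.lower(), None
    return "unknown", None


# Hypothetical, shortened responses used only for this demonstration.
success = '{"header":{"status":"success","name":"hello","properties":{"HIGHCONF":"1"}}}'
nospeech = '{"header":{"status":"error","properties":{"NOSPEECH":"1"}}}'
print(interpret(success))
print(interpret(nospeech))
```

With a reason code in hand, the assistant could say "I did not hear anything" for nospeech and "I did not understand" for falsereco instead of failing silently.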

HoLengZai commented 8 years ago

I was thinking of doing it in a bash script too :-P Let me know your progress; you will have my support. I'm quite sure it's possible, as you just need to save the access_token before sending the request to the Bing Speech API. So compared to Google, we need to send 2 requests instead of one. But I think we can also speed up the process (need to try), because the access_token has an expires_in value of 600 (seconds? milliseconds?), unless MS detects that the token has already been used, in which case we would need to generate it again. As mentioned, that needs testing.

HoLengZai commented 8 years ago

Yes Indeed @alexylem

alexylem commented 8 years ago

Step 1 curl to get the Token: ✅ Working

https://www.microsoft.com/cognitive-services/en-us/subscriptions

image

key_1="***************************"
key_2="***************************"

curl -X POST "https://oxford-speech.cloudapp.net/token/issueToken" \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d "grant_type=client_credentials" \
     -d "client_id=$key_1" \
     -d "client_secret=$key_2" \
     -d "scope=https://speech.platform.bing.com"

gives:

{"access_token":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJhcGltLXVzZXItaWQiOiIwNjQzZWEyMzRiYzU0YTgzODM1MjMwNWZiYjY2YmRhNSIsImFwaW0tc3Vic2NyaXB0aW9uLWlkIjoiNWViNzIwYjVlZmY1NDljNDkyOTdmZmFmZWMzYzg2YjgiLCJhcGltLXVzZXItZW1haWwiOiJhbGV4YW5kcmUubWVseUBnbWFpbC5jb20iLCJhcGltLWtleSI6IjY5YzhmOWFjOGU2YTQyMTBiNDY3MDc0MWM1ZjVmNTQwIiwiY2xpZW50LWlkIjoiYmM0ZmFlNGZiMTIzNDVmOGFjMTg2MWJiOWVhOWJjZGMiLCJzY29wZSI6Imh0dHBzOi8vc3BlZWNoLnBsYXRmb3JtLmJpbmcuY29tIiwiaXNzIjoidXJuOm1zLm94Zm9yZCIsImF1ZCI6InVybjptcy5zcGVlY2giLCJleHAiOjE0Njk0Njc1MzR9.PWjWOh2Fkr_6MHYFMTpw7tY5a-iqb-EgaiALCvyAeWQ","token_type":"jwt","expires_in":"600","scope":"https://speech.platform.bing.com"}

EDIT: I share because it was not easy to find on the web, hope it will help others.

Smanar commented 8 years ago

The expiration time is in seconds.

alexylem commented 8 years ago

Yes I plan to use it to avoid useless calls to the token API (deducted from the quota!)
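A minimal sketch of that idea in Python, not the Jarvis implementation: `TokenCache`, its `fetch` callback and the 30-second safety margin are hypothetical names and values chosen for illustration. It reuses the token until shortly before the `expires_in` window (600 seconds in the reply above) runs out.

```python
import time


class TokenCache:
    """Reuse a token until shortly before it expires."""

    def __init__(self, fetch, margin=30, clock=time.time):
        self._fetch = fetch      # callable returning the token endpoint's parsed JSON
        self._margin = margin    # safety margin in seconds (assumed value)
        self._clock = clock      # injectable clock, handy for testing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            reply = self._fetch()
            self._token = reply["access_token"]
            # "expires_in" is a string ("600") in the reply shown above.
            self._expires_at = now + int(reply["expires_in"]) - self._margin
        return self._token


# Demonstration with a fake endpoint and a fake clock (no network involved).
calls = []

def fake_fetch():
    calls.append(1)
    return {"access_token": "token-%d" % len(calls), "expires_in": "600"}

now = [0.0]
cache = TokenCache(fake_fetch, clock=lambda: now[0])
first = cache.get()    # fetched
now[0] = 500.0
second = cache.get()   # still valid, reused
now[0] = 580.0
third = cache.get()    # inside the safety margin, fetched again
print(first, second, third, len(calls))
```

In a shell-based tool the same effect could be had by storing the token and its expiry timestamp in a file; the class above only illustrates the expiry arithmetic.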

alexylem commented 8 years ago

😞 struggling to curl Bing's recognize api (now that I have the token), I always get:

{"version":"3.0","header":{"status":"error","properties":{"requestid":"0cb1c3fa-c743-4991-9e93-3abc7d38d212"}}}

Maybe an issue with the file format I upload... (16k wav recorded with rec)

alexylem commented 8 years ago

😄 Ok I managed to get it working, It was indeed the encoding of the wav file...

Here is the working code in case someone else is looking for this (I have global variables, replace with appropriate values):

request="https://speech.platform.bing.com/recognize/query"
request+="?version=3.0"
request+="&requestid=$(uuidgen)" # generated
request+="&appid=D4D52672-91D7-4C74-8AD8-42B1D98141A5"
request+="&format=json"
request+="&locale=$language" # en-US
request+="&device.os=$platform" # osx
request+="&scenarios=ulm"
request+="&instanceid=E043E4FE-51EF-4B74-8133-B728C4FEA8AA" # jarvis instance id

# $stt_bing_token: token generated as in the post above; $audiofile: e.g. test.wav
# (a comment cannot follow a trailing backslash, so the notes are kept out of the command)
curl "$request" \
-H "Host: speech.platform.bing.com" \
-H "Content-Type: audio/wav; samplerate=16000" \
-H "Authorization: Bearer $stt_bing_token" \
--data-binary "@$audiofile" \
--silent --fail

which gives:

{"version":"3.0","header":{"status":"success","scenario":"ulm","name":"hello","lexical":"hello","properties":{"requestid":"0a64b573-da28-41ba-be00-2381dcdacc6c","HIGHCONF":"1"}},"results":[{"scenario":"ulm","name":"hello","lexical":"hello","confidence":"0.9498509","properties":{"HIGHCONF":"1"}}]}

If I don't face any new problem, Bing should come up in your Jarvis some time tomorrow 😉
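For readers who would rather not concatenate the query string by hand, the same request URL can be assembled with Python's standard library, which also percent-encodes the values. This is only a sketch: `build_request_url` is a hypothetical helper, the parameter values are copied from the bash snippet above, and nothing is sent over the network.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://speech.platform.bing.com/recognize/query"


def build_request_url(locale, device_os):
    """Assemble the recognize URL with the same parameters as the bash snippet."""
    params = {
        "version": "3.0",
        "requestid": str(uuid.uuid4()),  # plays the role of `uuidgen`
        "appid": "D4D52672-91D7-4C74-8AD8-42B1D98141A5",
        "format": "json",
        "locale": locale,
        "device.os": device_os,
        "scenarios": "ulm",
        "instanceid": "E043E4FE-51EF-4B74-8133-B728C4FEA8AA",
    }
    return ENDPOINT + "?" + urlencode(params)


url = build_request_url("en-US", "osx")
query = parse_qs(urlsplit(url).query)
print(query["locale"], query["device.os"])
```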

alexylem commented 8 years ago

That's it! Bing is now available in the Jarvis update; please select it via: Settings > Voice recognition > Recognition of commands

image

Updated comparison of speech recognition engines: https://github.com/alexylem/jarvis/wiki/stt

image

The recommended choice will be changed within a few days, after feedback from the Jarvis user community 😄

New page explaining how to get the Bing keys: https://github.com/alexylem/jarvis/wiki/bing

HoLengZai commented 8 years ago

Excellent, Alex. Sorry, it is difficult for me to develop during weekdays. Please don't forget that there are 2 optional options for the Bing Speech API: the n-best matches and "mean words" (profanity). I will check your GitHub posts tonight.

Thanks

alexylem commented 8 years ago

I checked the doc and chose to use the name from the header. I may disable the profanity checker in a future release.

echo "$json" | perl -lne 'print $1 if m{"name":"([^"]*)"}'
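The regular expression above stops at the first double quote, so it would truncate a name that contains an escaped quote. Whether Bing ever returns such a name is not established in this thread; the sample below is hypothetical. Where a JSON parser is available, a sketch like this (with the hypothetical helper `header_name`) avoids that edge case:

```python
import json
import re


def header_name(raw):
    """Return header.name, or None when the field is absent."""
    return json.loads(raw).get("header", {}).get("name")


# Hypothetical response whose name contains escaped double quotes.
sample = r'{"header":{"status":"success","name":"say \"hi\"","lexical":"say hi"}}'

by_regex = re.search(r'"name":"([^"]*)"', sample).group(1)  # same pattern as the perl one-liner
by_parser = header_name(sample)
print(repr(by_regex))
print(repr(by_parser))
```

The trade-off is the extra dependency: the perl one-liner keeps Jarvis free of a JSON parser, which matches the stated goal of limiting dependencies.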

My philosophy is to push new features as soon as they seem to work, then improve them based on community feedback 😄

Smanar commented 8 years ago

Really useful ^^, thx. Another thing that could be useful: Google supports only FLAC (44100) and PCM (16000), but on Bing we can use other sample rates; we are not forced to use 16000.

alexylem commented 8 years ago

For the sake of simplicity, Jarvis uses the same encoding for all Speech to Text engines, so the same recorded wav can be sent to any of Google, PocketSphinx, Wit and Bing. Fortunately there is a common format they all support 😄
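Since the earlier recognize failure in this thread came down to the wav encoding, a quick check of the recorded file's header can save a debugging round. The sketch below uses Python's standard `wave` module; `wav_sample_rate` is a hypothetical helper, and the mono, 16-bit settings of the demo file are assumptions (the thread only states a 16000 Hz sample rate).

```python
import io
import wave


def wav_sample_rate(source):
    """Return the sample rate (Hz) stored in a WAV header; source is a path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


# Build a tiny 16 kHz WAV in memory so the check runs without a microphone.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)      # mono (assumed)
    out.setsampwidth(2)      # 16-bit samples (assumed)
    out.setframerate(16000)  # matches "samplerate=16000" in the Content-Type header
    out.writeframes(b"\x00\x00" * 160)
buffer.seek(0)
print(wav_sample_rate(buffer))
```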

Smanar commented 8 years ago

Ha yes, right.

Smanar commented 8 years ago

I did a really quick test (not much free time this week) and for the moment I get better results with the Bing engine. For example, try the string "Deconnectes toi" on both. What is your result?

HoLengZai commented 8 years ago

Hi @Smanar,

I'm going to try it now..

@alexylem ... I have just updated Jarvis with Bing. Thanks :-) FYI: in the Voice recognition menu you ask for Bing key1 and key2, but you should ask for a single Bing key and, as I mentioned in my source code comments, use the same one for both, so that if you renew one of them you won't impact Jarvis. Microsoft provides 2 keys in case you have different projects, or want to lend one key to someone for testing before renewing it. The config menu will then be simpler with only one key.

Thanks

alexylem commented 8 years ago

@LengZai During my tests I came to the wrong conclusion that we had to use both. Let me test with one, and if it indeed works I'll update it as suggested.

alexylem commented 8 years ago

damned!! it works 😄 Updating the code right now... you will have to re-enter your key because I will rename the internal variable from bing_key1 to bing_key (I like code consistency)

HoLengZai commented 8 years ago

Haha, no problem. That is what I discovered during my tests in Python. All the examples on the Internet use the previous Bing Speech API with Project Oxford. Since MS changed it to make it "simpler", yes indeed, only one key is needed :) Thanks for the (future) update.

alexylem commented 8 years ago

The update is done already 😄