this is neat! you could consider wrapping pliers (https://github.com/tyarkoni/pliers), which would allow you to easily switch between different speech transcription services (Google, IBM, Wit.ai, etc.), though at the cost of a few dependencies.
Idea from Tal Yarkoni (via Twitter):