OpenASR / idiolect

🎙️ Handsfree Audio Development Interface
https://arxiv.org/pdf/2305.03089.pdf
Apache License 2.0

Replace CMUSphinx #52

Closed breandan closed 1 year ago

breandan commented 6 years ago

Integrate mozilla/deepspeech ~once real-time streaming inference is implemented~. Depends on mozilla/deepspeech#847, mozilla/deepspeech#1275

edit: It appears that DeepSpeech now supports a streaming API with Java bindings, but only for Android. JVM bindings were unfortunately rejected.

breandan commented 5 years ago

Picovoice might be a better candidate for replacing CMUSphinx, given the complexity of DeepSpeech integration and the added latency of running an HTTP server. According to @kenarsa, it looks like they're planning to release an STT demo shortly?

kenarsa commented 5 years ago

That is the plan. We are aiming for an initial release within the next month.

nalbion commented 4 years ago

I've made a start on this: https://github.com/OpenASR/idear/commit/a945ad405eb7e5a69d5f3abf0c306966d93372d2

breandan commented 4 years ago

@kenarsa Can you clarify what features the Picovoice license supports for open source projects like ours? In particular, it looks like Cheetah has some restrictions around using custom vocabularies:

Allows adding new words and adapting to different contexts (Available only under the commercial license).

Are there plans to allow open source projects to train their own models or deploy models trained via the Picovoice console? Any guidance you can provide for our use case (small to medium vocabulary custom language models) would be highly appreciated. Thanks!

nalbion commented 4 years ago

Actually @kenarsa, I am a bit concerned about the limitations of Picovoice:

5 contexts, each with a limit of 50 unique words and 150 total words, 5 intents, and 5 slots.

https://picovoice.ai/console/

IntelliJ has a large number of actions that could potentially be invoked by speech-to-intent:

kenarsa commented 4 years ago

The license terms are set and we don't have plans to change them. For the console, you can look into the terms here: https://picovoice.ai/docs/terms-of-use/ and each product's GitHub repo explains the license and limitations. Our offerings are geared towards mid-to-large scale enterprises. I hope it helps. Cheers!

breandan commented 4 years ago

@kenarsa Thank you for your reply. It seems like Picovoice primarily targets enterprise IoT and edge computing, where these license limitations make sense. I understand why these are necessary, but the operating system restrictions and lack of grammar support for noncommercial applications are not really compatible with our project's goals.

@nalbion It might make sense to focus our efforts on improving DeepSpeech for Java, which already offers Android bindings and seems to be a better fit for desktop applications. You might want to check out @GommeAntiLegit's Java library, or see if there is a way to work with mozilla/DeepSpeech#2166 to upstream an API for other JVM users.

edit: There is also vosk-api which is built on Kaldi (a much more widely used platform in the ASR community), and run by the former maintainer of CMUSphinx, @nshmyrev. It appears to have an early Java API and possibly some form of grammar support a la JSGF/CMUSphinx. Given Mozilla's hesitation to support JVM bindings it might be a better target.

nshmyrev commented 4 years ago

@breandan We have word lists you can simply use to improve accuracy, like here in Python. I might make you jars if you need them too, let me know the platforms you need to support.

breandan commented 4 years ago

Thanks, it looks like a promising replacement. Our goal is to generate contextual grammars for various IDE menus. This should be possible by listening for keywords, but it would be nice to support custom language models, using an approach like #6.
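As a rough sketch of the keyword approach, the snippet below turns a list of IDE action names into a JSON array of lowercase phrases, which is the shape of word list that the Vosk Python examples pass to the recognizer. The class name `PhraseListBuilder`, the sample action names, and the assumption that the Java recognizer accepts the same JSON string are all illustrative and not taken from this project or verified against the Vosk Java API.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: builds a JSON word list from IDE action names.
final class PhraseListBuilder {

    /** Converts a name such as "GotoDeclaration" into the phrase "goto declaration". */
    static String toPhrase(String actionName) {
        return actionName
                .replaceAll("([a-z0-9])([A-Z])", "$1 $2") // split camelCase boundaries
                .replaceAll("[^A-Za-z0-9 ]", " ")         // drop punctuation and underscores
                .trim()
                .replaceAll("\\s+", " ")                  // collapse repeated spaces
                .toLowerCase(Locale.ROOT);
    }

    /** Returns a JSON array of unique phrases, ending with the "[unk]" catch-all token. */
    static String toJsonArray(List<String> actionNames) {
        Set<String> phrases = new LinkedHashSet<>();
        for (String name : actionNames) {
            String phrase = toPhrase(name);
            if (!phrase.isEmpty()) {
                phrases.add(phrase);
            }
        }
        // Assumption: "[unk]" marks out-of-vocabulary speech, as in the Vosk Python examples.
        phrases.add("[unk]");
        // Phrases contain only [a-z0-9 ] plus "[unk]", so no JSON escaping is needed here.
        return phrases.stream()
                .map(p -> "\"" + p + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
    }
}
```

A new list could be generated whenever a different menu or context becomes active, so the recognizer only has to distinguish between the phrases that are currently valid.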

I might make you jars if you need them too, let me know the platforms you need to support.

That would be great. I was just creating a ticket here: alphacep/vosk-api#95. I guess Linux_x64 and whatever @nalbion uses for development would be a good place to start.

nalbion commented 4 years ago

I'm using Windows x64

nalbion commented 4 years ago

@breandan how would you provide the .so/.dll libraries with the plugin?

breandan commented 4 years ago

The way I’ve seen this done before is to bundle the .dll/.dylib/.so for all platforms in the same JAR, then use that JAR as a plugin dependency. I think this project does that: I can run the same native library from a single JAR on multiple platforms, as long as the platform matches one of the compiler targets.

breandan commented 4 years ago

If you’re compiling manually, you may need to determine the correct binary to use based on a syscall. I’m not actually sure how that works in JavaCPP (e.g. in TraceJump I use some of the javacpp-presets and it works out of the box on Mac and Linux).

breandan commented 4 years ago

Z3-TurnKey is a great example of how to build a single JAR with multiplatform native libraries bundled inside. If you inspect the JAR it contains prebuilt native binaries for OSX/Linux/Windows. Here is the logic for dispatching to the correct library at runtime. The Gradle build logic may also be helpful.
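For reference, here is a minimal sketch of that kind of runtime dispatch, not the actual Z3-TurnKey code. It assumes a hypothetical resource layout of `/native/<os>-<arch>/` inside the JAR; the class name `NativeLibraryLoader` and the directory names are made up for illustration. It picks the resource from `os.name` and `os.arch`, copies it to a temporary file, and loads it with `System.load`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

// Hypothetical loader for native libraries bundled inside a JAR.
final class NativeLibraryLoader {

    /** Maps os.name/os.arch values to a classpath resource (assumed layout, not a standard). */
    static String resourcePath(String osName, String osArch, String libName) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // The JVM reports 64-bit x86 as "amd64" or "x86_64" depending on the platform.
        String cpu = (arch.equals("amd64") || arch.equals("x86_64")) ? "x86_64" : arch;
        if (os.startsWith("windows")) {
            return "/native/windows-" + cpu + "/" + libName + ".dll";
        }
        if (os.startsWith("mac")) {
            return "/native/macos-" + cpu + "/lib" + libName + ".dylib";
        }
        if (os.startsWith("linux")) {
            return "/native/linux-" + cpu + "/lib" + libName + ".so";
        }
        throw new UnsupportedOperationException("Unsupported platform: " + osName + " " + osArch);
    }

    /** Extracts the matching library to a temporary file and loads it into the JVM. */
    static void load(String libName) throws IOException {
        String resource = resourcePath(
                System.getProperty("os.name"), System.getProperty("os.arch"), libName);
        try (InputStream in = NativeLibraryLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No bundled library at " + resource);
            }
            String suffix = resource.substring(resource.lastIndexOf('.'));
            Path tmp = Files.createTempFile(libName, suffix);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // System.load needs an absolute path on the local filesystem.
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```

Calling `NativeLibraryLoader.load("recognizer")` once before the first native call would be enough under these assumptions; a missing binary for the current platform fails with an `IOException` instead of an `UnsatisfiedLinkError` later on.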

daanzu commented 4 years ago

I just saw your comment on Hacker News, and replied with a mention of my project, which may be useful to yours. Let me know if you want to work together!

breandan commented 4 years ago

@daanzu Thank you for reaching out! Your grammar swapping functionality does indeed look like what we were originally looking for in #6. Since our tool is currently JVM based, we'll need a way to distribute a single artifact without asking users to install other dependencies (Python, Kaldi, etc.). Do you have any plans to publish a standalone REST/RPC server or external language bindings?

daanzu commented 4 years ago

@breandan Hmm, the current version of KaldiAG has a large Python component (the rest being C++, along with Kaldi itself). Python speeds development, and the pip infrastructure makes distributing the binaries (for all major platforms) easy. Other language bindings are still possible, but would involve a fair bit of implementation and maintenance work. However, I have been considering whether to try to move more of the implementation into C++.

On the other hand, adding a REST/RPC server and interface to the current Python package shouldn't be too hard. I don't really know much of anything about how JVM distribution works nowadays, especially regarding binaries. How would that work?

breandan commented 4 years ago

I don't really know much of anything about how JVM distribution works nowadays, especially regarding binaries. How would that work?

Ideally, KaldiAG would publish a headless, batteries-included executable for Windows, OSX, and Linux, and we would include all three and just launch the correct one based on the OS. There would need to be some way to swap grammars, but otherwise it shouldn't be too difficult.
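On the plugin side, launching the right executable might look roughly like the sketch below. Everything specific here is hypothetical: the executable names, the `--grammar` flag, and the class name `RecognizerLauncher` are placeholders, since no such KaldiAG binary or command-line interface exists yet.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical launcher for a bundled, per-OS recognizer executable.
final class RecognizerLauncher {

    /** Picks the bundled executable for the given os.name value (file names are made up). */
    static String executableName(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return "recognizer-windows.exe";
        }
        if (os.startsWith("mac")) {
            return "recognizer-macos";
        }
        if (os.startsWith("linux")) {
            return "recognizer-linux";
        }
        throw new UnsupportedOperationException("Unsupported OS: " + osName);
    }

    /** Builds the command line; "--grammar" is an assumed flag for choosing the active grammar. */
    static List<String> command(Path binDir, String osName, Path grammarFile) {
        return Arrays.asList(
                binDir.resolve(executableName(osName)).toString(),
                "--grammar",
                grammarFile.toString());
    }

    /** Starts the recognizer as a child process with stderr merged into stdout. */
    static Process start(Path binDir, Path grammarFile) throws IOException {
        return new ProcessBuilder(command(binDir, System.getProperty("os.name"), grammarFile))
                .redirectErrorStream(true)
                .start();
    }
}
```

With a command-line flag like this, swapping grammars would mean restarting the process; a long-running server that accepts grammar changes over REST/RPC would avoid the restart cost.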

The alternative is to publish native libraries which we could call into via JNI. This is probably more difficult to implement as most of the code is written in Python, but should be doable.