daanzu / kaldi-active-grammar

Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time
GNU Affero General Public License v3.0
339 stars 50 forks

Alternate dictation source: Dragon / Natlink #23

Open daanzu opened 4 years ago

daanzu commented 4 years ago

I don't have experience with Natlink, and don't currently have Dragon installed, but I'd be happy to help implementing this.

Is there a way with Natlink to just get straight dictation recognition text from audio data passed to it?

dwks commented 4 years ago

For dragonfly, I think it would be something like this for "blah" or "dictate blah":

from dragonfly import *
class DictationRule(CompoundRule):
    spec = ('[dictate] [<dictation>]')
    extras = [Dictation(name='dictation')]
    exported = False

    def value(self, node):
        words = node.words()
        print('You said "' + " ".join(words) + '"')
        #return Text(" ".join(words))  # causes chars to be typed
        return Text("")

grammar = Grammar("root rule")
grammar.add_rule(DictationRule())  # Add the top-level rule.
grammar.load()  # Load the grammar.

def unload():
    """Unload function which will be called at unload time."""
    global grammar
    if grammar:
        grammar.unload()
    grammar = None

See also https://github.com/dwks/aenea-grammar-simple/blob/master/words.py

For pure Natlink, I imagine you can use CatchAllGrammar to observe everything: https://github.com/dictation-toolbox/natlink/blob/master/SampleMacros/_repeatthat.py

You can import <dgndictation> if you want to get straight dictation matches only: https://github.com/dictation-toolbox/natlink/blob/master/SampleMacros/_sample8.py
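For the straight-dictation route, I imagine a minimal grammar would look roughly like the sketch below, using the classic natlinkutils.GrammarBase pattern from the sample macros. I haven't tested this against Dragon, so treat the class and callback wiring as an assumption:

```python
# Hypothetical sketch (untested): a Natlink grammar that imports Dragon's
# built-in <dgndictation> rule to receive free dictation results only.

DICTATION_SPEC = """
    <dgndictation> imported;
    <dictation> exported = <dgndictation>;
"""

def words_to_text(words):
    """Join recognised words into a plain transcript string."""
    return " ".join(words)

def make_grammar():
    # Imported lazily so the pure helpers above work without Dragon running.
    from natlinkutils import GrammarBase

    class DictationGrammar(GrammarBase):
        gramSpec = DICTATION_SPEC

        def initialize(self):
            self.load(self.gramSpec)
            self.activateAll()

        # Natlink calls gotResults_<rulename> with the matched words.
        def gotResults_dgndictation(self, words, fullResults):
            print('Dictation: "%s"' % words_to_text(words))

    return DictationGrammar()
```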

Here's an example of capturing all natlink rules and saving audio plus transcript, it might be helpful: https://github.com/dwks/dragonfly-save-audio/blob/master/_natlink_save_audio.py

[edit] I can help with testing but @Danesprite is the real expert in this area. [/edit]

drmfinlay commented 4 years ago

Hello @daanzu and @dwks. Natlink has a file transcription function that could help you to do this: natlink.inputFromFile(). Unfortunately, it isn't working properly for me. It looks like Dragon itself isn't accepting the file I've specified.

Anyway, according to its documentation, Dragon is supposed to take its input from the specified wave file instead and process it just like it processes audio from the mic. So if the function were working, you could use something like dwks' code in a separate process to get at the recognition results.

You could also use SAPI 5 for an alternate dictation source. It can accept file or object audio input streams according to Microsoft's AudioInputStream documentation. This would be a separate issue of course and maybe not worth the headache of working with COM.

daanzu commented 4 years ago

Thanks for the info!

FYI, the alternate dictation interface currently allows for performing the recognition on just parts of an utterance: for example "say hello world", where only "hello world" is the dictation and "say" is the command portion.

That inputFromFile function sounds very nice. Too bad it doesn't seem to be working. Being able to pass audio in would definitely be ideal, either directly or via a file, rather than letting Dragon handle the microphone. I am also not sure what kinds of problems might arise if we need to run inside the Dragon process.

Ugh, don't remind me of SAPI AudioInputStream. I wasted weeks of my life banging away at it, attempting to build a more flexible microphone interface to WSR, but the result was far too janky to be practical. Microsoft didn't make it very workable. Although that was for real-time use; perhaps the file method would be better, or at least workable. However, from what I've heard, I doubt WSR's accuracy would be desirable anyway.

shervinemami commented 4 years ago

Have you guys heard of DragonBench (http://www.rwilke.de/dragonbench/)? It can load audio files into Dragon, though I'm not sure how it does it, and it's not open source. But I've chatted with the author before; he's quite friendly, so he may be willing to explain how he sends data to Dragon. DragonBench has been around for some years, so it's probably using a relatively old (and reliable) technique that also works on older versions of Dragon.

drmfinlay commented 4 years ago

Nice! I didn't realise your alternate dictation interface allowed for that.

That's a shame SAPI AudioInputStream doesn't really work properly. Though as you said, WSR's accuracy isn't great. Probably not worth the effort.

I did manage to get the inputFromFile function kind of working, but ran into further complications when it caused a segfault. Despite that, Dragon did process the recorded audio, and quickly too! It seems the first failure was because it wanted a wave file with 16kHz 16-bit mono audio.

For reference, this is the code I used in a separate Python script:

import natlink

natlink.natConnect()
try:
    natlink.inputFromFile("testing.wav", 0, [], 0)
finally:
    natlink.natDisconnect()

@shervinemami DragonBench looks pretty neat. It could be using the same Dragon COM interfaces that Natlink uses internally for inputFromFile. This function is probably not used much. It doesn't segfault with DNS 10, so the code probably just needs to be updated for more recent versions. That would be an issue for one of the Natlink maintainers though.

BTW I wouldn't recommend using inputFromFile in-process via natspeak.exe because the segfault will render Dragon unresponsive.

daanzu commented 4 years ago

@Danesprite Thanks for trying to get it to work! 16kHz 16-bit mono is perfect, since that is what Kaldi is already using/recording.
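Since the formats line up, I picture the bridge looking something like the sketch below: wrap Kaldi's raw PCM in a WAV container and hand it to Dragon. The alternate-dictation callable's signature (raw 16kHz 16-bit mono PCM bytes in, transcript string out) and the function names are my assumptions, not confirmed API:

```python
# Sketch of a possible Kaldi -> Dragon dictation bridge (assumptions noted above).
import wave

def write_wav(path, pcm_bytes, rate=16000):
    """Wrap raw 16-bit mono PCM in the WAV container Dragon expects."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)    # mono
        f.setsampwidth(2)    # 16-bit samples
        f.setframerate(rate)
        f.writeframes(pcm_bytes)

def dragon_dictation(audio_data):
    """Hypothetical alternate-dictation callable: write the utterance to
    disk, feed it to Dragon (e.g. via the inputFromFile subprocess), and
    return the transcript."""
    write_wav("utterance.wav", audio_data)
    raise NotImplementedError("feed utterance.wav to Dragon here")

# Hypothetical wiring, if the engine accepts a callable for this option:
# engine = get_engine("kaldi", alternate_dictation=dragon_dictation)
```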

drmfinlay commented 4 years ago

No worries. It does almost work properly. You could ask about the function in the natlink gitter channel.

drmfinlay commented 4 years ago

I wrote a natlink_file_input.py workaround script for using inputFromFile. It should be run as a subprocess expected to crash. The Windows segmentation/page fault error window that would normally appear is suppressed by use of the Win32 SetErrorMode function.

Despite how much of a hack it is, it works pretty well. Sometimes I need to turn the microphone off before the file's audio is processed, which is not a big deal. I hope it's useful :-)
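On the caller side, I'd imagine a wrapper along these lines: launch the crash-prone script as a child process and treat a nonzero exit status (the expected segfault) as normal. The script path and the idea of returning whatever the child printed are assumptions based on the description above:

```python
# Sketch of a parent-side wrapper for a child process expected to crash.
import subprocess
import sys

def transcribe_file(wav_path, script="natlink_file_input.py", timeout=30):
    """Run the Natlink file-input script out of process. A nonzero exit
    status is tolerated; the transcript (if any) is whatever the child
    managed to print before dying."""
    proc = subprocess.run(
        [sys.executable, script, wav_path],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout
```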