dwks / silvius

Kaldi-based speech recognition system + grammar
http://voxhub.io/silvius
BSD 2-Clause "Simplified" License

Automatically add rules for all terminals to specific, annotated rules. #29

Open jvanloov opened 5 years ago

jvanloov commented 5 years ago

Words used as terminals in rules don't match the ANY token type, which makes them unusable in "plain English" contexts ("word", "phrase", "sentence"). This becomes a problem as the number of rules grows and more and more words become "reserved" as tokens. The limitation can be overcome by adding explicit rules, as was already done for the numbers "one", "two", etc. in the "raw_word" rule.
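For reference, the existing hand-written workaround looks roughly like this in a spark grammar (paraphrased; the exact rule text in silvius may differ):

    # Explicit alternatives let the number words still be dictated as plain
    # words even though they are also command tokens.
    def p_raw_word(self, args):
        """ raw_word ::= ANY
            raw_word ::= one
            raw_word ::= two"""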

To remove the tedium of adding these rules, the commit in this PR adds some logic to handle this automatically.

Because spark uses docstrings to specify rules, and docstrings cannot be appended to at runtime, a function decorator is used. Additionally, collecting the tokens is done most easily on an already-instantiated parser (the implementation for this was already present, too). Finally, it was not clear to me whether spark can be instructed to revisit the docstrings and regenerate its set of rules.
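A minimal sketch of the decorator idea, assuming spark reads each rule method's __doc__ when the parser class is instantiated; the names here are illustrative, not the exact code in this PR:

    def append_keyword_rules(keywords):
        # Extend a spark rule method's docstring with one alternative per keyword.
        def decorate(func):
            extra = "".join("\n        raw_word ::= %s" % k for k in keywords)
            func.__doc__ = (func.__doc__ or "") + extra
            return func
        return decorate

    @append_keyword_rules(["delete", "insert"])  # auto-collected terminals go here
    def p_raw_word(self, args):
        """ raw_word ::= ANY"""

    print(p_raw_word.__doc__)
    # raw_word ::= ANY
    #         raw_word ::= delete
    #         raw_word ::= insert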

The easiest way to work around the docstring limitation was to add a step to the setup process (sketched in code below):

  1. instantiate parser
  2. collect keywords
  3. instantiate the parser again, augmented with rules auto-generated for the keyword list collected in step 2 (the parser instantiated in step 1 is discarded)
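In code, the three steps might look like this (CoreParser and find_terminals are placeholder names here, not the actual silvius identifiers):

    class CoreParser:
        # Placeholder for the spark-based silvius parser class.
        def __init__(self, extra_keywords=()):
            # In silvius this would build the spark rule tables; here we only
            # record which extra "plain word" rules were requested.
            self.extra_keywords = list(extra_keywords)

        def find_terminals(self):
            return ["delete", "insert"]  # stand-in for scanning rule docstrings

    tmp = CoreParser()               # 1. instantiate the parser
    keywords = tmp.find_terminals()  # 2. collect the keywords
    parser = CoreParser(keywords)    # 3. instantiate again, with auto rules;
                                     #    tmp is simply discarded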

(Issue #19 describes this problem as well.)

dwks commented 5 years ago

Programmatically generating the doc strings is a tricky problem. I like your solution, though eventually it would of course be best not to instantiate the grammar twice. Before I merge this, can you verify that grammar/lm.py still runs? It probably also needs to instantiate the grammar twice. That script generates all the n-gram sequences that can occur in the grammar, so that the speech system can boost the probabilities of those sequences in its language model (this requires offline training).
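As background, extracting the n-grams from a word sequence is the easy part; a toy illustration (the real grammar/lm.py enumerates the word sequences the grammar can actually produce):

    def ngrams(words, n):
        # All length-n windows over a word sequence.
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(ngrams("sentence hello world".split(), 2))
    # [('sentence', 'hello'), ('hello', 'world')]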

dwks commented 5 years ago

P.S. Some command words, like "comma" or "ctrl", should be allowed in an English context. For example: "sentence hello world bang control sierra". At least, that is how I envisioned the system. That's why I explicitly listed tokens like "one" that would be recognized by the lexer but would not actually start a valid command. I'm not sure how your patch behaves in this case; just something to think about.

jvanloov commented 5 years ago

lm.py: some small additional modifications are needed to make this work; I'm looking into it.

Currently on the master branch, the "word", "phrase" and "sentence" commands will fail if command words are used inside them (apart from "one"..."nine"). With this patch, all command words are treated like ANY words inside "word", "phrase" and "sentence", i.e. they are spelled out in the output (for example, "phrase hello comma" would type out "hello comma" rather than failing).

By "[...] should be allowed in an english context. [...]", I assume you mean "should be allowed to be used as commands in an english context" (e.g., "sentence hello world bang" should come out as "Hello world!")

I see why you'd want to allow certain commands inside "phrase" and "sentence" (and inside "word", although personally I'd want to use the "word" rule to be able to spell out reserved words).

My current thinking is to add an exclusion list to the function decorator for this, so you can tell the system which tokens it should not auto-create rules for. Tokens for new rules will still be picked up automatically, unless you specifically indicate in the exclusion list(s) that you intend to do something special with them in that rule/context.
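One possible shape for that exclusion list (decorator and token names are hypothetical here, not a final API):

    def auto_terminal_rules(exclude=()):
        # Mark tokens that should NOT get an auto-generated "plain word"
        # alternative in this rule's context. Sketch only; the real decorator
        # would also emit the generated rules.
        def decorate(func):
            func.excluded_tokens = set(exclude)  # read later by the rule generator
            return func
        return decorate

    @auto_terminal_rules(exclude=["comma", "control"])  # keep these as commands
    def p_phrase(self, args):
        """ phrase ::= ..."""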

jvanloov commented 5 years ago

lm.py works.

I moved find_terminals into the parser and rearranged the init code a bit, so the parser no longer needs to be instantiated twice.
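A rough sketch of what that single-pass setup could look like, assuming find_terminals scans the p_* rule docstrings (the actual commit may differ):

    class AutoRuleParser:
        # Stands in for the spark GenericParser subclass in silvius.
        def find_terminals(self):
            # Candidate terminals: lowercase words that appear on the
            # right-hand side of some rule but never on a left-hand side.
            lhs, rhs = set(), set()
            for name in dir(self):
                if not name.startswith("p_"):
                    continue
                for line in (getattr(self, name).__doc__ or "").splitlines():
                    if "::=" not in line:
                        continue
                    head, body = line.split("::=", 1)
                    lhs.add(head.strip())
                    rhs.update(w for w in body.split() if w.islower())
            return rhs - lhs

    class Demo(AutoRuleParser):
        def p_raw_word(self, args):
            """ raw_word ::= ANY
                raw_word ::= one"""

    print(Demo().find_terminals())  # {'one'}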