Crucial for our demo: grammar too big?

TheresaSchmidt commented 5 years ago

Describe the bug

We have created a very large grammar. In Silent Mode the Sphinx node works fine, but in speech recognition mode our dialogue breaks down while the speech recognizer is loading, and DialogOS crashes completely. I am attaching a minimal example with exactly the same grammar, which shows similar behaviour. Instead of crashing, however, it shows an error message: huge_grammar_error

The error.log file seems (to me) to be the same for our dialogue and the attached example. For the example it starts like this:

WARNING: Unexpected XML protocol element "messages"
WARNING: Unexpected XML protocol element "version"
ExtensibleDictionary
Fehler im Knoten
   anleitung_zutaten
in huge_grammar.dos
java.lang.RuntimeException: Allocation of search manager resources failed
    at edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager.allocate(SimpleBreadthFirstSearchManager.java:651)
    at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
    at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
    at edu.cmu.lti.dialogos.sphinx.client.ConfigurableSpeechRecognizer.<init>(ConfigurableSpeechRecognizer.java:25)
    at edu.cmu.lti.dialogos.sphinx.client.SphinxContext.getRecognizer(SphinxContext.java:84)
    at edu.cmu.lti.dialogos.sphinx.client.Sphinx.startImpl(Sphinx.java:43)
    at com.clt.speech.recognition.AbstractRecognizer.startLiveRecognition(AbstractRecognizer.java:136)
    at com.clt.speech.recognition.AbstractRecognizer.startLiveRecognition(AbstractRecognizer.java:111)
    at edu.cmu.lti.dialogos.sphinx.plugin.SphinxRecognitionExecutor.lambda$start$0(SphinxRecognitionExecutor.java:43)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: bad base grammar URL data:#JSGF V1.0 UTF-8;

grammar null;

<next> = 
   was ist der nächste Schritt
 | nächster Schritt
 | weiter
;

Then there's the rest of the grammar. It ends like this:

/.gram
    at edu.cmu.sphinx.jsgf.JSGFGrammar.commitChanges(JSGFGrammar.java:274)
    at edu.cmu.sphinx.jsgf.JSGFBaseGrammar.createGrammar(JSGFBaseGrammar.java:293)
    at edu.cmu.sphinx.linguist.language.grammar.Grammar.allocate(Grammar.java:112)
    at edu.cmu.sphinx.linguist.dflat.DynamicFlatLinguist.allocate(DynamicFlatLinguist.java:189)
    at edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager.allocate(SimpleBreadthFirstSearchManager.java:647)
    ... 12 more

To Reproduce

Please attach a minimal example dialog exposing the bug if applicable. Steps to reproduce the behavior:

1. Run this dialogue: huge_grammar.zip
2. The error appears.
3. (Run the dialogue in Silent Mode. There will be no error. For example, the grammar recognizes "Einkaufszettel" but reports "No match for recognition result" for "Hallo".)

Expected behavior

The behaviour described in step 3 should occur not only in Silent Mode but also with speech recognition.

Installation information

OS: Windows 7
Version 2.0.5
Installer

alexanderkoller commented 5 years ago

@timobaumann As I understand the code in the Sphinx plugin, you are sending the grammar to Sphinx in a data: URL. Is it possible that there is a length limit on data: URLs? Would it be feasible to write the grammar to a temporary file and pass a file URL instead?

alexanderkoller commented 5 years ago

Also: I have observed this bug today. When @TheresaSchmidt says "DialogOS crashing completely", she means that DialogOS becomes unresponsive and needs to be terminated via task manager.

TheresaSchmidt commented 5 years ago

> @timobaumann As I understand the code in the Sphinx plugin, you are sending the grammar to Sphinx in a data: URL. Is it possible that there is a length limit on data: URLs? Would it be feasible to write the grammar to a temporary file and pass a file URL instead?

I have no insight into this but I've been thinking. Does the URL hypothesis seem plausible considering the fact that everything works just fine in Silent Mode?

alexanderkoller commented 5 years ago

Yes, in "silent mode", the Sphinx speech recognizer is not used at all.

TheresaSchmidt commented 5 years ago

Oh, now I get it :)

timobaumann commented 5 years ago

@alexanderkoller Yes, that's a plausible hypothesis, although the limit should really be very high. We should have a unit test that runs recognition with successively larger grammars to find it.
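A minimal sketch of the "successively larger grammars" idea, under stated assumptions: it only checks that a generated JSGF grammar survives a URL encode/decode round trip using the JDK's `URLEncoder` and `URLDecoder`, and it does not call Sphinx or DialogOS at all. The class name `GrammarSizeProbe`, the grammar name `probe`, and the `wortN` tokens are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class GrammarSizeProbe {
    /** Builds a simple JSGF grammar with n alternatives (hypothetical content). */
    static String buildGrammar(int n) {
        StringBuilder sb = new StringBuilder("#JSGF V1.0 UTF-8;\n\ngrammar probe;\n\npublic <item> =\n");
        for (int i = 0; i < n; i++) {
            sb.append(i == 0 ? "   " : " | ").append("wort").append(i).append('\n');
        }
        return sb.append(";\n").toString();
    }

    /** Returns true if the grammar text is unchanged after URL encoding and decoding. */
    static boolean survivesUrlRoundTrip(String grammar) {
        try {
            String encoded = URLEncoder.encode(grammar, "UTF-8");
            return grammar.equals(URLDecoder.decode(encoded, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Double the number of alternatives each round to look for a size limit.
        for (int n = 1_000; n <= 128_000; n *= 2) {
            String grammar = buildGrammar(n);
            System.out.println(n + " alternatives, " + grammar.length()
                    + " chars, round trip ok: " + survivesUrlRoundTrip(grammar));
        }
    }
}
```

A real test would additionally hand each generated grammar to the recognizer; that part is left out here because it depends on the Sphinx plugin's internals.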

akoehn commented 5 years ago

newGrammarFromJSGF is currently used with a URL as input, but it could also be called with an InputStream or a Reader. In fact, the version taking a URL first converts that. It would be great to convert the data to one of those classes and use that rather than the current String -> data-URL -> BufferedInputStream -> InputStreamReader conversion.
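To illustrate the difference between the two routes, here is a small sketch that uses only JDK classes. It assumes nothing about Sphinx: `GrammarReaders` and its methods are hypothetical names, and how the resulting `Reader` would be handed to `newGrammarFromJSGF` through the Sphinx layers is exactly the open question in this comment.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class GrammarReaders {
    /** Direct route: the grammar string is wrapped in a Reader, no encoding step involved. */
    static Reader direct(String grammar) {
        return new StringReader(grammar);
    }

    /** Longer route: String -> bytes -> BufferedInputStream -> InputStreamReader. */
    static Reader viaBytes(String grammar) {
        byte[] bytes = grammar.getBytes(StandardCharsets.UTF_8);
        return new InputStreamReader(
                new BufferedInputStream(new ByteArrayInputStream(bytes)), StandardCharsets.UTF_8);
    }

    /** Reads a Reader to the end and closes it. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String grammar = "public <next> = n\u00e4chster Schritt | weiter;";
        System.out.println(readAll(direct(grammar)).equals(readAll(viaBytes(grammar))));
    }
}
```

Both routes yield the same characters; the direct one simply has no URL encoding or charset conversion step in which special characters could go wrong.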

If only I knew how to actually pipe that through all the layers of sphinx :-/

FTR, I also tried to rewrite DataURLHelper.encodeData to write to a temp file and return a URL to that one but some part of sphinx tries to add a ".gram" to the URL and then (obviously) fails.

timobaumann commented 5 years ago

Yes, that's the grammar reader in Sphinx. How about giving your files names ending in .gram and leaving that suffix out of the URL? :-)
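A rough sketch of that suggestion, using only `java.nio.file`: write the grammar to a temporary file whose name ends in `.gram`, then hand out the directory URL and the file name without the suffix, so that whoever appends ".gram" ends up at the real file. `TempGrammarFile` and `Location` are hypothetical names, and whether Sphinx also requires the grammar name declared inside the file to match the file name has not been checked here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TempGrammarFile {
    /** A base URL (the directory) plus the grammar name without the ".gram" suffix. */
    static final class Location {
        final URL baseUrl;
        final String grammarName;

        Location(URL baseUrl, String grammarName) {
            this.baseUrl = baseUrl;
            this.grammarName = grammarName;
        }
    }

    static Location write(String grammarText) {
        try {
            Path dir = Files.createTempDirectory("dialogos-grammar");
            Path file = Files.createTempFile(dir, "grammar", ".gram");
            Files.write(file, grammarText.getBytes(StandardCharsets.UTF_8));
            // deleteOnExit removes entries in reverse registration order: file first, then directory.
            dir.toFile().deleteOnExit();
            file.toFile().deleteOnExit();
            String fileName = file.getFileName().toString();
            String name = fileName.substring(0, fileName.length() - ".gram".length());
            return new Location(dir.toUri().toURL(), name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulates the consumer: resolve base + name + ".gram" and read the file back. */
    static String readBack(Location location) {
        try {
            Path file = Paths.get(location.baseUrl.toURI()).resolve(location.grammarName + ".gram");
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException | java.net.URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Location location = write("#JSGF V1.0 UTF-8;\n\ngrammar g;\n\npublic <next> = weiter;\n");
        System.out.println(location.baseUrl + " + " + location.grammarName + ".gram");
        System.out.println(readBack(location));
    }
}
```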

akoehn commented 5 years ago

I could try that, but using temporary files really seems like a hack to me, and I can't find where the ".gram" is added to the URL, so I can't tell whether I'm actually doing the right thing.

I would prefer to use a stream so we don't generate lots of temporary files and have another surface for potential problems.

akoehn commented 5 years ago

It is in JSGFGrammar.grammarNameToURL, and I don't really know how to handle that. It assumes weird things (going to the base directory of the grammar, then searching for name + ".gram"), and the only way I can imagine the data approach actually working is that it fails spectacularly and the catch block then does its thing.

All in all: I don't see an easy fix without knowledge about sphinx.

timobaumann commented 5 years ago

The test in 8428981e518ebad0ccaec0ad8a1bb1dec67a6c40 works very well with data URLs that hold 1 MB of text. (They do get slow at 4 MB, at about 1/3 of a second per URL encoding/decoding.)

timobaumann commented 5 years ago

Don't put % into your grammar, as in:

 \&quot;Kokosmilch (9 % Fett)\&quot; 
 \&quot;Mozzarella (9 % Fett)\&quot; 
 \&quot;Quark (20 % Fett)\&quot; 
 \&quot;Sauerrahm (15 % Fett)\&quot; 
 \&quot;Schichtkäse (10 % Fett)\&quot; 
 \&quot;Schinkenwürfel (2 % Fett)\&quot; 
 \&quot;Schokolade (70 % Kakaogehalt)\&quot;

I didn't bother to check whether other odd characters also make DialogOS tell you that your grammar is bad.
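One plausible mechanism, offered as an assumption and not confirmed anywhere in this thread: if the grammar text travels inside a data: URL and is URL-decoded on the way out, a raw `%` that is not followed by two hex digits makes the JDK's `URLDecoder` throw, and a raw `+` is decoded as a space. The sketch below demonstrates only that JDK behaviour; `PercentInDataUrl` and `escapeForDecoding` are hypothetical names, and whether DialogOS decodes its data: URLs this way is not verified.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PercentInDataUrl {
    static String decode(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Returns false if URL decoding rejects the text, e.g. because of a stray '%'. */
    static boolean decodes(String text) {
        try {
            decode(text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Escapes the two characters that URL decoding would otherwise reinterpret. */
    static String escapeForDecoding(String raw) {
        // '%' first, so that the '%' introduced by "%2B" is not escaped a second time.
        return raw.replace("%", "%25").replace("+", "%2B");
    }

    public static void main(String[] args) {
        String token = "Quark (20 % Fett)";
        System.out.println("raw decodes: " + decodes(token));
        System.out.println("escaped decodes to: " + decode(escapeForDecoding(token)));
        System.out.println("raw plus decodes to: " + decode("a+b"));
    }
}
```

If this is indeed the failure path, escaping `%` in the same place where pluses are escaped would be one way to let such tokens through; avoiding them in the grammar remains the simpler option.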

alexanderkoller commented 5 years ago

Thanks!

So to summarize, do I understand the situation correctly as follows:

1. The problem was not the URL length; the original DialogOS code works correctly for grammars that are megabytes in size.
2. Grammars cannot contain percent signs.

If this is right, then we should document this in the manual. Do you know why percent signs are a problem? Do they have special meaning in Sphinx? Are there other tokens that might plausibly cause similar problems?

timobaumann commented 5 years ago

I simply tested whether megabytes of content encoded in data URLs survive. We'd have to check the size limit for Sphinx grammar parsing and construction, but I'd expect it to be high. (After all, the search graph in an SLM easily has 10000s of entries at every branch, and the people behind JavaCC probably know their business.)
I believe it's the percent signs. However, @TheresaSchmidt could potentially also have a stray " or ' somewhere in there, which will definitely break things. @TheresaSchmidt, you could debug this by gradually adding more and more of your intended grammar until it breaks. You'll probably see the stray signs along the way (the % immediately caught my eye).
The JSGF specification actually allows percent signs, even in rule names. This may thus be an implementation issue in Sphinx' JSGFParser. Maybe "quoting" helps to resolve it. However, I'd strongly advise against anything that isn't easily pronounceable. To get quotes as part of a token, you need to quote the token and escape the quote sign, like "\"so\"" -- but why would you ever want to do that??
What's allowed is determined by https://github.com/cmusphinx/sphinx4/blob/master/sphinx4-core/src/main/java/edu/cmu/sphinx/jsgf/parser/JSGFParser.java and https://github.com/cmusphinx/sphinx4/blob/master/sphinx4-core/src/main/resources/edu/cmu/sphinx/jsgf/parser/jsgf.jj . (Looking at that again, it's more likely that some quoting went wrong in the example above, because the parser appears to deal with %.)
Essentially, Sphinx was trying to say "there's something wrong with the grammar at <URL>". However, in our case the URL is a very long string that ends in /.gram (see below).
For reference: the path we're taking is via Sphinx' configuration management. That only allows us to re-set the base path of the URL to the grammar, as well as the actual base name of the grammar file. Sphinx automatically constructs the URL to the grammar by taking basepath/basename.gram (see JSGFGrammar; the individual parts are strings).
The overall flow in Theresa's example is: DialogOS String variable -> DialogOS SRGS grammar representation -> string -> data-url -> string -> data-url/.gram -> inputstream (string-based) -> Sphinx JSGF grammar. No, I don't like it, it's just the only way I could make it work without changing Sphinx. The serialization from DialogOS seems to ignore the original ordering of rules in the grammar (but that shouldn't matter much).
We simply leave the grammar name unspecified, replace the URL with a data URL, and call its openStream. The data URL is robust against having /.gram added to the end and otherwise returns the content (either plain, as we use it, or base64-encoded). During decoding, weird stuff happens with + (URL decoding treats a plus as an encoded space), so we escape pluses to %2b.
@akoehn, if you want to shortcut this, you could come up with a new URL scheme (say, cache:) that creates IDs (hashes?) of grammars (or their inputstreams) that you store in a cache. Then, upon seeing the URL with your hash (plus potentially /.gram at the end), you return the inputstream.
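The cache: idea from the last point could look roughly like the sketch below. It is an illustration under assumptions, not DialogOS code: `GrammarCache` and its methods are hypothetical, the ID is a name-based UUID derived from the grammar bytes, and the handler is attached only to URLs built through `toUrl`. For code that builds URLs from plain strings (as Sphinx would), the handler would additionally have to be registered globally, for example via `URL.setURLStreamHandlerFactory`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class GrammarCache {
    private static final Map<String, byte[]> STORE = new ConcurrentHashMap<>();
    private static final String SUFFIX = "/.gram";

    private static final URLStreamHandler HANDLER = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) {
            return new URLConnection(u) {
                @Override
                public void connect() { }

                @Override
                public InputStream getInputStream() throws IOException {
                    byte[] data = STORE.get(idOf(getURL().getPath()));
                    if (data == null) {
                        throw new IOException("no cached grammar for " + getURL());
                    }
                    return new ByteArrayInputStream(data);
                }
            };
        }
    };

    /** Strips a trailing "/.gram" so both cache:ID and cache:ID/.gram resolve to the same entry. */
    static String idOf(String path) {
        return path.endsWith(SUFFIX) ? path.substring(0, path.length() - SUFFIX.length()) : path;
    }

    /** Stores the grammar and returns a cache: URL whose ID is derived from the content. */
    static URL register(String grammarText) {
        byte[] bytes = grammarText.getBytes(StandardCharsets.UTF_8);
        String id = UUID.nameUUIDFromBytes(bytes).toString();
        STORE.put(id, bytes);
        return toUrl("cache:" + id);
    }

    static URL toUrl(String spec) {
        try {
            return new URL(null, spec, HANDLER);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Opens the given cache: URL and returns its content as text. */
    static String read(String spec) {
        try (InputStream in = toUrl(spec).openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL url = register("#JSGF V1.0 UTF-8;\n\ngrammar g;\n\npublic <next> = weiter;\n");
        System.out.println(read(url.toExternalForm() + "/.gram"));
    }
}
```

Because the grammar bytes are stored as-is, no URL encoding or decoding takes place, so characters such as `%` and `+` would not need any escaping on this route.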

dialogos-project / dialogos

Crucial for our demo: grammar too big? #173