Speech recognizer can't handle umlauts under Windows

alexanderkoller commented 5 years ago

On Windows, speech recognizer nodes throw an exception upon execution if a word contains an umlaut. To reproduce, create a speech recognizer node with a DirectGrammar and language=German, add an InputWord with an umlaut in it, and click on "Try Recognition".

The stacktrace and a screenshot are below.

@timobaumann, is there any chance that you can get access to a Windows machine and have a look at it urgently?

Error:
class edu.cmu.sphinx.jsgf.parser.TokenMgrError
edu.cmu.sphinx.jsgf.parser.TokenMgrError: Lexical error at line 6, column 6.  Encountered: "\u00b6" (182), after : ""

Details:
edu.cmu.sphinx.jsgf.parser.JSGFParserTokenManager.getNextToken(JSGFParserTokenManager.java:1187)
edu.cmu.sphinx.jsgf.parser.JSGFParser.jj_ntk(JSGFParser.java:998)
edu.cmu.sphinx.jsgf.parser.JSGFParser.item(JSGFParser.java:636)
edu.cmu.sphinx.jsgf.parser.JSGFParser.sequence(JSGFParser.java:559)
edu.cmu.sphinx.jsgf.parser.JSGFParser.alternatives(JSGFParser.java:473)
edu.cmu.sphinx.jsgf.parser.JSGFParser.RuleDeclaration(JSGFParser.java:438)
edu.cmu.sphinx.jsgf.parser.JSGFParser.GrammarUnit(JSGFParser.java:299)
edu.cmu.sphinx.jsgf.parser.JSGFParser.newGrammarFromJSGF(JSGFParser.java:122)
edu.cmu.sphinx.jsgf.parser.JSGFParser.newGrammarFromJSGF(JSGFParser.java:232)
edu.cmu.sphinx.jsgf.JSGFGrammar.loadNamedGrammar(JSGFGrammar.java:321)
edu.cmu.sphinx.jsgf.JSGFGrammar.commitChanges(JSGFGrammar.java:237)
edu.cmu.sphinx.jsgf.JSGFBaseGrammar.createGrammar(JSGFBaseGrammar.java:293)
edu.cmu.sphinx.linguist.language.grammar.Grammar.allocate(Grammar.java:112)
edu.cmu.sphinx.linguist.dflat.DynamicFlatLinguist.allocate(DynamicFlatLinguist.java:189)
edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager.allocate(SimpleBreadthFirstSearchManager.java:647)
edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
edu.cmu.lti.dialogos.sphinx.client.ConfigurableSpeechRecognizer.<init>(ConfigurableSpeechRecognizer.java:25)
edu.cmu.lti.dialogos.sphinx.client.SphinxContext.getRecognizer(SphinxContext.java:82)
edu.cmu.lti.dialogos.sphinx.client.Sphinx.startImpl(Sphinx.java:40)
com.clt.speech.recognition.AbstractRecognizer.startLiveRecognition(AbstractRecognizer.java:140)
com.clt.speech.recognition.AbstractRecognizer.startLiveRecognition(AbstractRecognizer.java:111)
edu.cmu.lti.dialogos.sphinx.plugin.SphinxRecognitionExecutor.lambda$start$0(SphinxRecognitionExecutor.java:43)
java.util.concurrent.FutureTask.run(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)

screenshot
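
One plausible reading of the `"\u00b6" (182)` in the error message, offered as an assumption rather than a confirmed diagnosis: the letter ö is encoded in UTF-8 as the two bytes 0xC3 0xB6. If those bytes are later decoded as windows-1252, the second byte turns into ¶ (U+00B6, decimal 182), which a JSGF tokenizer would not accept inside a word. The sketch below only demonstrates that byte-level effect; `MojibakeDemo` and `misdecode` are hypothetical names and are not part of DialogOS or Sphinx.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes the same bytes with a different charset. */
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "\u00f6" is the letter o-umlaut, written as an escape so that the
        // source file itself does not depend on the compiler's encoding.
        String garbled = misdecode("\u00f6", cp1252);
        for (char c : garbled.toCharArray()) {
            // Prints "U+00C3 (195)" and then "U+00B6 (182)".
            System.out.printf("U+%04X (%d)%n", (int) c, (int) c);
        }
    }
}
```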

timobaumann commented 5 years ago

I'm unassigning myself from non-utf-8-environment bugs, sorry.

timobaumann commented 5 years ago

Can someone confirm on Windows that this fixes the issue? Thank you. If so, we should probably also think about adding encodings to all other grammar.export() types.

timobaumann commented 5 years ago

I really don't get it. The problematic code is:

            case JSGFwithGarbage:
                if (out instanceof PrintWriter) {
                    w = (PrintWriter) out;
                } else {
                    w = new PrintWriter(new BufferedWriter(out));
                }
                w.println("#JSGF V1.0 UTF-8;");
                w.println();
                break;

It used not to contain the UTF-8 encoding remark. However, PrintWriter is supposed to use the system's default encoding, so the change I've implemented looks dangerous. In particular, I don't know whether we ever use the second code path (the one with BufferedWriter()) and whether I've now broken it. Please, could someone double-check?
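
For context on the two code paths: a `PrintWriter` wrapped around another `Writer` only passes characters along and does no byte encoding itself. A charset matters only at the point where characters become bytes, for example in an `OutputStreamWriter`. The following is a minimal sketch under that assumption, not the DialogOS code; `GrammarHeaderDemo`, `writeGrammar`, `toStringPath` and `toUtf8Bytes` are hypothetical names, and the grammar content is made up for illustration.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class GrammarHeaderDemo {
    /** Writes a tiny JSGF grammar; mirrors the instanceof check in the quoted snippet. */
    static void writeGrammar(Writer out, String word) {
        PrintWriter w = (out instanceof PrintWriter)
                ? (PrintWriter) out
                : new PrintWriter(new BufferedWriter(out));
        w.println("#JSGF V1.0 UTF-8;");
        w.println("grammar demo;");
        w.println("public <word> = " + word + ";");
        w.flush();
    }

    /** Character-only path: no charset is involved when the sink is a StringWriter. */
    static String toStringPath(String word) {
        StringWriter sw = new StringWriter();
        writeGrammar(sw, word);
        return sw.toString();
    }

    /** Byte path: the charset is stated explicitly so it matches the header. */
    static byte[] toUtf8Bytes(String word) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeGrammar(new OutputStreamWriter(bytes, StandardCharsets.UTF_8), word);
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String word = "f\u00fcnf"; // "fünf", written as an escape
        System.out.println(toStringPath(word).contains(word));
        System.out.println(new String(toUtf8Bytes(word), StandardCharsets.UTF_8).contains(word));
    }
}
```

Under these assumptions, the "UTF-8" in the header is only a claim about bytes; it becomes relevant once something downstream converts the grammar text to or from bytes.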

timobaumann commented 5 years ago

Funnily enough, this is happening before sphinx-4 is involved.

alexanderkoller commented 5 years ago

Is there any way to test this without building a new release of DialogOS? If I just do "gradlew run", there is no German recognizer model, right?

alexanderkoller commented 5 years ago

Another useful piece of info: It is possible to modify the default charset in Java after the JVM has been started, see http://araklefeistel.blogspot.com/2015/10/set-fileencoding-in-jvm.html

akoehn commented 5 years ago

> Is there any way to test this without building a new release of DialogOS?

A simple hack would be to copy install4j/models/sphinx4-de-de-uhh.jar from dialogos-distribution to your dialogos directory and temporarily add it to your runtime dependencies as done in the dialogos-distribution build.gradle.

timobaumann commented 5 years ago

I just used the combination äöü. The problem was in the Grammar after all, so it doesn't matter whether the lexicon will be able to pronounce it. If you want to test thoroughly, you can add a word with an umlaut to the pronunciation exception dictionary, e.g.: Motörhead -> M OW T ER HH EH D

timobaumann commented 5 years ago

I guess what I wanted to write was: why use German? Any language will do.

alexanderkoller commented 5 years ago

Yes, I know. :) That's a good point.

The bug seems to be fixed (I tried an English DirectGrammar with umlauts on Windows 10). I will now make a test release and see if the installed version, with a German DirectGrammar, works too.

I do have one question: If I understand your fix correctly, DialogOS generates a grammar for Sphinx, and after the fix says explicitly in the grammar file that the grammar is encoded in UTF-8. But given that the default character encoding under Windows is CP1252 and we use the default encoding in the Writer, why is the grammar file actually encoded in UTF-8?

What are your feelings towards enforcing UTF-8 as the default encoding in DialogOS? The code I linked to in my comment above seems to be working correctly, and I'd be happy to push it to GitHub.

alexanderkoller commented 5 years ago

Hi,

I did some testing on Windows. Your fix makes the error message go away, but I can't get the (German) speech recognizer to recognize any words with umlauts on Windows. In a DirectGrammar with the words "vier", "fünf", and "Mühle", the recognizer got "vier" just fine, but never recognized "fünf" and "Mühle".

This is extra mysterious because the Writer into which we are writing the JSGF grammar is just a StringWriter, so I don't even understand why the character encoding would make a difference. Except if Sphinx then secretly writes the grammar to a file after all, or something like that.

I have now changed the main() method of DialogOS as a whole to enforce UTF-8 as the default charset. This seems to fix the problem. At any rate, I can now get "fünf" and "Mühle" to be recognized correctly on Windows. I ended up deciding to hardcode the encoding programmatically because we sometimes run DialogOS through Gradle and sometimes after running the installer, and I didn't want to introduce extra discrepancies.
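
The linked post and the actual DialogOS change are not reproduced in this thread. The sketch below shows a widely circulated trick of this kind and is an assumption about what was meant, not the committed code. It relies on a private implementation detail of `java.nio.charset.Charset` (a cached static field named `defaultCharset`, as found in Java 8), so it is written to fail gracefully: newer JDKs block this reflective access, and from JDK 18 on UTF-8 is the default anyway. `ForceUtf8` and `tryForceUtf8` are hypothetical names.

```java
import java.lang.reflect.Field;
import java.nio.charset.Charset;

class ForceUtf8 {
    /**
     * Tries to make UTF-8 the JVM default charset at runtime.
     * Returns true only if the cached default could be reset.
     */
    static boolean tryForceUtf8() {
        try {
            System.setProperty("file.encoding", "UTF-8");
            // ASSUMPTION: Charset caches the default in a private static
            // field called "defaultCharset" (an internal detail of Java 8).
            Field cache = Charset.class.getDeclaredField("defaultCharset");
            cache.setAccessible(true);
            cache.set(null, null); // clear the cache so the property is read again
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Field missing, access denied, or blocked by module encapsulation.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean reset = tryForceUtf8();
        System.out.println("cache reset: " + reset);
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

A less intrusive alternative is to pass `-Dfile.encoding=UTF-8` on the JVM command line, although that has to be configured separately for each way of launching the application.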

akoehn commented 5 years ago

> What are your feelings towards enforcing UTF-8 as the default encoding in DialogOS?

It should be enforced not only because of this problem but also for the sake of the inter-OS compatibility of files saved with dialogos.

timobaumann commented 5 years ago

> It should be enforced not only because of this problem but also for the sake of the inter-OS compatibility of files saved with dialogos.

We only ever write XML, which contains a notice about the encoding used.

alexanderkoller commented 5 years ago

And are we making sure that we are writing the true encoding into the XML, and not just "UTF-8"?

Nonetheless: Do you see a problem with hard-coding the use of UTF-8 into DialogOS?
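
To illustrate the concern about the declared versus the true encoding: one way to keep the two in sync is to derive the XML declaration from the same `Charset` object that encodes the bytes. This is a minimal sketch under that assumption and not how DialogOS writes its files; `XmlEncodingDemo`, `writeXml` and `readBack` are hypothetical names, and the text is not XML-escaped for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlEncodingDemo {
    /** Writes a one-element document whose declaration names the charset actually used. */
    static byte[] writeXml(String text, Charset cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, cs)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + cs.name() + "\"?>\n");
            w.write("<word>" + text + "</word>\n"); // no escaping of < or & here
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Parses the bytes; the parser picks the charset from the XML declaration. */
    static String readBack(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String word = "M\u00fchle"; // "Mühle", written as an escape
        System.out.println(readBack(writeXml(word, StandardCharsets.UTF_8)).equals(word));
        System.out.println(readBack(writeXml(word, StandardCharsets.ISO_8859_1)).equals(word));
    }
}
```

Because the declaration and the encoder share one `Charset`, the word round-trips under either encoding; a mismatch would only arise if the header were hard-coded while the writer used the platform default.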

timobaumann commented 5 years ago

I don't see a problem, no. Any UTF-8 problems in the XML would have bitten us (or CLT) before, I think. Is the vier fünf Mühlen thing still an issue?

alexanderkoller commented 5 years ago

No, the vier fünf Mühlen works, which is why I closed the issue. The fix is in 2.0.3.

dialogos-project / dialogos

Speech recognizer can't handle umlauts under Windows #108