Closed: alexanderkoller closed this issue 5 years ago
I'm unassigning myself from non-utf-8-environment bugs, sorry.
Can someone confirm on Windows that this fixes the issue? Thank you. If so, we should probably also think about adding encodings to all the other grammar.export() types.
I really don't get it. The problematic code is:
    case JSGFwithGarbage:
        if (out instanceof PrintWriter) {
            w = (PrintWriter) out;
        } else {
            w = new PrintWriter(new BufferedWriter(out));
        }
        w.println("#JSGF V1.0 UTF-8;");
        w.println();
        break;
It used to not contain the UTF-8 encoding remark. However, PrintWriter is supposed to use the system's default encoding, so the change I've implemented looks dangerous. In particular, I don't know whether we ever use the second code path (the one with BufferedWriter()) and whether I've now broken it. Could someone please double-check?
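To make the concern above concrete, here is a minimal sketch (not DialogOS code; the class name `JsgfHeaderDemo` and the sample rule are made up for illustration). A PrintWriter that wraps another Writer only passes characters through. The bytes are decided by whatever charset the underlying byte-producing writer uses, so the `UTF-8` in the header is just a claim about the data. ISO-8859-1 stands in for CP1252 here, on the assumption that only the umlauts matter; both encode them as the same single byte.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JsgfHeaderDemo {

    // Mirrors the snippet above: reuse a PrintWriter, otherwise wrap the Writer.
    static PrintWriter wrap(Writer out) {
        if (out instanceof PrintWriter) {
            return (PrintWriter) out;
        }
        return new PrintWriter(new BufferedWriter(out));
    }

    // Writes the header and one hypothetical rule, then returns the raw bytes.
    // The header always says UTF-8; the actual bytes depend on `charset`.
    static byte[] export(Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintWriter w = wrap(new OutputStreamWriter(bytes, charset));
        w.print("#JSGF V1.0 UTF-8;\n");
        w.print("public <word> = f\u00fcnf;\n"); // the German word for "five", with u-umlaut
        w.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = export(StandardCharsets.UTF_8);
        byte[] latin1 = export(StandardCharsets.ISO_8859_1);
        // The umlaut takes two bytes in UTF-8 and one in ISO-8859-1.
        System.out.println("UTF-8 bytes:      " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        // A consumer that trusts the header decodes as UTF-8.
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("umlaut survives:  " + decoded.contains("f\u00fcnf"));
    }
}
```

If the grammar only ever ends up in a StringWriter, none of this applies, because no bytes are produced at that point.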
Funnily enough, this happens before sphinx-4 is even involved.
Is there any way to test this without building a new release of DialogOS? If I just do "gradlew run", there is no German recognizer model, right?
Another useful piece of info: It is possible to modify the default charset in Java after the JVM has been started, see http://araklefeistel.blogspot.com/2015/10/set-fileencoding-in-jvm.html
> Is there any way to test this without building a new release of DialogOS?
A simple hack would be to copy install4j/models/sphinx4-de-de-uhh.jar from dialogos-distribution to your dialogos directory and temporarily add it to your runtime dependencies, as done in the dialogos-distribution build.gradle.
I just used the combination äöü. The problem was in the Grammar after all so it doesn't matter whether the lexicon will be able to pronounce it.
If you want to test thoroughly, you can add a word with an umlaut to the pronunciation exception dictionary, e.g.: Motörhead -> M OW T ER HH EH D
I guess what I wanted to write was: why use German? Any language will do.
Yes, I know. :) That's a good point.
The bug seems to be fixed (I tried an English DirectGrammar with umlauts on Windows 10). I will now make a test release and see if the installed version, with a German DirectGrammar, works too.
I do have one question: If I understand your fix correctly, DialogOS generates a grammar for Sphinx, and after the fix says explicitly in the grammar file that the grammar is encoded in UTF-8. But given that the default character encoding under Windows is CP1252 and we use the default encoding in the Writer, why is the grammar file actually encoded in UTF-8?
What are your feelings towards enforcing UTF-8 as the default encoding in DialogOS? The code I linked to in my comment above seems to be working correctly, and I'd be happy to push it to Github.
Hi,
I did some testing on Windows. Your fix makes the error message go away, but I can't get the (German) speech recognizer to recognize any words with umlauts on Windows. In a DirectGrammar with the words "vier", "fünf", and "Mühle", the recognizer got "vier" just fine, but never recognized "fünf" and "Mühle".
This is extra mysterious because the Writer into which we are writing the JSGF grammar is just a StringWriter, so I don't even understand why the character encoding would make a difference. Except if Sphinx then secretly writes the grammar to a file after all, or something like that.
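One way the "written to bytes somewhere after all" hypothesis would produce exactly this symptom is sketched below. This is speculation, not a trace of what Sphinx does: `CharsetMismatchDemo` and `roundTrip` are made-up names, and ISO-8859-1 is used as a stand-in for the Windows default CP1252 (the two agree on the umlauts). If text is encoded with one charset and decoded with another, pure-ASCII words such as "vier" survive, while words with umlauts do not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // as would happen if writer and reader disagree about the encoding.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        // "vier", "fuenf" and "Muehle" spelled with real umlauts via escapes.
        String[] words = {"vier", "f\u00fcnf", "M\u00fchle"};
        for (String word : words) {
            String seen = roundTrip(word, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
            // Print only ASCII so the console encoding cannot confuse the result.
            System.out.println(word.length() + " chars, intact after mismatch: " + word.equals(seen));
        }
    }
}
```

A StringWriter on its own never goes through this conversion, so the sketch only matters if some later step turns the grammar into bytes.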
I have now changed the main() method of DialogOS as a whole to enforce UTF-8 as the default charset. This seems to fix the problem. At any rate, I can now get "fünf" and "Mühle" to be recognized correctly on Windows. I ended up deciding to hardcode the encoding programmatically because we sometimes run DialogOS through Gradle and sometimes after running the installer, and I didn't want to introduce extra discrepancies.
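For readers who want to see what "enforcing UTF-8 programmatically" can look like, here is a hedged sketch. It is an assumption that the linked post and the actual DialogOS change work along these lines; the class and method names are hypothetical. The trick relies on a private JDK field (`defaultCharset` in `java.nio.charset.Charset`), which is an implementation detail: it works on older JVMs such as Java 8, while newer JDKs block the reflective access, so the sketch falls back to reporting whether UTF-8 is in effect.

```java
import java.lang.reflect.Field;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ForceUtf8 {

    // Tries to make UTF-8 the JVM default charset at runtime.
    // Returns true if Charset.defaultCharset() is UTF-8 afterwards.
    static boolean forceUtf8Default() {
        System.setProperty("file.encoding", "UTF-8");
        try {
            // ASSUMPTION: relies on a private JDK field, not on public API.
            // Clearing the cached value makes older JDKs re-read file.encoding.
            Field cached = Charset.class.getDeclaredField("defaultCharset");
            cached.setAccessible(true);
            cached.set(null, null);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Newer JDKs refuse this access; nothing more can be done here.
        }
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // Would be called first thing in the application's main().
        boolean ok = forceUtf8Default();
        System.out.println("default charset: " + Charset.defaultCharset()
                + (ok ? "" : " (could not switch to UTF-8)"));
    }
}
```

The non-programmatic alternative is to start the JVM with `-Dfile.encoding=UTF-8`. Since JDK 18 the default charset is UTF-8 on all platforms anyway, so the workaround mainly concerns older runtimes.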
> What are your feelings towards enforcing UTF-8 as the default encoding in DialogOS?
It should be enforced not only because of this problem but also for the sake of the inter-OS compatibility of files saved with DialogOS.
> It should be enforced not only because of this problem but also for the sake of the inter-OS compatibility of files saved with DialogOS.
We only ever write XML, which contains a declaration of the encoding used.
And are we making sure that we are writing the true encoding into the XML, and not just "UTF-8"?
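A small sketch of what "writing the true encoding" would mean in practice (not taken from the DialogOS sources; `XmlEncodingDemo` and its methods are hypothetical): the charset passed to the writer and the name in the XML declaration come from the same variable, so they cannot drift apart. The standard JAXP parser then decodes the bytes according to the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class XmlEncodingDemo {

    // Writes a tiny XML document whose declaration names the charset
    // that is actually used to produce the bytes.
    // Note: `text` is not XML-escaped here; fine for plain words only.
    static byte[] writeXml(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charset)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write("<word>" + text + "</word>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Parses the bytes; the parser picks the charset from the declaration.
    static String readText(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String word = "M\u00fchle"; // "Muehle" with a real u-umlaut
        for (Charset cs : new Charset[] {StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1}) {
            String back = readText(writeXml(word, cs));
            System.out.println(cs.name() + " round trip intact: " + word.equals(back));
        }
    }
}
```

If the declaration were hard-coded to "UTF-8" while the writer used the platform default, the same round trip would break on a CP1252 system as soon as a non-ASCII character appears.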
Nonetheless: Do you see a problem with hard-coding the use of UTF-8 into DialogOS?
I don't see a problem, no. Any UTF-8 problems in the XML would have bitten us (or CLT) before, I think. Is the vier fünf Mühlen thing still an issue?
No, the vier fünf Mühlen works, which is why I closed the issue. The fix is in 2.0.3.
On Windows, speech recognizer nodes throw an exception upon execution if a word contains an umlaut. To reproduce, create a speech recognizer node with a DirectGrammar and language=German, add an InputWord with an umlaut in it, and click on "Try Recognition".
The stacktrace and a screenshot are below.
@timobaumann, is there any chance that you can get access to a Windows machine and have a look at it urgently?