emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

AbstractNLPDecoder and Tokenizer makes character encoding assumption #21

Open dlutz2 opened 7 years ago

dlutz2 commented 7 years ago

The various decode operations in AbstractNLPDecoder and its underlying tokenizer use String.getBytes(), which converts the String to bytes using the platform's default character set. This can corrupt the text if the default character set differs from the one the data is expected to be in. It will occur on Windows for any UTF-8 data beyond the ASCII range, since the Windows default character set is typically CP-1252. Using operations that take an explicit character set, such as InputStreamReader, will avoid this.
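As a minimal sketch of the failure mode described above (not code from nlp4j), the following encodes a String to bytes and decodes it again with a given charset. The class and method names are hypothetical, and windows-1252 is passed explicitly to simulate a Windows default, on the assumption that the JDK in use ships that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of nlp4j.
class CharsetRoundTrip {

    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source-file encoding.
        String text = "Europe (\u0642\u0627\u0631\u0629)";

        // Assumption: windows-1252 is available in this JDK; it stands in for a Windows default.
        Charset cp1252 = Charset.forName("windows-1252");

        // UTF-8 can represent every character, so the text survives.
        System.out.println("UTF-8 intact:   " + text.equals(roundTrip(text, StandardCharsets.UTF_8)));   // true
        // CP-1252 has no Arabic letters, so they are lost on encoding.
        System.out.println("CP-1252 intact: " + text.equals(roundTrip(text, cp1252)));                    // false
    }
}
```

Calling getBytes() with no argument behaves like the second case whenever the JVM default happens to be CP-1252, which is why the result depends on where the code runs.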

jdchoi77 commented 7 years ago

Thanks for the comment; could you please give an example of where you think this needs to be fixed using InputStreamReader? We'll evaluate it and apply the update. Thanks.

dlutz2 commented 7 years ago

The simple test below produces the expected results when run on a platform whose default character set is UTF-8, or when the character set is set explicitly (-Dfile.encoding=UTF-8). Running it on Windows without explicitly setting the character set uses the OS default character set (equivalent to using -Dfile.encoding=windows-1252) and will garble the non-Latin characters. Note that running this in a development environment like Eclipse may not show the error, since Eclipse automatically adds the -Dfile.encoding property to the invocation. The reference to InputStreamReader was just a suggestion; you could also do something like someString.getBytes(someCharSet), as long as the Strings/Streams/Files are read with an explicit character set. It would be nice if this character set were a parser/tokenizer config option. If it must be hardcoded, then UTF-8 would likely be the best guess. Thanks.

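import java.io.IOException;
import java.nio.charset.Charset;
import java.util.List;

import edu.emory.mathcs.nlp.common.util.IOUtils;
import edu.emory.mathcs.nlp.component.template.node.NLPNode;
import edu.emory.mathcs.nlp.decode.NLPDecoder;

// The enclosing class declaration was missing from the post; the name here is a placeholder.
public class CharsetTest {

    public static void main(String[] args) throws IOException {
        // Show which default charset the JVM picked up (e.g. UTF-8 or windows-1252).
        System.out.println("Default Charset=" + Charset.defaultCharset());

        String configFile = "src/main/resources/org/opensextant/relish/config-decode-en.xml";
        NLPDecoder parser = new NLPDecoder(IOUtils.createFileInputStream(configFile));

        // Mixed Latin/Arabic input; the Arabic part is garbled when the default charset is not UTF-8.
        String text = "We live in Europe (قارة اوروبة).";

        // Decode the text and print every node of every sentence.
        List<NLPNode[]> sentences = parser.decodeDocument(text);
        for (NLPNode[] sentence : sentences) {
            for (NLPNode node : sentence) {
                System.out.println(node);
            }
        }
    }
}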
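
The two alternatives mentioned in the comment above, someString.getBytes(someCharSet) and InputStreamReader with an explicit character set, could be sketched roughly as follows. This is an illustration only, not nlp4j's actual code: the class and method names are hypothetical, and the charset parameter stands in for the configurable option suggested above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of nlp4j.
class ExplicitCharsetIO {

    // Turn a String into a byte stream using an explicit charset
    // instead of the platform default used by String.getBytes().
    static InputStream toStream(String text, Charset charset) {
        return new ByteArrayInputStream(text.getBytes(charset));
    }

    // Read the whole stream back into a String, decoding with an explicit charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: UTF-8 as the hardcoded fallback, as suggested in the comment above.
        Charset charset = StandardCharsets.UTF_8;
        // Unicode escapes keep the example independent of the source-file encoding.
        String text = "We live in Europe (\u0642\u0627\u0631\u0629).";

        String decoded = readAll(toStream(text, charset), charset);
        System.out.println("Round trip intact: " + text.equals(decoded)); // true
    }
}
```

Because the charset is passed in at both the encoding and decoding steps, the result no longer depends on -Dfile.encoding or on the operating system's default.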