cltl / morphosyntactic_parser_nl

Morphosyntactic parser for Dutch based on the Alpino parser
Apache License 2.0

unicode in comments #9

Closed vanatteveldt closed 8 years ago

vanatteveldt commented 8 years ago

The parser creates comments for the relations to make it easier to trace them. If these comments contain unusual Unicode, however, the Java parser chokes on them (see https://bugs.openjdk.java.net/browse/JDK-8072081).

Although this is not strictly a problem caused by the parser (it is a Java bug triggered by the IXA NERC module), I think the easiest solution is to strip or escape "strange" Unicode characters in the parser step.

Session and example files: https://gist.github.com/vanatteveldt/6492fc3b97ba6f2a87c81462c71fe8a2
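To make the "strip" option concrete, here is a minimal sketch in Python 3. It assumes that "strange" means code points above U+FFFF (outside the Basic Multilingual Plane), such as emoji; that reading is an assumption, not something confirmed in this thread. The function names `has_supplementary` and `strip_supplementary` are hypothetical and are not part of the parser.

```python
def has_supplementary(text):
    """Return True if text contains any code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


def strip_supplementary(text):
    """Drop every code point above U+FFFF (assumed meaning of "strange")."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


if __name__ == "__main__":
    comment = "Ik \U0001F633 Alpino !"
    print(has_supplementary(comment))    # the emoji is U+1F633
    print(strip_supplementary(comment))  # emoji removed, both spaces kept
```

Stripping loses information in the comment, so escaping may be preferable if the comments are meant to stay traceable.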

rubenIzquierdo commented 8 years ago

I cannot reproduce the error. It works OK for me…


Ruben Izquierdo Bevia, Vrije University of Amsterdam, ruben.izquierdobevia@vu.nl, http://rubenizquierdobevia.com/


vanatteveldt commented 8 years ago

Do you mean that the java call to ixa-pipe-nerc works fine?

What is your java version?

$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

rubenIzquierdo commented 8 years ago

Sorry, I was indeed able to reproduce the problem. I solved it by escaping the comments. The escaping is done with:

str_comment = str_comment.encode('ascii', 'xmlcharrefreplace')

Now it seems the Java parser does not complain…

(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ cat test0.naf | run_parser.sh > test1.naf
Calling to Alpino at /home/izquierdo/tools/Alpino/ with 1 sentences...
hdrug: process 27568 on host kyoto (datime(2016,5,19,19,48,28))
[Ik 😳 Alpino !]
Q#1|Ik 😳 Alpino !|1|1|1.0051390171199999
Processing file /tmp/tmpoqmBIF/1.xml
Creating the term layer...
Creating the constituency layer...
Creating the dependency layer...
hdrug: process 27573 on host kyoto (datime(2016,5,19,19,48,29))
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$

(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ tail -5 test1.naf

(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ cat test1.naf | ~/newsreader_repos/ixa-pipe-nerc/run_nl.sh > test2.naf
CLI options: Namespace(lexer=off, params=/home/izquierdo/newsreader_repos/ixa-pipe-nerc/nerc-resources/nl/nl-local-conll02-testa.prop)
-> Token features added!: Window range 2:2
-> Token Class features added!: Window range 2:2
-> Outcome prior features added!
-> Previous map features added!
-> Sentence features added!
-> Prefix features added!
-> Suffix features added!
-> Bigram class features added!
-> Trigram class features added!
-> CharNgram features added!: Range 2:5
-> Token features added!: Window range 2:2
-> Token Class features added!: Window range 2:2
-> Outcome prior features added!
-> Previous map features added!
-> Sentence features added!
-> Prefix features added!
-> Suffix features added!
-> Bigram class features added!
-> Trigram class features added!
-> CharNgram features added!: Range 2:5
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$

(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ wc -l test2.naf
118 test2.naf

(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ grep -A 10 "<ent" test2.naf
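For readers on Python 3, the escaping call quoted in this comment can be sketched as follows. The thread itself uses Python 2.7, where `encode` on a `unicode` object returns a `str`. In Python 3 it returns `bytes`, so this sketch decodes the result back to text. The helper name `escape_comment` is hypothetical and is not taken from the parser's source.

```python
def escape_comment(text):
    """Replace every non-ASCII character with a numeric character reference.

    Uses the built-in 'xmlcharrefreplace' error handler, as in the thread,
    then decodes the ASCII bytes back to a str (needed on Python 3 only).
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Same test sentence as in the session log: the emoji is U+1F633.
    print(escape_comment("Ik \U0001F633 Alpino !"))  # Ik &#128563; Alpino !
```

Note that XML parsers do not expand character references inside comments, so a reader of the output file sees the literal text `&#128563;` rather than the emoji. That should be fine when the comments only serve for tracing.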
