Closed vanatteveldt closed 8 years ago
I cannot reproduce the error. It works OK for me…
Ruben Izquierdo Bevia
Vrije University of Amsterdam
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com/
On 19 May 2016, at 11:36, Wouter van Atteveldt <notifications@github.com> wrote:
The parser creates comments for the relations to make it easier to trace them. If these comments contain unusual unicode, however, the Java parser chokes on them (see https://bugs.openjdk.java.net/browse/JDK-8072081)
Although this is not strictly a problem caused by the parser (as it is a java bug triggered by the IXA NERC module) I think the easiest solution is to strip or escape "strange" unicode characters in the parser step.
Session and example files: https://gist.github.com/vanatteveldt/6492fc3b97ba6f2a87c81462c71fe8a2
View it on GitHub: https://github.com/cltl/morphosyntactic_parser_nl/issues/9
Do you mean that the java call to ixa-pipe-nerc works fine?
What is your java version?
$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
Sorry, I was indeed able to reproduce the problem. I solved it by escaping the comments. The escaping is done with:
str_comment = str_comment.encode('ascii', 'xmlcharrefreplace')
Now it seems the java parser does not complain…
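For anyone reading along, here is a minimal sketch of what that escaping does to a comment string. It assumes Python 3, where `encode` returns bytes, so a `decode` step is added that the Python 2.7 one-liner above does not have. The helper name `escape_comment` is made up for illustration and is not taken from the parser code.

```python
def escape_comment(text):
    """Replace every non-ASCII character with an XML numeric character reference."""
    # 'xmlcharrefreplace' turns each character that ASCII cannot encode
    # into a reference of the form &#NNNN; (decimal code point).
    # encode() returns bytes in Python 3, so decode back to str.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# The test sentence from this thread; \U0001F633 is the emoji in "Ik 😳 Alpino !"
escaped = escape_comment(u'Ik \U0001F633 Alpino !')
print(escaped)  # Ik &#128563; Alpino !
```

Since the result is pure ASCII, the comment no longer contains the supplementary-plane character that triggers the Java bug. Inside an XML comment the reference stays literal text, so the comment is still traceable, just less pretty.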
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ cat test0.naf | run_parser.sh > test1.naf
Calling to Alpino at /home/izquierdo/tools/Alpino/ with 1 sentences...
hdrug: process 27568 on host kyoto (datime(2016,5,19,19,48,28))
[Ik 😳 Alpino !]
Q#1|Ik 😳 Alpino !|1|1|1.0051390171199999
Processing file /tmp/tmpoqmBIF/1.xml
Creating the term layer...
Creating the constituency layer...
Creating the dependency layer...
hdrug: process 27573 on host kyoto (datime(2016,5,19,19,48,29))
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ tail -5 test1.naf
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ cat test1.naf | ~/newsreader_repos/ixa-pipe-nerc/run_nl.sh > test2.naf
CLI options: Namespace(lexer=off, params=/home/izquierdo/newsreader_repos/ixa-pipe-nerc/nerc-resources/nl/nl-local-conll02-testa.prop)
-> Token features added!: Window range 2:2
-> Token Class features added!: Window range 2:2
-> Outcome prior features added!
-> Previous map features added!
-> Sentence features added!
-> Prefix features added!
-> Suffix features added!
-> Bigram class features added!
-> Trigram class features added!
-> CharNgram features added!: Range 2:5
-> Token features added!: Window range 2:2
-> Token Class features added!: Window range 2:2
-> Outcome prior features added!
-> Previous map features added!
-> Sentence features added!
-> Prefix features added!
-> Suffix features added!
-> Bigram class features added!
-> Trigram class features added!
-> CharNgram features added!: Range 2:5
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ wc -l test2.naf
118 test2.naf
(python2.7)izquierdo@kyoto:~/cltl_repos/morphosyntactic_parser_nl$ grep -A 10 "<ent" test2.naf