cltl / morphosyntactic_parser_nl

Morphosyntactic parser for Dutch based on the Alpino parser
Apache License 2.0

pipe symbols (|) not escaped in input #10

Closed vanatteveldt closed 8 years ago

vanatteveldt commented 8 years ago

Sentences containing a pipe symbol / vertical bar (|) are not processed correctly. Alpino uses this character to mark line ids, so if a sentence contains a pipe, the left-hand side is treated as an id. This causes 1.xml to not be found, and the parse is not included in the output:

(newsreader-env)wva@study-linux: {master} ~/newsreader_pipe_nl$ echo "Hallo daar| doeg" |  java -jar $MDIR/ixa-pipe-tok/target/ixa-pipe-tok-1.8.4.jar tok -l nl | $MDIR/morphosyntactic_parser_nl/run_parser.sh
CLI options: Namespace(normalize=default, notok=false, inputkaf=false, offsets=true, outputFormat=naf, hardParagraph=no, untokenizable=no, lang=nl, kafversion=v1.naf)
ixa-pipe-tok tokenized 4 tokens at 1908.31 tokens per second.
Calling to Alpino at /data/wva/newsreader_pipe_nl/tools/Alpino with 1 sentences...
hdrug: process 14878 on host study-linux (datime(2016,7,17,13,34,17))
[doeg]
Q#Hallo daar \|doeg|1|1|0.749662063
Not found the file /tmp/tmpOfK5Ti/1.xml

This results in the following output (without terms):

<?xml version='1.0' encoding='UTF-8'?>
<NAF xml:lang="nl" version="v1.naf">
  <nafHeader>
    <linguisticProcessors layer="text">
      <lp name="ixa-pipe-tok-nl" beginTimestamp="2016-07-17T13:34:17+0200" endTimestamp="2016-07-17T13:34:17+0200" version="1.8.4-9bb9cddd179cbd489b085776417cd8f1b8a4b10a" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="terms">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="constituents">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="deps">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
  </nafHeader>
  <text>
    <wf id="w1" offset="0" length="5" sent="1" para="1">Hallo</wf>
    <wf id="w2" offset="6" length="4" sent="1" para="1">daar</wf>
    <wf id="w3" offset="10" length="1" sent="1" para="1">|</wf>
    <wf id="w4" offset="12" length="4" sent="1" para="1">doeg</wf>
  </text>
</NAF>

Note that this is not treated as an error condition: the "Not found the file" message does not raise an exception (which it probably should?), and the exit status is 0:

$ echo $?
0
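
One way the missing parse file could be turned into a hard failure is sketched below. This is only an illustration, not the wrapper's actual code: `MissingParseError`, `expected_parse_paths` and `check_parses` are hypothetical names, and the only thing taken from the log above is that Alpino's output for sentence n is looked up as `<n>.xml` in a temporary directory.

```python
import os


class MissingParseError(RuntimeError):
    """Hypothetical exception for sentences that Alpino produced no XML for."""


def expected_parse_paths(out_dir, num_sentences):
    # Assumption (from the log above): the parse of sentence n is read from <out_dir>/<n>.xml
    return [os.path.join(out_dir, "{}.xml".format(n))
            for n in range(1, num_sentences + 1)]


def check_parses(out_dir, num_sentences):
    """Raise instead of silently skipping sentences without a parse file."""
    missing = [path for path in expected_parse_paths(out_dir, num_sentences)
               if not os.path.isfile(path)]
    if missing:
        raise MissingParseError("Not found: " + ", ".join(missing))
```

If such an exception were left uncaught at the top level of the script, the Python process would exit with a non-zero status, so a shell pipeline could detect the failure.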

(this seems to be the root cause of https://github.com/ixa-ehu/ixa-pipe-nerc/issues/11)

rubenIzquierdo commented 8 years ago

Fixed by automatically generating the sentence identifiers ("1|text text text"). Alpino only gives special meaning to the first occurrence of the symbol "|", so any further pipe symbols in the text no longer cause problems.
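
The idea behind the fix can be sketched roughly as follows. This is a minimal illustration, not the code that was committed: `to_alpino_lines` and `split_alpino_line` are hypothetical helpers, and the second one only mimics the "first pipe wins" behaviour described above rather than reproducing Alpino's own parsing.

```python
def to_alpino_lines(sentences):
    """Prefix every tokenized sentence with an explicit id: '<n>|<tokens>'."""
    lines = []
    for number, tokens in enumerate(sentences, start=1):
        lines.append("{}|{}".format(number, " ".join(tokens)))
    return lines


def split_alpino_line(line):
    """Split on the FIRST pipe only, as the comment above describes for Alpino."""
    sent_id, _, text = line.partition("|")
    return sent_id, text


lines = to_alpino_lines([["Hallo", "daar", "|", "doeg"]])
print(lines[0])                     # 1|Hallo daar | doeg
print(split_alpino_line(lines[0]))  # ('1', 'Hallo daar | doeg')
```

Because the generated id always comes first, a pipe inside the sentence ends up in the text part instead of being mistaken for the id separator.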