LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

incorrect extraction of deep text from a document with corrections #49

Open kosloot opened 1 year ago

kosloot commented 1 year ago

The text() extraction function fails to extract the correct text from a sentence whose last Word is a Correction, when that sentence is followed by another sentence. This came up in: https://github.com/LanguageMachines/foliautils/issues/66

When the last element is truly a Word, a space separator is added and everything is fine. But in the case of a Correction the space is omitted, gluing the text of the two sentences together. Example (rather braindead, but it proves the point):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-06T12:10:53" command="ucto -X -L deu --textredundancy=full --id Walter bug.in bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-06T12:11:06" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="Walter.text">
    <p xml:id="Walter.p.1">
      <t>chat... Von</t>
      <s xml:id="Walter.p.1.s.1">
        <t>chat...</t>
        <w xml:id="Walter.p.1.s.1.w.1" class="WORD" processor="ucto.1" space="no">
          <t>chat</t>
        </w>
        <correction xml:id="Walter.p.1.s.1.correction.1">
          <new>
            <w xml:id="Walter.p.1.s.1.w.3.edit.1" processor="FoLiA-correct.1">
              <t>...</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="Walter.p.1.s.1.w.3" class="PUNCTUATION-MULTI" processor="ucto.1">
              <t>...</t>
            </w>
          </original>
        </correction>
      </s>
      <s xml:id="Walter.p.1.s.2">
        <t>Von</t>
        <w xml:id="Walter.p.1.s.2.w.1" class="WORD" processor="ucto.1">
          <t>Von</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

When parsing this file with folialint:

bug.xml failed: inconsistent text: node p(Walter.p.1) has a mismatch for the text in set:current
the element text ='chat... Von'
 the deeper text ='chat...Von'
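The symptom can be reproduced in a small standalone sketch. This is not libfolia code: it is a hypothetical Python model (the names `words`, `sentence_text`, `paragraph_text` and the `buggy` flag are made up for illustration) that walks a trimmed copy of the example above, and it assumes the separator decision is taken from the last direct child of each sentence.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Trimmed copy of the example: ids, classes and <t> on <p>/<s> are left out.
FRAGMENT = """
<p xmlns="http://ilk.uvt.nl/folia">
  <s>
    <w space="no"><t>chat</t></w>
    <correction>
      <new><w><t>...</t></w></new>
      <original auth="no"><w><t>...</t></w></original>
    </correction>
  </s>
  <s>
    <w><t>Von</t></w>
  </s>
</p>
"""

def words(elem):
    """Yield <w> elements in document order, descending into <correction>
    and <new> but never into <original>."""
    for child in elem:
        if child.tag == NS + "w":
            yield child
        elif child.tag in (NS + "correction", NS + "new"):
            yield from words(child)

def sentence_text(s):
    """Join the word texts of one sentence, honouring space="no"."""
    parts = []
    for w in words(s):
        t = w.find(NS + "t")
        if t is None or t.text is None:
            continue
        parts.append(t.text)
        if w.get("space", "yes") != "no":
            parts.append(" ")
    return "".join(parts).strip()

def paragraph_text(p, buggy):
    """Join sentence texts. With buggy=True (an assumed model of the symptom)
    a separator is only added when the sentence ends in a plain <w>."""
    out = ""
    for s in p.findall(NS + "s"):
        out += sentence_text(s)
        ends_in_word = list(s)[-1].tag == NS + "w"
        if ends_in_word or not buggy:
            out += " "
    return out.strip()

p = ET.fromstring(FRAGMENT)
print(repr(paragraph_text(p, buggy=True)))   # the glued 'deeper text'
print(repr(paragraph_text(p, buggy=False)))  # matches the <t> on <p>
```

With `buggy=True` the two sentences are glued together as in the folialint message; with `buggy=False` the result equals the text stored on the paragraph.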
proycon commented 1 year ago

I indeed get something similar with foliapy:

$ foliavalidator issue49.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in issue49.folia.xml
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for <Sentence at 140195206213216 id=Walter.p.1.s.1 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
chat
****> BUT FOUND (strict text after normalization) ****>
chat...
******* DEVIATION POINT: <*HERE*>chat...
(also checked against older rules prior to FoLiA v2.4.1)
proycon commented 1 year ago

The foliapy error is correct though:

Seems a bit different from the error in libfolia.

kosloot commented 1 year ago

OK, my bad. I corrected the example to have the same ellipsis in <new> (silly, but OK). folialint still gives the same error, but foliavalidator validates it.

kosloot commented 1 year ago

A somewhat shaky solution is committed now. Needs testing.

kosloot commented 1 year ago

This fix enables @martinreynaert to run his corrections, but it also AGAIN shows a difference of opinion between libfolia and FoLiaPY.

Running FoLiA-correct on only the first part of the title of the text file already reveals this. The produced FoLiA is rejected by foliavalidator, but folialint is satisfied. The latter is wrong, imnsho.

The test file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="bug" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-08T08:48:33" command="ucto -X -L deu --textredundancy=full --id bug bug.txt bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-08T08:49:07" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="bug.text">
    <p xml:id="bug.p.1">
      <t>Walter Muschg Freud</t>
      <t class="Ticcl">Walter musch Freud</t>
      <s xml:id="bug.p.1.s.1">
        <t>Walter Muschg Freud</t>
        <t class="Ticcl">Walter musch Freud</t>
        <w xml:id="bug.p.1.s.1.w.1" class="WORD" processor="ucto.1">
          <t>Walter</t>
          <t class="Ticcl" offset="0">Walter</t>
        </w>
        <correction xml:id="bug.p.1.s.1.correction.1">
          <new>
            <w xml:id="bug.p.1.s.1.w.2.edit.1" processor="FoLiA-correct.1">
              <t class="Ticcl" offset="7">musch</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="bug.p.1.s.1.w.2" class="WORD" processor="ucto.1">
              <t>Muschg</t>
            </w>
          </original>
        </correction>
        <w xml:id="bug.p.1.s.1.w.3" class="WORD" processor="ucto.1">
          <t>Freud</t>
          <t class="Ticcl" offset="13">Freud</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

$ folialint --nooutput bug.ticcl.folia.xml
Validated successfully: bug.ticcl.folia.xml

$ foliavalidator bug.ticcl.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in bug.ticcl.folia.xml
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent @ parent line 34) : [InconsistentText] Text for <Sentence at 140336671748544 id=bug.p.1.s.1 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *>
Walter Freud
> BUT FOUND (strict text after normalization) >
Walter Muschg Freud
*** DEVIATION POINT: Walter <HERE>Muschg Fre
(also checked against older rules prior to FoLiA v2.4.1)
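To make the foliavalidator complaint concrete, here is a rough sketch of a per-class deep-text check on a trimmed copy of the sentence above. It is neither foliapy nor libfolia code: the helper names (`text_of`, `authoritative_words`, `deep_text`) are hypothetical, and it assumes that a `<t>` without a class attribute belongs to class "current" and that words below `<original>` do not contribute to the deep text.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Trimmed copy of sentence bug.p.1.s.1: ids and processors are left out.
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia">
  <t>Walter Muschg Freud</t>
  <t class="Ticcl">Walter musch Freud</t>
  <w><t>Walter</t><t class="Ticcl" offset="0">Walter</t></w>
  <correction>
    <new><w><t class="Ticcl" offset="7">musch</t></w></new>
    <original auth="no"><w><t>Muschg</t></w></original>
  </correction>
  <w><t>Freud</t><t class="Ticcl" offset="13">Freud</t></w>
</s>
"""

def text_of(elem, cls):
    """Return the text of the direct <t> child in class cls, or None.
    Assumption: a <t> without a class attribute is in class 'current'."""
    for t in elem.findall(NS + "t"):
        if t.get("class", "current") == cls:
            return t.text
    return None

def authoritative_words(elem):
    """Yield <w> elements, following <correction>/<new> and skipping <original>."""
    for child in elem:
        if child.tag == NS + "w":
            yield child
        elif child.tag in (NS + "correction", NS + "new"):
            yield from authoritative_words(child)

def deep_text(elem, cls):
    """Join the texts in class cls of all authoritative words; words that
    carry no text in that class simply drop out."""
    found = (text_of(w, cls) for w in authoritative_words(elem))
    return " ".join(t for t in found if t is not None)

s = ET.fromstring(SENTENCE)
for cls in ("current", "Ticcl"):
    print(cls, "| strict:", text_of(s, cls), "| deep:", deep_text(s, cls))
```

Under these assumptions the "Ticcl" class is consistent, while for "current" the word in `<new>` has no text at all, so the deep text loses "Muschg" and no longer matches the `<t>` on the sentence. That is the same mismatch foliavalidator reports.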

@proycon I remember that issues like this have been discussed before, like in https://github.com/proycon/folia/issues/98 and https://github.com/proycon/folia/issues/75

But the argument has not been settled, it seems. And I agree that it is a difficult problem.

kosloot commented 1 year ago

Note: this is also related to https://github.com/proycon/folia/issues/100, which is unfortunately deemed Low Priority.