LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
https://proycon.github.io/folia
GNU General Public License v3.0

FoLiA-correct: resolve HEMPs using FoLiA::Correction #47

Open kosloot opened 4 years ago

kosloot commented 4 years ago

This came up after issue #45.

When resolving a HEMP, FoLiA-correct just adds the resolved text to one of the string/word nodes. I assume that using a real Correction would be better.

For example:

    <p xml:id="mwsel.p.1">
      <t class="OCR">•c c•</t>
      <str xml:id="mwsel.p.1.str.1">
        <t class="OCR">•c</t>
      </str>
      <str xml:id="mwsel.p.1.str.2">
        <t class="OCR">c•</t>
      </str>
    </p>

Assuming •c c• is listed in the PUNCT file as •c c• cc, this HEMP is currently resolved as:

    <p xml:id="mwsel.p.1">
      <t>cc</t>
      <t class="OCR">•c c•</t>
      <str xml:id="mwsel.p.1.str.1">
        <t class="OCR">•c</t>
      </str>
      <str xml:id="mwsel.p.1.str.2">
        <t offset="0">cc</t>
        <t class="OCR">c•</t>
      </str>
    </p>

IMHO a much better solution would be:

    <p xml:id="mwsel.p.1">
      <t>cc</t>
      <t class="OCR">•c c•</t>
      <correction xml:id="mwsel.p.1.correction.1">
        <new>
          <str xml:id="mwsel.p.1.str.edit.1">
          <t>cc</t>
          </str>
        </new>
        <original>
          <str xml:id="mwsel.p.1.str.1">
            <t class="OCR">•c</t>
          </str>
          <str xml:id="mwsel.p.1.str.2">
            <t class="OCR">c•</t>
          </str>
        </original>
      </correction>
    </p>
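
The proposed restructuring can be sketched in a few lines of Python. This is only an illustration of the target structure using the standard library's `xml.etree.ElementTree`, not how FoLiA-correct or libfolia implement it. The helper name `resolve_hemp` is hypothetical, the id scheme is copied from the example above, and the fragment is namespace-free as in this issue (a real FoLiA document would carry the FoLiA namespace).

```python
import xml.etree.ElementTree as ET

# The unresolved paragraph from the example above.
SOURCE = """<p xml:id="mwsel.p.1">
  <t class="OCR">•c c•</t>
  <str xml:id="mwsel.p.1.str.1">
    <t class="OCR">•c</t>
  </str>
  <str xml:id="mwsel.p.1.str.2">
    <t class="OCR">c•</t>
  </str>
</p>"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_hemp(p, resolved_text):
    """Hypothetical helper: wrap all <str> children of p in one <correction>."""
    parent_id = p.get(XML_ID)
    parts = p.findall("str")
    position = list(p).index(parts[0])  # where the first <str> used to be

    correction = ET.Element("correction", {XML_ID: f"{parent_id}.correction.1"})

    # <new> holds a single merged <str> with the resolved text.
    new = ET.SubElement(correction, "new")
    merged = ET.SubElement(new, "str", {XML_ID: f"{parent_id}.str.edit.1"})
    ET.SubElement(merged, "t").text = resolved_text

    # <original> keeps the untouched OCR parts, in document order.
    original = ET.SubElement(correction, "original")
    for part in parts:
        p.remove(part)
        original.append(part)
    p.insert(position, correction)

    # Paragraph-level text without a class attribute, placed first.
    text = ET.Element("t")
    text.text = resolved_text
    p.insert(0, text)
    return p


if __name__ == "__main__":
    paragraph = resolve_hemp(ET.fromstring(SOURCE), "cc")
    print(ET.tostring(paragraph, encoding="unicode"))
```

The point of the sketch is that the original `<str>` nodes are moved, not modified, so the OCR text stays recoverable from `<original>` while the resolved text lives in `<new>`.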

An interesting point: HEMP resolution is done before the other corrections. I assume that a real correction using the resulting cc will then not be performed.