LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

"empty" textelements may have bogus offsets #53

Closed kosloot closed 11 months ago

kosloot commented 1 year ago

Somewhat related to https://github.com/LanguageMachines/libfolia/issues/52 and https://github.com/proycon/folia/issues/107:

The library accepts bogus offset values for empty text elements. Input:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="bugxx" generator="libfolia-v1.11" version="2.5">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <division-annotation/>
      <paragraph-annotation/>
      <sentence-annotation/>
      <token-annotation/>
      <hyphenation-annotation/>
      <string-annotation/>
    </annotations>
  </metadata>
  <text xml:id="bug">
    <div xml:id="bug.div">
      <p xml:id="bug.div.p">
        <s xml:id="bug.div.p.s.1">
          <t>appel<t-hbr>-</t-hbr>taart</t>
          <str xml:id="bug.div.p.s.1.str.1">
            <t offset="0">appel</t>
          </str>
          <str xml:id="bug.div.p.s.1.str.2">
            <t offset="766666665"><t-hbr>-</t-hbr></t>
          </str>
          <str xml:id="bug.div.p.s.1.str.3">
            <t offset="5">taart</t>
          </str>
        </s>
      </p>
    </div>
  </text>
</FoLiA>

The offset of str.2 is far out of range, yet the library accepts it. The value should be 5, or at least fall within the range [0, 9] (the valid character positions in the text of the sentence).
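
To make the expected behaviour concrete, here is a minimal sketch of the kind of check that would reject the input above. It is not the libfolia implementation: `offset_is_valid` is a hypothetical helper, and it assumes the parent and child texts are already available as sequences of Unicode code points, with the hyphenation break contributing no characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of the libfolia API.
// parent: text of the enclosing element, one element per Unicode code point
// child:  text of the nested element (may be empty, e.g. only a <t-hbr>)
// offset: value of the nested element's offset attribute
bool offset_is_valid(const std::u32string& parent,
                     const std::u32string& child,
                     std::size_t offset) {
    // An offset must point at an existing character of the parent text,
    // even when the child text is empty. The only exception allowed here
    // is offset 0 for an empty child inside an empty parent.
    if (offset >= parent.size()) {
        return parent.empty() && child.empty() && offset == 0;
    }
    // The child text must fit between the offset and the end of the parent.
    if (child.size() > parent.size() - offset) {
        return false;
    }
    // The parent text at that position must match the child text.
    // For an empty child this comparison is trivially true.
    return parent.compare(offset, child.size(), child) == 0;
}
```

With the sentence text taken as `U"appeltaart"`, this accepts offset 0 for "appel", offset 5 for "taart" and offset 5 for the empty str.2 text, and rejects 766666665. Whether an empty element should be pinned to exactly one position (5 here) rather than to any in-range position is a design choice this sketch leaves open.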

kosloot commented 1 year ago

Experimental code for libfolia is on GitHub now. It seems to work.