LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

Apply space attribute more generically to multiple structure elements #34

Closed proycon closed 4 years ago

proycon commented 5 years ago

See proycon/folia#61

kosloot commented 5 years ago

I would need an example or use case to test.

kosloot commented 5 years ago

The example given in proycon/folia#61 is:

<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="1.5" >
  <metadata>
      <annotations>
      </annotations>
  </metadata>
  <text xml:id="t1">
    <p xml:id="p1">
      <t>Een test</t>
      <part xml:id="part1">
        <t>Een</t>
      </part>
      <part xml:id="part2">
        <t>test</t>
      </part>
    </p>
  </text>
</FoLiA>

So my simple question is: HOW do we fix this? BTW:

kosloot commented 5 years ago

Ok, let me explain this with an adapted example:

<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="2.0" >
  <metadata>
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <part-annotation set="bla"/>
      <paragraph-annotation set="bla"/>
    </annotations>
  </metadata>
  <text xml:id="t1">
    <p xml:id="p1">
      <t>Een test</t>
      <part xml:id="part1">
        <t>Een</t>
      </part>
      <part xml:id="part2">
        <t>test</t>
      </part>
    </p>
  </text>
</FoLiA>

Both folialint AND foliavalidator reject this:

> /home/sloot/usr/local/bin/folialint spacy.xml
spacy.xml failed: inconsistent text: node p(p1) has a mismatch for the text in set:current
the element text ='Een test'
 the deeper text ='Eentest'
> foliavalidator spacy.xml --output -t
VALIDATION ERROR on full parse by library (stage 2/2), in spacy.xml
InconsistentText: Text for Paragraph, ID p1, class current, is inconsistent: EXPECTED (after normalization) *****>
Eentest
****> BUT FOUND (after normalization) ****>
Een test
******* DEVIATION POINT: Een<*HERE*> test

I would expect that, after the recent changes, 'part' has a default space="yes", but it seems that is not honoured...

kosloot commented 5 years ago

So I "fixed" libfolia to accept this (without breaking any other tests), as well as an even more convoluted example (attached as text file: spacys.txt).

proycon commented 5 years ago

Hmm right, I see the point:

> I would expect that after the recent changes, 'part' has a default space="yes", but that is not honoured it seems...

It seems the default text delimiter for <part> is indeed empty (i.e. no space), which is why issue proycon/folia#61 arose in the first place (otherwise it would have been valid). We may want to change this to have a space as a text delimiter, though I wonder if that breaks backward compatibility. I'll have to conduct some tests with older versions to see if perhaps we inadvertently changed the behaviour.
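
To make the effect of the delimiter concrete, here is a minimal sketch in Python using only the standard library. It is not how libfolia or the Python FoLiA library implement the check; the helper names `deeper_text` and `is_consistent` are hypothetical, and the document is a trimmed copy of the example above. It only shows that joining the `<part>` texts with an empty delimiter yields a string that cannot match the paragraph's own `<t>`, whereas a space delimiter does.

```python
import xml.etree.ElementTree as ET

# FoLiA namespace in ElementTree's {uri}tag notation.
NS = "{http://ilk.uvt.nl/folia}"

# Trimmed copy of the example document from this issue (metadata omitted).
DOC = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test" version="2.0">
  <text xml:id="t1">
    <p xml:id="p1">
      <t>Een test</t>
      <part xml:id="part1"><t>Een</t></part>
      <part xml:id="part2"><t>test</t></part>
    </p>
  </text>
</FoLiA>"""


def deeper_text(element, delimiter):
    """Join the <t> text of the direct <part> children with `delimiter`.

    Hypothetical helper: a stand-in for the 'deeper text' that the
    validators report, not the libraries' actual algorithm.
    """
    parts = [
        part.findtext(NS + "t", default="")
        for part in element.findall(NS + "part")
    ]
    return delimiter.join(parts)


def is_consistent(element, delimiter):
    """Compare the element's own <t> with the text assembled from its parts."""
    own = element.findtext(NS + "t", default="")
    return own == deeper_text(element, delimiter)


paragraph = ET.fromstring(DOC).find(f".//{NS}p")

print(repr(deeper_text(paragraph, "")))   # text assembled with an empty delimiter
print(is_consistent(paragraph, ""))       # empty delimiter, as <part> behaves now
print(is_consistent(paragraph, " "))      # space delimiter, as proposed
```

With an empty delimiter the assembled text is the same 'Eentest' that both validators report, so the comparison fails; with a space delimiter the example would be accepted as-is.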

kosloot commented 5 years ago

Fixed in libfolia. @proycon, I assume in Python too?