LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

folialint produces invalid FoLiA out of dubious input #23

Closed kosloot closed 5 years ago

kosloot commented 5 years ago

related to https://github.com/proycon/flat/issues/138

Consider this file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="libfolia-v0.14" version="0.12.0">
  <metadata type="native">
    <annotations>
      <pos-annotation set="pos"/>
      <syntax-annotation set="syn"/>
    </annotations>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1">
        <t>Is@</t>
        <pos class="BEP" />
      </w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <w xml:id="s.1.su.w.1">
              <t>*exp*</t>
              <pos class="EX" />
            </w>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>

It contains a \<w> inside the \<su> node that is NOT present in the \<s> itself. That is a construction nobody had considered until now.

When running folialint on this file, an INVALID output is produced:

<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="libfolia-v1.20" version="0.12.0">
  <metadata type="native">
    <annotations>
      <pos-annotation set="pos"/>
      <syntax-annotation set="syn"/>
    </annotations>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1">
        <t>Is@</t>
        <pos class="BEP"/>
      </w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <wref id="s.1.su.w.1" t="*exp*"/>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>

A \<wref> is generated that points to a non-existent word!
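
The dangling reference can be detected independently of libfolia. The following is a minimal sketch using only Python's standard library, not folialint's or libfolia's actual logic. The function name `dangling_wrefs` and the trimmed-down `SAMPLE` document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree exposes xml:id

# Trimmed version of the invalid folialint output shown above (assumption: reduced for brevity).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test">
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Is@</t></w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <wref id="s.1.su.w.1" t="*exp*"/>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>"""


def dangling_wrefs(xml_text):
    """Return the ids used by <wref> elements that no <w> in the document declares."""
    root = ET.fromstring(xml_text)
    word_ids = {w.get(XML_ID) for w in root.iter(f"{{{FOLIA_NS}}}w")}
    return [
        ref.get("id")
        for ref in root.iter(f"{{{FOLIA_NS}}}wref")
        if ref.get("id") not in word_ids
    ]


print(dangling_wrefs(SAMPLE))  # ['s.1.su.w.1']
```

An empty list means every \<wref> resolves to a declared \<w>.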

Desired behavior:

proycon commented 5 years ago

also cross-referencing proycon/folia#58

kosloot commented 5 years ago

I added a patch to libfolia to output the \<w> AS IS when it is the only occurrence (so there is NO reference to the same \<w> elsewhere).

kosloot commented 5 years ago

I also ran foliavalidator on this file. It rejects it:

Error on line 19: Element su has extra content: w
Error on line 0: Extra element su in interleave
Error on line 17: Element su failed to validate content
Error on line 17: Element syntax failed to validate content
Error on line 0: Extra element syntax in interleave
Error on line 3: Element FoLiA failed to validate content
VALIDATION ERROR against RelaxNG schema (stage 1/2), in tests/scary.xml
Element su has extra content: w, line 19

kosloot commented 5 years ago

Maybe I'm stating the obvious, but text inside this kind of word is exempt from all text processing, which is probably exactly what you would want to see. In the example, the s.text() function should return just Is@, and not Is@ *exp*.
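
To illustrate the intended text behavior, here is a minimal sketch in plain Python. It is not libfolia's `text()` implementation. The helper `sentence_text` is hypothetical and simply joins the \<t> content of the \<w> elements that are direct children of the \<s>, so words nested under \<syntax> do not contribute.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Trimmed version of the input file from this issue (assumption: <pos> elements left out).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test">
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Is@</t></w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <w xml:id="s.1.su.w.1"><t>*exp*</t></w>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>"""


def sentence_text(s_elem):
    """Join the text of <w> elements that are direct children of the sentence."""
    words = s_elem.findall(f"{{{FOLIA_NS}}}w")  # direct children only, not descendants
    return " ".join(w.findtext(f"{{{FOLIA_NS}}}t", default="") for w in words)


root = ET.fromstring(SAMPLE)
sentence = root.find(f".//{{{FOLIA_NS}}}s")
print(sentence_text(sentence))  # Is@
```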

kosloot commented 5 years ago

I modified libfolia to also reject this kind of construction:

failed: XML error: connecting a <w> to an <su> is forbidden, use <wref>

I think it is correct now.
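
The rule behind that error message can be sketched outside libfolia as well. The snippet below is an illustrative assumption, not the patch itself: the hypothetical function `words_inside_su` lists every \<w> that appears as a direct child of an \<su>, which is the construction that is now rejected.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree exposes xml:id

# Trimmed version of the input file from this issue (assumption: <pos> elements left out).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test">
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Is@</t></w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <w xml:id="s.1.su.w.1"><t>*exp*</t></w>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>"""


def words_inside_su(xml_text):
    """Return (su id, w id) pairs for every <w> placed directly inside an <su>."""
    root = ET.fromstring(xml_text)
    return [
        (su.get(XML_ID), w.get(XML_ID))
        for su in root.iter(f"{{{FOLIA_NS}}}su")
        for w in su.findall(f"{{{FOLIA_NS}}}w")  # direct children of this <su>
    ]


print(words_inside_su(SAMPLE))  # [('s.1.su.2', 's.1.su.w.1')]
```

A document that uses \<wref> inside \<su> instead yields an empty list.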

proycon commented 5 years ago

Agreed