LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

How to handle DOCTYPE and entities #24

Closed kosloot closed 5 years ago

kosloot commented 5 years ago

At the moment, processing of FoLiA documents with a !DOCTYPE seems risky:

Given this file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="0.8" generator="libfolia-v0.4">
  <metadata>
    <annotations/>
  </metadata>
  <text xml:id="WR-P-E-J-0000000001.text">
    <s xml:id="WR-P-E-J-0000000001.head.1.s.1">
      <t>Dit is als het ware é&eacute;n test.</t>
    </s>
  </text>
</FoLiA>

The XML parser (for instance in folialint) chokes: XML-error: Entity 'eacute' not defined

foliavalidator says:

Malformed XML!

This is as such NOT an ERROR. xmllint says the same:

folia.xml:5: parser error : Entity 'eacute' not defined
      <t>Dit is als het ware é&eacute;n test.</t>
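The failure is not specific to FoLiA tooling and can be reproduced with any conforming XML parser. A minimal sketch, using Python's standard `xml.etree.ElementTree` rather than libfolia or folialint; the document is a trimmed copy of the example above, and `is_well_formed` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above: no DOCTYPE, but &eacute; is used.
DOC_WITHOUT_DOCTYPE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test">\n'
    '  <text xml:id="test.text">\n'
    '    <s xml:id="test.s.1">\n'
    '      <t>Dit is als het ware é&eacute;n test.</t>\n'
    '    </s>\n'
    '  </text>\n'
    '</FoLiA>\n'
).encode("utf-8")


def is_well_formed(data: bytes) -> bool:
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# The undeclared entity makes the parser reject the whole document.
print(is_well_formed(DOC_WITHOUT_DOCTYPE))
```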

This can be solved by adding a !DOCTYPE with an ENTITY declaration:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE test [
<!ENTITY eacute "é">
]>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="0.8" generator="libfolia-v0.4">
  <metadata>
    <annotations/>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="test.1.s.1">
      <t>Dit is als het ware é&eacute;n test.</t>
    </s>
  </text>
</FoLiA>

Both xmllint and folialint now accept this document.
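The same effect can be checked outside the FoLiA tools. A small sketch, assuming Python's standard `xml.etree.ElementTree` as a stand-in parser (it is not what folialint uses internally); the document is a trimmed copy of the one above:

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"

# Trimmed copy of the example above: the internal subset declares eacute.
DOC_WITH_DOCTYPE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE test [\n'
    '<!ENTITY eacute "é">\n'
    ']>\n'
    '<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test">\n'
    '  <text xml:id="test.text">\n'
    '    <s xml:id="test.s.1">\n'
    '      <t>Dit is als het ware é&eacute;n test.</t>\n'
    '    </s>\n'
    '  </text>\n'
    '</FoLiA>\n'
).encode("utf-8")

root = ET.fromstring(DOC_WITH_DOCTYPE)

# The parser expands the declared entity while building the tree.
sentence_text = root.find(f".//{FOLIA_NS}t").text
print(sentence_text)
```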

But: folialint ditches the !DOCTYPE producing:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="libfolia-v1.20" version="0.8">
  <metadata type="native">
    <annotations/>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="test.s.1">
      <t>Dit is als het ware één test.</t>
    </s>
  </text>
</FoLiA>

So the entity IS resolved, but leaving out the DOCTYPE might be a problem in the future. Don't know... @proycon any opinion on this?
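Dropping the DOCTYPE on output is common parser behaviour rather than something peculiar to folialint. A minimal sketch with Python's standard `xml.etree.ElementTree`, used here only as an illustration (the one-element document is a simplification, not a valid FoLiA file):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the FoLiA document: one element, one declared entity.
SOURCE = (
    '<!DOCTYPE test [\n'
    '<!ENTITY eacute "é">\n'
    ']>\n'
    '<t>Dit is als het ware é&eacute;n test.</t>'
)

root = ET.fromstring(SOURCE)

# Serializing the tree writes the expanded text; the DOCTYPE is not kept.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Because the entity is already expanded in the tree, the output no longer needs the declaration to be well-formed.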

proycon commented 5 years ago

> This is as such NOT an ERROR.

But xmllint seems to disagree when a DOCTYPE is absent? So these named character entities are not predefined by XML as such, unlike numeric character references, I assume.
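For reference, XML itself predefines only five named entities (`&lt;`, `&gt;`, `&amp;`, `&apos;`, `&quot;`); numeric character references such as `&#233;` need no declaration at all. A quick sketch with Python's standard `xml.etree.ElementTree` to illustrate the difference (`text_or_none` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET


def text_or_none(xml_source: str):
    """Hypothetical helper: the root element's text, or None on a parse error."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        return None


# Numeric character references (decimal and hexadecimal) need no DOCTYPE.
numeric = text_or_none("<t>&#233;&#xE9;n</t>")

# The five predefined entities are always available.
predefined = text_or_none("<t>a &amp; b &lt; c</t>")

# Any other named entity must be declared first, so this one fails.
named = text_or_none("<t>&eacute;&eacute;n</t>")

print(numeric, predefined, named)
```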

> But: folialint ditches the !DOCTYPE producing:

Looks good to me: it replaces the entities, which is preferred. It's a UTF-8 document after all, so we may just as well use it. I don't care much about preserving them (it's semantically the same anyway).

kosloot commented 5 years ago

To clarify: xmllint, folialint, and foliavalidator all require the DOCTYPE when such entities are used. This may come as a surprise to users (although we know that some users, e.g. the Nederlab project, DID add lots of entity definitions).

My concern was whether 'resolving' the DOCTYPE would also remove other relevant information. Probably not: xmllint does exactly the same.

Side note: foliavalidator could use a better error message; "Malformed XML!" is not informative.
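What a more informative message could look like is sketched below. This is only an illustration built on Python's standard `xml.etree.ElementTree` and `xml.parsers.expat`, not foliavalidator's actual code; `describe_xml_error` and the message layout are assumptions:

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def describe_xml_error(data, filename="<input>"):
    """Hypothetical helper: None if the input parses, else a one-line diagnosis."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position          # location reported by the parser
        reason = expat.ErrorString(err.code)  # expat's text for the error code
        return f"{filename}:{line}:{column}: not well-formed XML: {reason}"
    return None


message = describe_xml_error("<t>Dit is &eacute;&eacute;n test.</t>", "folia.xml")
print(message)
```

Reporting the file, position, and parser reason points the user straight at the undeclared entity instead of leaving them with a bare "Malformed XML!".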

kosloot commented 5 years ago

Closing this, hoping @proycon improved the warning.