GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

DocumentStaxUtils ignores CDATA within TextWithNodes #7

Closed ianroberts closed 7 years ago

ianroberts commented 7 years ago

As reported on the gate-users mailing list, the GATE XML format parser silently discards any CDATA sections that fall within the TextWithNodes part of a GATE XML document.

When saving as GATE XML, the serialiser uses CDATA to represent any segment of text within the TextWithNodes, or any feature name or value, that contains more than a few less-than signs, as this is more compact and human-readable than escaping each one individually as &lt;. The parser handles CDATA correctly when reading feature names and values, but not when reading the TextWithNodes.
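To illustrate the parser side, here is a minimal StAX sketch, not the actual DocumentStaxUtils code or the committed fix. It assumes the text is gathered in an event loop; the class and method names (CdataTextDemo, readText) are hypothetical. Some StAX implementations report a CDATA section as its own CDATA event rather than as CHARACTERS, so a loop that only appends on CHARACTERS can lose that text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CdataTextDemo {

    /** Concatenates all text content of the document, including CDATA sections. */
    public static String readText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (xsr.hasNext()) {
                switch (xsr.next()) {
                    // Treat every kind of text event the same way. Handling
                    // only CHARACTERS would drop CDATA on implementations
                    // that report it as a separate event type.
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        text.append(xsr.getText());
                        break;
                    default:
                        // Start/end tags (e.g. Node elements) carry no text.
                        break;
                }
            }
            xsr.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Simplified stand-in for a TextWithNodes element.
        String xml = "<TextWithNodes>a<Node id=\"1\"/><![CDATA[<<<<]]>"
                + "<Node id=\"5\"/>b</TextWithNodes>";
        System.out.println(readText(xml));
    }
}
```

An alternative that may also work is setting XMLInputFactory.IS_COALESCING to true on the factory, which asks the parser to merge adjacent text and CDATA into single CHARACTERS events.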

This problem is not generally apparent, because the serialiser only uses CDATA when a single span of text contains many less-than signs. If you were to save an ANNIE-processed document containing a long run such as <<<<<<<<<<, each symbol would be a separate Token, so there would be an empty Node element between each pair and no single run would contain more than one <. However, if the same document were saved as GATE XML with minimal annotations (e.g. just human-annotated entities and no Tokens), it would hit this bug.
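The per-run decision described above can be sketched roughly as follows. This is an illustration only, not the GATE serialiser: the class name, method name and threshold value are all hypothetical, since the description only says "more than a few" less-than signs.

```java
public class CdataChoiceDemo {

    // Hypothetical threshold; the real value used by GATE is not stated here.
    public static final int MAX_ESCAPED_LT = 3;

    /** Encodes one run of text, i.e. the text between two adjacent Node elements. */
    public static String encodeRun(String run) {
        long lessThans = run.chars().filter(c -> c == '<').count();
        // CDATA cannot contain the sequence "]]>", so fall back to escaping then.
        if (lessThans > MAX_ESCAPED_LT && !run.contains("]]>")) {
            return "<![CDATA[" + run + "]]>";
        }
        // Escape & first so the entities produced below are not re-escaped.
        return run.replace("&", "&amp;")
                  .replace("<", "&lt;")
                  .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // One Token per symbol: every run holds a single '<', so no CDATA.
        System.out.println(encodeRun("<"));
        // No Tokens: the whole run of ten is one span, so CDATA is used.
        System.out.println(encodeRun("<<<<<<<<<<"));
    }
}
```

Under this sketch, the tokenised document never produces CDATA, while the minimally annotated one does, which is why only the latter exposes the parser bug.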

ianroberts commented 7 years ago

Fixed by ceebfce826459a32cb02581d4855dbab3eb472a9