As reported on the gate-users mailing list, the GATE XML format parser silently discards any CDATA sections that fall within the TextWithNodes part of a GATE XML document.
When saving as GATE XML, the serialiser uses CDATA to represent any segment of text within the TextWithNodes, or any feature name or value, that contains more than a few less-than signs, as this is more compact and human-readable than escaping each one individually as &lt;. The parser handles CDATA correctly when reading feature names and values, but not when reading the TextWithNodes.
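To illustrate the trade-off the serialiser is making, here is a minimal Python sketch of that kind of decision. It is not GATE's code: the function name `serialise_span` and the value of `CDATA_THRESHOLD` are hypothetical, since the report only says "more than a few" less-than signs.

```python
from xml.sax.saxutils import escape

# Hypothetical cut-off; the actual value used by the GATE serialiser is not
# stated in this report.
CDATA_THRESHOLD = 3


def serialise_span(text, threshold=CDATA_THRESHOLD):
    """Return `text` as XML character data, using CDATA when it has many '<'."""
    if text.count("<") > threshold:
        # A literal "]]>" would end the CDATA section early, so split it
        # across two adjacent sections.
        safe = text.replace("]]>", "]]]]><![CDATA[>")
        return "<![CDATA[" + safe + "]]>"
    # Few or no less-than signs: escape &, < and > individually.
    return escape(text)


few = serialise_span("a < b")
many = serialise_span("<<<<<<<<<<")
escaped_many = escape("<<<<<<<<<<")
```

For the run of ten less-than signs, the CDATA form is 22 characters long, against 40 characters when each sign is written as &lt;.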
This problem is not generally apparent, as the serialiser only uses CDATA when there are lots of less-than signs within a single span of text. If you were to save an ANNIE-processed document with a big run of <<<<<<<<<<, each symbol would be a separate Token, and thus there would be empty Node elements between each pair and no single run would have more than one <. However, if the same document were saved as GATE XML with minimal annotations (e.g. just human-annotated entities and no Tokens), then it would hit this bug.
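For reference, the expected behaviour can be sketched with Python's standard SAX API, which delivers CDATA content through the same `characters()` callback as ordinary text. This is only an illustration, not GATE's parser: the handler class, the `extract_text` helper and the sample document are all made up for this example, and the sample is a stripped-down document with no features or annotation sets.

```python
import xml.sax


class TextWithNodesHandler(xml.sax.ContentHandler):
    """Collects the text inside <TextWithNodes>, including CDATA content."""

    def __init__(self):
        super().__init__()
        self.in_text = False
        self.chunks = []
        self.node_offsets = {}  # Node id -> character offset in the text

    def startElement(self, name, attrs):
        if name == "TextWithNodes":
            self.in_text = True
        elif name == "Node" and self.in_text:
            # The offset of a Node is the length of the text seen so far.
            self.node_offsets[attrs["id"]] = sum(len(c) for c in self.chunks)

    def endElement(self, name):
        if name == "TextWithNodes":
            self.in_text = False

    def characters(self, content):
        # Called for both plain text and the content of CDATA sections.
        if self.in_text:
            self.chunks.append(content)


def extract_text(xml_bytes):
    """Return (document text, node offsets) for a GATE-XML-like document."""
    handler = TextWithNodesHandler()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks), handler.node_offsets


# Hypothetical minimal document: one CDATA run, no Token annotations.
SAMPLE = (
    b'<GateDocument><TextWithNodes><Node id="0"/>run '
    b'<![CDATA[<<<<<<<<<<]]> end<Node id="18"/></TextWithNodes></GateDocument>'
)

text, offsets = extract_text(SAMPLE)
```

Here `text` keeps all ten less-than signs and the second Node lands at offset 18. A parser that drops the CDATA section would yield a shorter text, so that Node's offset would no longer match the document content.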