antlr / stringtemplate4

StringTemplate 4
http://www.stringtemplate.org

StringRenderer xml-encode leads to invalid XML #260

Closed tkalmar closed 4 years ago

tkalmar commented 4 years ago

StringRenderer xml-encode leads to invalid xml

For example, the string 🩳 leads to the XML entity &#55358;&#56947;, which according to the XML spec (https://www.w3.org/TR/REC-xml/#NT-Char) is not in the range of characters allowed in XML (note that the ranges differ between XML 1.0 and XML 1.1). This should either be:

I'm not sure whether, for a UTF-8 encoded XML document, encoding <, >, " and ' would be sufficient, with all other characters passed through unchanged.

parrt commented 4 years ago

Looks like it might be not hex-encoding the values.

tkalmar commented 4 years ago

No, it is encoding the value as a numeric character reference, but 🩳 in its encoded form (&#55358;&#56947;) is not valid XML (see the linked spec). I'm not sure whether it is legal in unencoded form in UTF-8 encoded XML files.

Clashsoft commented 4 years ago

The emoji is stored in the Java String as two chars because its code point lies above U+FFFF, the most a single 16-bit char can hold, hence the strange encoded form. The two chars are called a surrogate pair. The fix would be to loop over the code points of the string instead of its chars. I'll make a PR in a bit.
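
For readers following along, here is a minimal sketch of what looping over code points could look like. It is not the code from the PR or from StringRenderer: the class name `XmlEncodeSketch`, both method names, and the policy choices (escaping everything above code point 126 as a decimal reference, silently dropping characters outside the XML 1.0 Char range) are assumptions made for illustration only.

```java
public final class XmlEncodeSketch {
    private XmlEncodeSketch() {}

    /** True if cp matches the Char production of XML 1.0 (hypothetical helper). */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes s for XML by walking code points rather than UTF-16 chars. */
    public static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a valid surrogate pair into one code point
            int cp = s.codePointAt(i);
            // advance by 1 char for BMP code points, 2 for supplementary ones
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (!isXml10Char(cp)) {
                        // Assumption: drop unpaired surrogates and forbidden
                        // control characters instead of emitting invalid XML.
                    } else if (cp > 126) {
                        // Assumption: non-ASCII becomes a decimal reference.
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }
}
```

With this sketch, `XmlEncodeSketch.encode("\uD83E\uDE73")` (the 🩳 emoji, U+1FA73) yields the single reference `&#129651;` rather than two surrogate references. For a UTF-8 document, passing the emoji through unescaped would also be well-formed, since U+1FA73 falls inside the XML 1.0 Char range.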