Original comment by mgron...@gmail.com
on 6 May 2009 at 8:40
Original comment by mgron...@gmail.com
on 6 May 2009 at 9:36
The issue is that most control characters are not valid in XML documents. See
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
There's a slight mistranslation in CL 6483676, which uses this data directly.
But Java uses UTF-16 internally, so Unicode characters U+10000 to U+10FFFF are
encoded using the surrogate characters excluded above, 0xD800 to 0xDFFF. So we
should allow just 0x09, 0x0A, 0x0D, and the range 0x20 to 0xFFFD.
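A minimal sketch of the per-char filter described above, for illustration only.
It is not the code from CL 6483676 or from the eventual fix; the class and
method names (XmlCharFilter, isAllowed, stripInvalidXmlChars) are hypothetical.
It assumes invalid characters are dropped rather than replaced, and it works on
UTF-16 code units, so surrogate pairs for U+10000 to U+10FFFF pass through
intact. A lone surrogate would also pass, since the allowed range covers 0xD800
to 0xDFFF.

```java
// Hypothetical helper, not taken from the project source.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if this UTF-16 code unit is allowed: 0x09, 0x0A, 0x0D, or 0x20 to 0xFFFD. */
    static boolean isAllowed(char c) {
        return c == 0x09 || c == 0x0A || c == 0x0D || (c >= 0x20 && c <= 0xFFFD);
    }

    /** Returns a copy of s with every disallowed code unit dropped. */
    static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isAllowed(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a string containing "a", a NUL character, and "b" comes back
as "ab", while tabs, newlines, and surrogate pairs are left untouched.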
If the invalid characters appear in property names, should they be encoded as
underscores (issue 150) or dropped? In other words, should this filtering also
occur on property names before they are underscore-escaped? (Note that the
other order is meaningless; once the names are underscore-escaped, no invalid
XML characters would remain for this filtering.)
Original comment by jl1615@gmail.com
on 9 May 2009 at 11:46
Fixed in r2000.
Issue 150 was dropped (WontFix), so there's no interference there.
Original comment by jl1615@gmail.com
on 16 May 2009 at 12:56
Original issue reported on code.google.com by
jl1615@gmail.com
on 7 Mar 2009 at 7:43