PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Check control characters in the feed #128

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
See Google bug #912246 and CL 6483676. It seems like we should be filtering
control characters from the metadata values to ensure that the feed XML is valid.

Original issue reported on code.google.com by jl1615@gmail.com on 7 Mar 2009 at 7:43

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 6 May 2009 at 8:40

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 6 May 2009 at 9:36

GoogleCodeExporter commented 8 years ago
The issue is that most control characters are not valid in XML documents. See

http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

There's a slight mistranslation in CL 6483676, which uses these ranges directly.
Java uses UTF-16 internally, so Unicode characters U+10000 to U+10FFFF are
encoded as pairs of surrogate code units, 0xD800 to 0xDFFF, which the production
above excludes. So when checking Java chars, we should allow just 0x09, 0x0A,
0x0D, and the range 0x20 to 0xFFFD.
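
A minimal sketch of the per-char check described above, for illustration only.
The class and method names (XmlCharFilter, isAllowed, filter) are hypothetical
and are not taken from CL 6483676 or from the eventual fix; the sketch simply
drops any UTF-16 code unit outside the allowed set.

```java
/** Hypothetical helper; names and API are illustrative assumptions. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /**
     * Returns true if the UTF-16 code unit is in the allowed set:
     * 0x09, 0x0A, 0x0D, or the range 0x20 to 0xFFFD. The range
     * deliberately includes the surrogates 0xD800 to 0xDFFF, so that
     * characters U+10000 to U+10FFFF survive as surrogate pairs.
     */
    static boolean isAllowed(char c) {
        return c == 0x09 || c == 0x0A || c == 0x0D
            || (c >= 0x20 && c <= 0xFFFD);
    }

    /** Returns the value with every disallowed code unit dropped. */
    static String filter(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isAllowed(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, XmlCharFilter.filter would strip a vertical tab (0x0B) or
0xFFFE from a metadata value while leaving tabs, newlines, and surrogate pairs
alone. One caveat: a per-char check like this also lets unpaired surrogates
through, which are not valid XML either; whether that needs separate handling
is left open here.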

If the invalid characters appear in property names, should they be encoded as
underscores (issue 150) or dropped? In other words, should this filtering also
occur on property names before they are underscore-escaped? (Note that the other
order is meaningless; once the names are underscore-escaped, no invalid XML
characters would remain for this filtering.)

Original comment by jl1615@gmail.com on 9 May 2009 at 11:46

GoogleCodeExporter commented 8 years ago
Fixed in r2000.

Issue 150 was dropped (WontFix), so there's no interference there.

Original comment by jl1615@gmail.com on 16 May 2009 at 12:56