amorlzu / pugixml

Automatically exported from code.google.com/p/pugixml

PCDATA with illegal characters leads to a not well-formed XML file #235

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Set the value of a PCDATA node to a string containing bytes outside the XML 
character range (see http://www.w3.org/TR/REC-xml/#NT-Char). The generated 
XML document will then not be well-formed, because the illegal character is 
escaped as a numeric character reference, and a character reference must 
also refer to a character in that range.
See also: http://www.w3.org/TR/REC-xml/#sec-references

What is the expected output? What do you see instead?
The invalid characters should be skipped. The strconv escaping function should 
validate against the character range first. The resulting file should be 
well-formed.

Which version of pugixml are you using? On what operating system/compiler?
1.2, Windows & Linux, MSVC and GCC

Original issue reported on code.google.com by kl.andr...@gmail.com on 17 Jul 2014 at 3:54

GoogleCodeExporter commented 9 years ago
By design, the pugixml parser does not perform Unicode codepoint validation.

Right now the data that you save roundtrips properly - i.e. you can open the 
XML file with pugixml, as well as any other parser that does not perform 
codepoint range validation. In my opinion this is better than skipping data 
that the user wanted to write to the XML.

Since it is impossible to write this kind of data to a well-formed XML 
document, the user has to, uh, not write this data at all.
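
One way a caller might avoid writing such data is to filter the string before 
handing it to the library. The following is only a minimal sketch and is not 
part of pugixml: is_xml_char and strip_illegal_control_bytes are hypothetical 
helper names, and the input is assumed to be UTF-8 (or plain ASCII). It only 
drops the ASCII control bytes that the XML 1.0 Char production forbids; it 
does not decode multi-byte sequences, so U+FFFE, U+FFFF and encoded surrogates 
would pass through unchanged.

```cpp
#include <string>

// True if cp is allowed by the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool is_xml_char(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: removes ASCII control bytes that XML 1.0 forbids.
// This is safe on UTF-8 input because bytes below 0x20 never appear inside
// a multi-byte sequence. Bytes >= 0x20 are copied through without decoding.
std::string strip_illegal_control_bytes(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c >= 0x20 || is_xml_char(c)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

The filtered string could then be passed to the node, for example 
node.set_value(strip_illegal_control_bytes(text).c_str()). Whether silently 
dropping bytes is acceptable depends on the application; rejecting the input 
or encoding it (for example as base64) are alternatives.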

Original comment by arseny.k...@gmail.com on 10 Aug 2014 at 9:50