Shinmera / plump

Practically Lenient and Unimpressive Markup Parser for Common Lisp
https://shinmera.github.io/plump
zlib License

Serializing of Unicode Characters Must Be Regulated #3

Closed. Shinmera closed this issue 9 years ago

Shinmera commented 9 years ago

The XML Specification disallows the usage of certain Unicode characters in an output document. Plump should recognise these and properly encode them when serializing the output.

ruricolist commented 9 years ago

I'm looking into implementing this, but before I begin, let me ask: how XML-compatible do you actually want to be?

I think a reasonable amount of checking would be:

Shinmera commented 9 years ago

My thoughts are as follows:

To be honest, I don't really care about checking attribute names or tag names. I only really care about the former because it's a sneaky issue that has bitten me a couple of times now when I innocently spliced text into a document, and the -- in comments is in a similar vein. Attribute and tag names should not need a check, since they are usually not generated from user content but controlled by the author, who should know better.
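For illustration, a minimal sketch of the comment case. XML 1.0 forbids the sequence -- inside a comment and also forbids comment text that ends in -, since that would collide with the closing -->. The function name below is hypothetical and is not part of Plump's API.

```lisp
;; Hypothetical helper, not part of Plump: checks whether TEXT could be
;; written between <!-- and --> without breaking the XML 1.0 comment rule.
(defun xml-comment-text-p (text)
  "True if TEXT contains no \"--\" and does not end in a hyphen."
  (and (not (search "--" text))
       (or (zerop (length text))
           (char/= #\- (char text (1- (length text)))))))
```

A serializer could call such a predicate before emitting a comment node and either signal an error or rewrite the text when it returns NIL; which of the two is preferable is a design choice this sketch leaves open.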

I'm also not sure if there are spec differences between HTML5 and XML with regard to tag and attribute names. If they are compatible, then adding a check regardless could be worthwhile, but if one of them is much more lenient than the other I wouldn't bother.

Also: The checks for character range validity should be encapsulated as separate, exported functions (from the plump-dom package) to allow users to easily check or trim the DOM ahead of time, or something like that.
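A rough sketch of what such standalone checks might look like, assuming the Char production of XML 1.0 (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) is the rule to enforce. All three function names are made up for this example and are not Plump's actual exports.

```lisp
;; Hypothetical names throughout; a sketch, not Plump's implementation.
(defun xml-char-p (char)
  "True if CHAR falls within the XML 1.0 Char production."
  (let ((code (char-code char)))
    (or (= code #x9) (= code #xA) (= code #xD)
        (<= #x20 code #xD7FF)
        (<= #xE000 code #xFFFD)
        (<= #x10000 code #x10FFFF))))

(defun first-invalid-xml-char (string)
  "Position of the first character in STRING that XML 1.0 disallows, or NIL."
  (position-if (complement #'xml-char-p) string))

(defun strip-invalid-xml-chars (string)
  "Copy of STRING with every character XML 1.0 disallows removed."
  (remove-if-not #'xml-char-p string))
```

With a predicate, a locator, and a trimmer kept separate, a user could validate text before splicing it into the DOM, or trim it up front, without going through the serializer at all.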