Serializing of Unicode Characters Must Be Regulated

Shinmera commented 9 years ago

The XML Specification disallows the usage of certain Unicode characters in an output document. Plump should recognise these and properly encode them when serializing the output.

ruricolist commented 9 years ago

I'm looking into implementing this, but before I begin, let me ask: how XML-compatible do you actually want to be?

I think a reasonable amount of checking would be:

Every character in a text node is one that XML allows.
Element and attribute names match the NCName production.
The forbidden string -- does not occur in a comment.
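
For illustration only, the name and comment checks could look roughly like the sketch below. It is written in Python to keep it short and testable and is not Plump code; the names is_ncname and comment_ok are hypothetical. The character ranges are the NameStartChar and NameChar productions from XML 1.0 (Fifth Edition) with the colon removed, which is what NCName requires.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), without ":" because
# NCName (Namespaces in XML) forbids colons.
_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, U+00B7 and two further ranges.
_REST = _START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"
_NCNAME = re.compile(f"[{_START}][{_REST}]*")


def is_ncname(name: str) -> bool:
    """Return True if name matches the NCName production."""
    return _NCNAME.fullmatch(name) is not None


def comment_ok(text: str) -> bool:
    """Return True if text can be written between <!-- and -->."""
    # The Comment production forbids "--" anywhere, and also a trailing
    # "-", because that would end the comment with "--->".
    return "--" not in text and not text.endswith("-")
```

Note that a trailing hyphen is rejected as well as the -- string itself, since the grammar rules out both.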

Shinmera commented 9 years ago

My thoughts are as follows:

When serialising text-nodes or fulltext-nodes, check whether each character falls within the allowed ranges. Each check is wrapped in restarts that either put the character into the document anyway or skip it. If a discouraged character is encountered, a warning is signalled and the character is dropped from the output by default. If a disallowed character is encountered, an error is signalled.
If -- occurs in a comment, signal a warning with the same restarts as above and drop it by default.
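
A minimal sketch of the character check described above, again in Python and not Plump code. Python has no restarts, so a policy callback stands in for them here: it returns "keep" or "skip", or raises. All names (is_allowed, is_discouraged, filter_text, default_policy, DisallowedCharacter) are hypothetical. The ranges are the Char production and the list of discouraged characters from section 2.2 of XML 1.0.

```python
import warnings


class DisallowedCharacter(ValueError):
    """Raised for a code point outside the XML 1.0 Char production."""


def is_allowed(cp: int) -> bool:
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    #        | [#x10000-#x10FFFF]
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_discouraged(cp: int) -> bool:
    # Control-character blocks, the U+FDD0..U+FDEF noncharacters, and the
    # last two code points of every supplementary plane.
    return (0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F
            or 0xFDD0 <= cp <= 0xFDEF
            or (cp >= 0x10000 and cp & 0xFFFF >= 0xFFFE))


def default_policy(char: str, kind: str) -> str:
    # Default behaviour from the comment above: warn and drop discouraged
    # characters, raise an error for disallowed ones.
    if kind == "discouraged":
        warnings.warn(f"dropping discouraged character U+{ord(char):04X}")
        return "skip"
    raise DisallowedCharacter(f"U+{ord(char):04X} is not allowed in XML")


def filter_text(text: str, policy=default_policy) -> str:
    """Return text with each problem character kept or dropped per policy."""
    out = []
    for char in text:
        cp = ord(char)
        if not is_allowed(cp):
            kind = "disallowed"
        elif is_discouraged(cp):
            kind = "discouraged"
        else:
            out.append(char)
            continue
        # The policy plays the role of the two restarts: "keep" or "skip".
        if policy(char, kind) == "keep":
            out.append(char)
    return "".join(out)
```

A caller who wants everything questionable silently removed could pass policy=lambda char, kind: "skip". In Lisp the same choice would be made by invoking a restart from a handler instead of passing a callback.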

To be honest, I don't really care about checking attribute or tag names. I only really care about the character check because it's a sneaky issue that has bitten me a couple of times now when I innocently spliced text into a document, and the -- in comments is in a similar vein. Attributes and tags should not need a check, since those are usually not generated from user content but controlled by the author, who should know better.

I'm also not sure if there are spec differences between HTML5 and XML with regard to tag and attribute names. If they are compatible, then adding a check regardless could be worthwhile, but if one of them is much more lenient than the other I wouldn't bother.

Also, the checks for character range validity should be encapsulated as separate functions exported from the plump-dom package, so that users can easily check or trim the DOM ahead of time.

Shinmera / plump

Serializing of Unicode Characters Must Be Regulated #3