Attribute value normalization is performed incorrectly for character references

dbuenzli commented 10 years ago

Related to #5 (cc @chrbauer).

In fact according to the spec any character reference should be appended to the attribute value as is (and thus translation of white space character other than U+0020 to U+0020 should not happen on character references – but still on characters). The verbiage is here:

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.

However in Xmlm we apply the normalization to U+0020 to any reference (character or entity), see this line. This behaviour is correct for entity references but not for character references, we need to distinguish between the two.

This means that the following is wrong:

# let s = Xmlm.make_input (`String (0, "<e a='&#x0A;v'/>"));;
val s : Xmlm.input = <abstr>
# Xmlm.input s;;
- : Xmlm.signal = `Dtd None
# Xmlm.input s;;
- : Xmlm.signal = `El_start (("", "e"), [(("", "a"), "v")])

The attribute value should be "\nv", regardless of whether we perform whitespace collapsing and stripping or not.

Damned I always thought I had the done the crazy xml whitespace thing right.

dbuenzli commented 10 years ago

Now that is annoying since the strip argument said that when true we perform the same normalization as in attributes; note this is not standard whatsoever this is a mere convenience, the only thing the standard says about whitespace in character data is this, the strip argument did somehow do part of the application job of ignoring surrounding whitespace around markup while respecting the xml:space attribute for you since it is so common.

I highly suspect that if I change character data parsing to the above attribute data behaviour it may break assumptions people made while parsing with strip:true. Here are a few possibilities to be thought together with the problem of #5 which I was planning to solve by introducing a ?strip_atts optional attribute

Uniform and correct. Fix the bug and make the same behaviour on character data. Introduce the ?strip_atts optional attribute to disable leading and trailing U+0020 stripping and U+0020 collapsing.
Non-uniform and correct. Fix the bug for attribute data, keep the existing behaviour on character data. Introduce the ?strip_atts optional attribute to disable leading and trailing U+0020 stripping and U+0020 collapsing.
Fully flexible. By default keep the current wrong behaviour. Introduce flags so you can choose
1. Current behaviour on attribute data or correct behaviour.
2. Disable leading and trailing U+0020 stripping and U+0020 collapsing in attribute data.
3. Current behaviour on character data or same as attribute data.

Aaargh this was one of the few things to get right and I managed to get it wrong. @dsheets, @edwintorok any thoughts on that ?

dbuenzli commented 10 years ago

Having the correct behaviour of attribute data normalization in character data would enable to solve the problem of @chris00 in issue #2 (albeit not in the way he wanted to, my statement about CDATA is still true but he could solve it by having explicit &#xA for the newlines, though in that case it seems preferable to use an "xml:space=preserve attribute" on the element).

edwintorok commented 10 years ago

On 10/01/2014 11:54 PM, Daniel Bünzli wrote:

Aaargh this was one of the few things to get right and I managed to get it wrong. @dsheets https://github.com/dsheets, @edwintorok https://github.com/edwintorok any thoughts on that ?

I'm not familiar with all the tricky whitespace rules in XML, but I wouldn't expect the parser to collapse whitespace inside attributes by default, or there should be a way to turn it off. I don't have a strong opinion whether there should be one attribute (?strip) or two (?strip and ?strip_atts) to control this behaviour.

I'm not sure about the flag for the wrong behaviour: I can see the value for not breaking backward compatibility, but by that reasoning there should be a flag to turn each bug on/off ... so even if you introduce it please mark it as deprecated and remove after a few more cycles.

Best regards, --Edwin

dbuenzli / xmlm

Attribute value normalization is performed incorrectly for character references #7