Open dbuenzli opened 10 years ago
Now that is annoying, since the strip argument said that when true we perform the same normalization as in attributes. Note that this is not standard whatsoever; it is a mere convenience. The only thing the standard says about whitespace in character data is this. The strip argument did part of the application's job for you, ignoring whitespace around markup while respecting the xml:space attribute, since that is so common.
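For readers unfamiliar with the rule, here is a minimal sketch of what "the same normalization as in attributes" amounts to, assuming the XML rule for non-CDATA attribute values: white space becomes U+0020, leading and trailing U+0020 are dropped, and runs of U+0020 collapse to one. This is not Xmlm's code; `normalize_spaces` and `is_xml_space` are hypothetical names, and end-of-line handling is assumed to have happened already.

```ocaml
(* Sketch of attribute-style whitespace normalization, not Xmlm's code.
   [normalize_spaces] and [is_xml_space] are hypothetical helper names. *)

let is_xml_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Map every XML white space character to U+0020, drop leading and
   trailing U+0020, and collapse inner runs of U+0020 to a single one. *)
let normalize_spaces (s : string) : string =
  let b = Buffer.create (String.length s) in
  let pending = ref false in
  String.iter
    (fun c ->
      if is_xml_space c then pending := true
      else begin
        (* Emit one space only between two non-space characters. *)
        if !pending && Buffer.length b > 0 then Buffer.add_char b ' ';
        pending := false;
        Buffer.add_char b c
      end)
    s;
  Buffer.contents b

let () = assert (normalize_spaces "  a \n\t b  " = "a b")
```

Applied to character data, this is the behaviour that code written against strip:true may or may not have been relying on.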
I highly suspect that if I change character data parsing to the above attribute data behaviour, it may break assumptions people made while parsing with strip:true. Here are a few possibilities, to be considered together with the problem of #5, which I was planning to solve by introducing a ?strip_atts optional attribute:
?strip_atts optional attribute to disable leading and trailing U+0020 stripping and U+0020 collapsing.
Aaargh, this was one of the few things to get right and I managed to get it wrong. @dsheets, @edwintorok any thoughts on that?
Having the correct behaviour of attribute data normalization in character data would make it possible to solve @chris00's problem in issue #2 (albeit not in the way he wanted: my statement about CDATA is still true, but he could solve it by using explicit character references for the newlines, though in that case it seems preferable to use an xml:space="preserve" attribute on the element).
On 10/01/2014 11:54 PM, Daniel Bünzli wrote:
> Aaargh this was one of the few things to get right and I managed to get it wrong. @dsheets https://github.com/dsheets, @edwintorok https://github.com/edwintorok any thoughts on that ?
I'm not familiar with all the tricky whitespace rules in XML, but I wouldn't expect the parser to collapse whitespace inside attributes by default, or there should be a way to turn it off. I don't have a strong opinion whether there should be one attribute (?strip) or two (?strip and ?strip_atts) to control this behaviour.
I'm not sure about the flag for the wrong behaviour: I can see its value for not breaking backward compatibility, but by that reasoning there should be a flag to turn each bug on/off ... so even if you introduce it, please mark it as deprecated and remove it after a few more cycles.
Best regards, --Edwin
Related to #5 (cc @chrbauer).
In fact, according to the spec, any character reference should be appended to the attribute value as is (and thus translation of white space characters other than U+0020 to U+0020 should not happen on character references, but still on characters). The verbiage is here:
However, in Xmlm we apply the normalization to U+0020 to any reference (character or entity), see this line. This behaviour is correct for entity references but not for character references; we need to distinguish between the two.
This means that the following is wrong:
The attribute value should be "\nv", regardless of whether we perform whitespace collapsing and stripping or not.
Damned, I always thought I had done the crazy XML whitespace thing right.
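To make the distinction concrete, here is a minimal sketch of the rule described above, assuming the value has already been split into literal text and character references. The type `piece` and the function `attribute_value` are hypothetical names, not part of Xmlm; entity references, end-of-line handling, and the optional stripping and collapsing step are left out.

```ocaml
(* Sketch only: [piece] and [attribute_value] are hypothetical names,
   not part of Xmlm. Entity references are not modelled. *)

type piece =
  | Literal of string  (* characters written directly in the document *)
  | Char_ref of int    (* code point named by a character reference *)

let is_xml_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

let attribute_value (pieces : piece list) : string =
  let b = Buffer.create 16 in
  let add = function
    | Literal s ->
        (* Literal white space is translated to U+0020. *)
        String.iter
          (fun c -> Buffer.add_char b (if is_xml_space c then ' ' else c))
          s
    | Char_ref cp ->
        (* The referenced character is appended as is, white space or not. *)
        Buffer.add_utf_8_uchar b (Uchar.of_int cp)
  in
  List.iter add pieces;
  Buffer.contents b

let () =
  (* A reference to U+000A followed by "v" keeps the newline. *)
  assert (attribute_value [ Char_ref 0xA; Literal "v" ] = "\nv");
  (* A literal newline in the same position becomes a space. *)
  assert (attribute_value [ Literal "\nv" ] = " v")
```

Keeping the two cases as separate constructors is one way to ensure the parser cannot apply the U+0020 translation to a character reference by accident.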