Closed WhyNotHugo closed 1 year ago
I see that this behaviour was introduced in the initial commit: 45a0a40181add7defdaa128416ade622942dfaa1:
// \r in \r\n should be ignored.
I know it's been a while, but do you have context for this? I'm pretty sure that this behaviour is wrong, since it discards meaningful data inside text nodes. Both caldav and carddav rely on being able to encode \r\n
in the node's text.
Huh, this XML spec says that this replacement MUST be done for "entities": https://www.w3.org/TR/xml/#sec-line-ends
It text
itself also considered an entity?
I can't imagine that escaping the text is also implied here, since that would make it impossible to properly parse a CalDav/CardDav response.
Testing a couple of other XML parsers, the \r\n
is properly retained.
Nextcloud, Xandikos, Radicale and Baikal all return XML data with \r\n
in the text (when fetching icalendar components) and dropping the \r
would corrupt the components. Given the large variety of clients that work with these servers, I'm going to assume that their XML parsers aren't dropping the \r
.
So at this point, I'm convinced that the \r
should not be dropped (and that this is actually a bug). But I'm still somewhat confused by the comment in https://www.w3.org/TR/xml/#sec-line-ends
Oh, finally found something useful:
Finally, when used in a calendaring REPORT response, the CALDAV: calendar-data XML element specifies the content of a calendar object resource. Given that XML parsers normalize the two- character sequence CRLF (US-ASCII decimal 13 and US-ASCII decimal 10) to a single LF character (US-ASCII decimal 10), the CR character (US-ASCII decimal 13) MAY be omitted in calendar object resources specified in the CALDAV:calendar-data XML element.
So it looks like the other XML parsers that I tried are technically not compliant.
Sorry for the noise, and hopefully this will help some other confused soul some day.
Actually, the above merely says that the server may omit it. I see that \r\n
is retained properly in CDATA, so I'd rather leave this open since I'm still not 100% sure if the parser's behaviour is correct here or not.
Replacing \r\n
with \n
is the correct behavior for an XML parser. We do have a test for this: tests/files/text_007.xml
Python's lxml
library would replace it as well. You can try it using:
cd testing-tools
python3 lxml-ast.py ../tests/files/text_007.xml
Document:
- Element:
tag_name: root
children:
- Text: "\n\n"
We could add an option to ignore this replacement, but only if you can guarantee that this is required by the caldav spec.
From the XML spec:
The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
The CalDav spec doesn’t require an exception on the parser. So I think that the current behaviour is correct.
So I’ve resorted to replacing \n with \r\n to restore the original data on my side.
I’m noticing that the \r\n is retained for CDATA, so I kinda have to do some special matching to restore these. There’s no way to tell of the text comes from regular text or CDATA, right?
Yes, CDATA nodes are not preserved. Again, just like in lxml
.
Also, lxml
processed \r
even in CDATA. That's a bug...
Created #103
I'm using
roxmltree
to parse some responses from CalDav servers.I've recently noticed that
roxmltree::Node::text
androxmltree::Node::text_storage
"swallow" all\r
characters. That is, given this input:The following snippet:
Yields:
The
\r
character is significant in icalendar, which is what is contained inline in this situation: