RazrFalcon / roxmltree

Represent an XML document as a read-only tree.
Apache License 2.0
434 stars 37 forks source link

roxmltree::Node::text swallows `\r` #102

Closed WhyNotHugo closed 1 year ago

WhyNotHugo commented 1 year ago

I'm using roxmltree to parse some responses from CalDav servers.

I've recently noticed that roxmltree::Node::text and roxmltree::Node::text_storage "swallow" all \r characters. That is, given this input:

&body = "<?xml version='1.0' encoding='utf-8'?>\n<multistatus xmlns=\"DAV:\" xmlns:C=\"urn:ietf:params:xml:ns:caldav\"><response><href>/test/KXdlZlBBddMCmO8Z/vCw53vKC8poh.ics</href><propstat><prop><getetag>\"4a978b6cc52d2a660ca403c4e3ca47fbde34940255815aaa89637c2b1bf50e85\"</getetag><C:calendar-data>BEGIN:VCALENDAR\r\nVERSION:2.0\r\nPRODID:-//hacksw/handcal//NONSGML v1.0//EN\r\nBEGIN:VEVENT\r\nUID:dvppRGxM7Ane\r\nDTSTART:19970714T170000Z\r\nDTSTAMP:19970610T172345Z\r\nSUMMARY:hello\r\nEND:VEVENT\r\nEND:VCALENDAR\r\n</C:calendar-data></prop><status>HTTP/1.1 200 OK</status></propstat></response></multistatus>"

The following snippet:

let text = response
    .descendants()
    .find(|node| node.tag_name() == *property)
    .ok_or(DavError::InvalidResponse(
        format!("missing {} in response", property.name()).into(),
    ))?
    .text();
let text2 = response
    .descendants()
    .find(|node| node.tag_name() == *property)
    .ok_or(DavError::InvalidResponse(
        format!("missing {} in response", property.name()).into(),
    ))?
    .text_storage();
dbg!(&text, &text2);

Yields:

[libdav/src/dav.rs:881] &text = Some(
    "BEGIN:VCALENDAR\nVERSION:2.0\nPRODID:-//hacksw/handcal//NONSGML v1.0//EN\nBEGIN:VEVENT\nUID:dvppRGxM7Ane\nDTSTART:19970714T170000Z\nDTSTAMP:19970610T172345Z\nSUMMARY:hello\nEND:VEVENT\nEND:VCALENDAR\n",
)
[libdav/src/dav.rs:881] &text2 = Some(
    Owned(
        "BEGIN:VCALENDAR\nVERSION:2.0\nPRODID:-//hacksw/handcal//NONSGML v1.0//EN\nBEGIN:VEVENT\nUID:dvppRGxM7Ane\nDTSTART:19970714T170000Z\nDTSTAMP:19970610T172345Z\nSUMMARY:hello\nEND:VEVENT\nEND:VCALENDAR\n",
    ),
)

The \r character is significant in icalendar, which is what is contained inline in this situation:

The iCalendar object is organized into individual lines of text, called content lines. Content lines are delimited by a line break, which is a CRLF sequence (CR character followed by LF character).

WhyNotHugo commented 1 year ago

I see that this behaviour was introduced in the initial commit: 45a0a40181add7defdaa128416ade622942dfaa1:

// \r in \r\n should be ignored.

I know it's been a while, but do you have context for this? I'm pretty sure that this behaviour is wrong, since it discards meaningful data inside text nodes. Both caldav and carddav rely on being able to encode \r\n in the node's text.

WhyNotHugo commented 1 year ago

Huh, this XML spec says that this replacement MUST be done for "entities": https://www.w3.org/TR/xml/#sec-line-ends

It text itself also considered an entity?

I can't imagine that escaping the text is also implied here, since that would make it impossible to properly parse a CalDav/CardDav response.

WhyNotHugo commented 1 year ago

Testing a couple of other XML parsers, the \r\n is properly retained.

Nextcloud, Xandikos, Radicale and Baikal all return XML data with \r\n in the text (when fetching icalendar components) and dropping the \r would corrupt the components. Given the large variety of clients that work with these servers, I'm going to assume that their XML parsers aren't dropping the \r.

So at this point, I'm convinced that the \r should not be dropped (and that this is actually a bug). But I'm still somewhat confused by the comment in https://www.w3.org/TR/xml/#sec-line-ends

WhyNotHugo commented 1 year ago

Oh, finally found something useful:

  Finally, when used in a calendaring REPORT response, the CALDAV:
  calendar-data XML element specifies the content of a calendar
  object resource.  Given that XML parsers normalize the two-
  character sequence CRLF (US-ASCII decimal 13 and US-ASCII decimal
  10) to a single LF character (US-ASCII decimal 10), the CR
  character (US-ASCII decimal 13) MAY be omitted in calendar object
  resources specified in the CALDAV:calendar-data XML element.

So it looks like the other XML parsers that I tried are technically not compliant.

Sorry for the noise, and hopefully this will help some other confused soul some day.

WhyNotHugo commented 1 year ago

Actually, the above merely says that the server may omit it. I see that \r\n is retained properly in CDATA, so I'd rather leave this open since I'm still not 100% sure if the parser's behaviour is correct here or not.

RazrFalcon commented 1 year ago

Replacing \r\n with \n is the correct behavior for an XML parser. We do have a test for this: tests/files/text_007.xml Python's lxml library would replace it as well. You can try it using:

cd testing-tools
python3 lxml-ast.py ../tests/files/text_007.xml
Document:
  - Element:
      tag_name: root
      children:
        - Text: "\n\n"

We could add an option to ignore this replacement, but only if you can guarantee that this is required by the caldav spec.

RazrFalcon commented 1 year ago

From the XML spec:

The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.

WhyNotHugo commented 1 year ago

The CalDav spec doesn’t require an exception on the parser. So I think that the current behaviour is correct.

So I’ve resorted to replacing \n with \r\n to restore the original data on my side.

I’m noticing that the \r\n is retained for CDATA, so I kinda have to do some special matching to restore these. There’s no way to tell of the text comes from regular text or CDATA, right?

RazrFalcon commented 1 year ago

Yes, CDATA nodes are not preserved. Again, just like in lxml.

Also, lxml processed \r even in CDATA. That's a bug...

RazrFalcon commented 1 year ago

Created #103