RazrFalcon / roxmltree

Represent an XML document as a read-only tree.
Apache License 2.0
434 stars 37 forks source link

Possibly incorrect "children()" behaviour #90

Closed ChillyVanilly77 closed 1 year ago

ChillyVanilly77 commented 1 year ago

Hi!

I just ran a unit-test while parsing some XML file and I noticed something weird which I am seeing as possibly incorrect behaviour, but that is where I wish for you to tell me whether that is the case or not please.

The XML snippet in question:

<scene output_file="myImage.png">
    <background_color r="255" g="0" b="0"/>
    <camera>
        <position x="1.0" y="-2.0E-10" z="-3"/>
        <lookat x="1" y="2" z="3"/>
        <up x="1" y="2" z="3"/>
        <horizontal_fov angle="90"/>
        <resolution horizontal="1920" vertical="1080"/>
        <max_bounces n="100"/>
    </camera>
... (other elements omitted)
</scene>

code snippet in question:

assert!(
            camera_elem.children().count() == 6,
            // camera_elem.children().filter(|node| node.is_element()).count() == 6,
            "There WASN'T 6 child elements of the `camera` element! Instead, found the following children:
            {:?}", camera_elem.children().map(|child_elem| child_elem.tag_name().name()).collect::<Vec<&str>>()
        );

Failed test-assertion response of mine:

There WASN'T 6 child elements of the `camera` element! Instead, found the following children:
            ["", "position", "", "lookat", "", "up", "", "horizontal_fov", "", "resolution", "", "max_bounces", ""]

Since there is no NodeType::Attribute (which of course could be that this isn't even meant to exist according to the XML specification, but I don't know that at this point), I was a bit confused, what would those empty string-slices refer to exactly? And since the camera's children are all EmptyElemTag (as specified here), there should not be Text-nodes, right?

Cheers!

adamreichold commented 1 year ago

My guess is these are text node children of the camera element which correspond to the indentation in your example (camera itself is not EmtpyElemTag).

ChillyVanilly77 commented 1 year ago

Oh, so if >= 1 whitespace character(s) is/are detected right after a tag, it is interpreted/considered as a TextNode? Hmm...interesting, that I did not consider. But this is desired and spec-adhereing behaviour then, is it?

tomjw64 commented 1 year ago

It is indeed the indentation:

// Using roughly the same markup...
dbg!(camera_elem.children().filter(|n| n.is_text()).map(|n| n.text()).collect::<Vec<_>>());
// => [
//     Some(
//         "\n\t\t",
//     ),
//     Some(
//         "\n\t\t",
//     ),
//     Some(
//         "\n\t\t",
//     ),
//     Some(
//         "\n\t\t",
//     ),
//     Some(
//         "\n\t\t",
//     ),
//     Some(
//         "\n\t\t",
//     ),
//     Some(
//         "\n\t",
//     ),
// ]

To some, this whitespace is annoying and insignificant, but AFAIK there is no such thing as "insignificant whitespace" as far as the XML spec is concerned. From the same link you provided:

An XML processor MUST always pass all characters in a document that are not markup through to the application.

And as you noted, roxmltree exhibits this conforming behavior by providing these characters in a text node.

RazrFalcon commented 1 year ago

As others mentioned, this is the correct behavior. Unlike the popular believe, there are no indention in XML. All of those leading spaces are text nodes.

Some parsers might replaces multiple whitespaces with one, accounting the xml:space attribute, but a text node remains. Same with comments between elements.

What you want is #91, which hopefully will be added soon.

Afaik, whitespaces trimming in XML is domain specific. An XML library can't handle it on its own. For example, an SVG like this:

<text>
    Text
</text>

should be parsed as Text. No spaces and new lines at all.

But something like this:

<text>
    Some
    <tspan>text</tspan>
</text>

should be parsed as Some text. Now with a space. Why? Because XML.

ChillyVanilly77 commented 1 year ago

Ah yes, I was just about to ask regarding Node::children_elements for convenience, since this may not be apparent to some.

Although I did read ~40% of the XML spec and used it for anchor-tag parsing with RegEx, I do not know it "by heart" or as details as you :))

Thank you all for the amazing responses, infos and clarifications! Cheers!

RazrFalcon commented 1 year ago

In the end, I think the right way to do this is to use:

for element in node.children().filter(Node::is_element) {
    println!("{:?}", element.tag_name());
}

Sure, it's a bit verbose, but much more flexible. The main idea behind roxmltree is that it doesn't follow web/js DOM api and instead relies heavily on Rust's Iterator.

It's even mentioned in the readme, but even I somehow forgot about it...

We even have a dedicated example just for your case. I completely forgot about it. Well, it was written 4 year ago.

ChillyVanilly77 commented 1 year ago

Yes yes I did notice that, though it seems at times as if it was supposed to be a flexible mix of raw node access & convenience where needed :) But yeah, I went with that option (as was commented out in the code snippet) ^^