KWARC / rust-libxml

Rust wrapper for libxml2
https://crates.io/crates/libxml
MIT License

Node Iterators #71

Closed npajkovsky closed 4 years ago

npajkovsky commented 4 years ago

Hello,

Thanks for a very good library. Is there any chance that you could implement iterators for Node and RoNode?

dginev commented 4 years ago

Hi @npajkovsky and thank you!

What types of iterators are you looking for? Depth-first and breadth-first traversals over a document's nodes? Or is there some other use case you have in mind?

Currently the wrapper provides various calls that return vectors of Nodes (Vec<Node>), e.g. when getting children or evaluating an XPath expression, which you can easily iterate over afterwards. Global document traversal didn't originally exist in libxml2 (if I remember correctly), so it wasn't ported over here. It's usually just a few lines of code to write a loop that traverses in the order you're interested in, so I didn't consider writing an API extension before. Could you tell me a little more about what you expect a node iterator to be used for? And how is it different from, say, node.get_child_nodes().iter()?
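To illustrate the "few lines of code" loop mentioned above, here is a minimal sketch of a non-recursive breadth-first iterator built on a queue. It is not part of rust-libxml. `BreadthFirst`, `Toy`, and `breadth_first_names` are hypothetical names introduced only for this example, and `Toy` stands in for a real node type so the block runs without libxml installed.

```rust
use std::collections::VecDeque;

/// Breadth-first iterator over any tree; `children` lists a node's children.
struct BreadthFirst<T, F>
where
    F: Fn(&T) -> Vec<T>,
{
    queue: VecDeque<T>,
    children: F,
}

impl<T, F> BreadthFirst<T, F>
where
    F: Fn(&T) -> Vec<T>,
{
    fn new(root: T, children: F) -> Self {
        BreadthFirst {
            queue: VecDeque::from(vec![root]),
            children,
        }
    }
}

impl<T, F> Iterator for BreadthFirst<T, F>
where
    F: Fn(&T) -> Vec<T>,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        // Take the oldest queued node, then queue its children behind
        // everything already waiting; that ordering makes the walk breadth-first.
        let node = self.queue.pop_front()?;
        self.queue.extend((self.children)(&node));
        Some(node)
    }
}

/// Hypothetical stand-in for an XML node (an assumption, not a libxml type).
#[derive(Clone)]
struct Toy {
    name: &'static str,
    kids: Vec<Toy>,
}

/// Builds <a><b><d/></b><c/></a> and returns the names in breadth-first order.
fn breadth_first_names() -> Vec<&'static str> {
    let d = Toy { name: "d", kids: vec![] };
    let b = Toy { name: "b", kids: vec![d] };
    let c = Toy { name: "c", kids: vec![] };
    let a = Toy { name: "a", kids: vec![b, c] };
    BreadthFirst::new(a, |n: &Toy| n.kids.clone())
        .map(|n| n.name)
        .collect()
}
```

With the wrapper itself, the same iterator could presumably be driven by a closure such as `|n| n.get_child_nodes()`, assuming that call returns a `Vec<Node>` as described above. Swapping `pop_front` for `pop_back` would turn the walk into a depth-first one that visits siblings in reverse order.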

npajkovsky commented 4 years ago

> Hi @npajkovsky and thank you!
>
> What types of iterators are you looking for? Depth-first and breadth-first traversals over a document's nodes? Or is there some other use case you have in mind?

I would like to have a non-recursive breadth-first traversal.

> Currently the wrapper provides various calls that return vectors of Nodes (Vec<Node>) e.g. when getting children, or calling an xpath, which you can easily iterate on after.

I must have overlooked the Vec calls. I will give them a shot.

> Global document traversal didn't originally exist in libxml2 (if I remember correctly) so it wasn't ported over here.

That's correct.

> It's usually just a few lines of code to write a loop that traverses in the order you're interested in, so I didn't consider writing an API extension before.

It's super easy to write a recursive in-order traversal, but writing a non-recursive depth-first or breadth-first one is not that fun.

> Could you tell me a little more about what you expect a node iterator to be used for? And how it is different from say node.get_child_nodes().iter()

I'm currently using roxmltree, but the library is not production ready. The unzipped XML file, 20200331_OB_554782_UZSZ.xml, is about 982 MB; parsing it took 16 GB of RAM plus 8 GB of swap, and it was literally painful.

 https://vdp.cuzk.cz/vymenny_format/soucasna/20200331_OB_554782_UZSZ.xml.zip

I'm using it like this:

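    // Walk every node of the document via roxmltree's `descendants()` iterator.
    for node in doc.descendants() {
        // Skip text, comments, and other non-element nodes.
        if !node.is_element() {
            continue;
        }
        // Only elements in the VF namespace are of interest.
        if node.tag_name().namespace() != Some(xml_parser::XMLNS_VF) {
            continue;
        }
        // Skip elements the parser does not recognize.
        let rec = match xml_parser::parse_vf_element(node) {
            Some(r) => r,
            None => continue,
        };
        // ... `rec` is processed here ...
    }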

But thinking about it, traversing nodes I'm not interested in is a waste of cycles, so I should probably be fine with node.get_child_nodes().iter().

dginev commented 4 years ago

Right, so for this use case you can write the traversal relatively simply and, as you say, code which nodes to skip and which to include according to your particular application logic. If you have a structural pattern, using the XPath engine will be even faster and simpler in many cases.
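As a rough sketch of what "code which nodes to skip and which to include" could look like, the function below does a non-recursive pre-order walk with an explicit stack and never enters subtrees that the application rejects. Everything here is an assumption for illustration: `Toy`, `walk_pruned`, and `pruned_names` are hypothetical names, and `Toy` stands in for a real node type so the block runs without libxml.

```rust
/// Hypothetical stand-in for an XML node (an assumption, not a libxml type).
struct Toy {
    name: &'static str,
    kids: Vec<Toy>,
}

/// Non-recursive pre-order walk that never enters subtrees rejected by
/// `descend`, and returns only the nodes accepted by `keep`.
fn walk_pruned<'a>(
    root: &'a Toy,
    descend: impl Fn(&Toy) -> bool,
    keep: impl Fn(&Toy) -> bool,
) -> Vec<&'a Toy> {
    let mut found = Vec::new();
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        if keep(node) {
            found.push(node);
        }
        if descend(node) {
            // Push children in reverse so the first child is popped first.
            stack.extend(node.kids.iter().rev());
        }
    }
    found
}

/// Builds <root><skip><x/></skip><data><x/><y/></data></root>, prunes the
/// `skip` subtree, and returns the names of the remaining leaf nodes.
fn pruned_names() -> Vec<&'static str> {
    let leaf = |name| Toy { name, kids: vec![] };
    let tree = Toy {
        name: "root",
        kids: vec![
            Toy { name: "skip", kids: vec![leaf("x")] },
            Toy { name: "data", kids: vec![leaf("x"), leaf("y")] },
        ],
    };
    let names: Vec<&'static str> =
        walk_pruned(&tree, |n| n.name != "skip", |n| n.kids.is_empty())
            .into_iter()
            .map(|n| n.name)
            .collect();
    names
}
```

The `x` under `skip` is never visited, which is the saving over a full-document iterator: the cost is proportional to the subtrees the application actually cares about. When the interesting nodes follow a fixed structural pattern, an XPath query is likely the shorter route.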

I will close here for now, since going too far beyond libxml itself is modestly out of scope for the wrapper crate. Ideally we keep it minimal when reasonable.