Logic for drop - Githubissues

triptec commented 6 years ago

So I think I rather like to know how we should solve this. Now there's no drop logic for nodes at all.

Lets start with my use case, I traverse a document, find node that I want to remove and call unlink() on them. As it is today those unlinked nodes gives a memory leak.

What I would propose is to add a pub fn remove(self){} to node, that would both unlink and free.

lweberk commented 6 years ago

The drop could check for the existence of a parent and/or document reference pointer.

If parent node is none (NULL), then we can safely assume we are a root node thus having to free ourself and our children (recursively done by xmlFreeNode). Otherwise the node is a child to another node in the tree which would be responsible of cleaning us and our children up instead.

What I could not figure out is, whether xmlUnlinkNode also severs the relationship of the node to the document context (node->doc relationship). It would be logical, but I'm not 100% on that. It is trivially testable though (by checking if node->doc pointer is NULL after unlinking).

If everything above holds, then detach/unlink with an implicit deletion via drop might be better? It gives both a way to "unlink and move/consume" as well as "unlink and ignore" with drop as end result, both in one interface.

triptec commented 6 years ago

I spent some time reading the libxml2 source yesterday and what I found was that there wasn't a strait forward way to detect if a node has been detached/unlinked as there are nodes that haven't got a parent, namely namespace nodes(https://github.com/GNOME/libxml2/blob/790c230fcfbc3d9d5cc272479dc6d0d8765fb9d7/tree.c#L3806). So that at least we would have to check. And maybe more.

I was kind of leaning to adding a detached or something to the Node struct but that doesn't seem right either so I just kinda gave up and have my own unlink that also frees. =(

dginev commented 6 years ago

Did you already get a chance to see my attempt at implementing a drop for Documents in global.rs? https://github.com/KWARC/rust-libxml/blob/master/src/global.rs

The idea there was that we can bookkeep on the Rust side of things when the last libxml object is dropped from Rust, so that we perform the global clean operations on the C-side.

I think that approach could be extended to bookkeeping individual pointers, or we can use something less homebacked such as Rc, to guarantee knowledge of the final drop of a Node.

Btw I am aware of namespaces and attributes also being nodes in libxml2, but the great news is - they are different structs in the Rust wrapper. So when we implement Drop logic for a Node struct, we only need to think of actual XML nodes. We can (and will have to) revisit drops for all other datastructures, definitely Namespace, and possibly Attribute - if I remember correctly the wrapper has shied away from exposing attributes too deeply for now, and generally passes around string values for simplicity (and memory sanity).

Thanks for pursuing this!

dginev commented 6 years ago

I can take a stab at this soon myself if preferred btw, my main goal being that I would love if Rust authors do not have to think of the C datastructures, and can pretend to be working in pure Rust. And one never explicitly frees in pure Rust. It is a major relief when "drops just work" -- btw something you can see in all high level wrappers.

triptec commented 6 years ago

Yeah, I'm all for getting drop to work and having a high level api, but I'm also using the code currently and I had memory leaks, and right now I think I would have to unlink the node and add it to another document that gets dropped to not get the leak I think and that's not ideal.

dginev commented 6 years ago

I have been looking into this and there seem to be two very distinct paths forward to make Drops leak-free, and methods like unlink safe in Rust.

The first path is the one we discussed, where it is the author's responsibility to remember to free after an unlink, when the intent is to remove the node permanently. And to remember not to when moving to a different fragment.

The second path is the retroactively fit into Rust approach, where we redesign the crate to ensure you can never have thing such as two or more mutable references (or worse - owned vars) that point into the same C datastructure. Currently that is not only possible, but in a way intended, as the wrapper is tremendously lazy and doesn't guarantee any meaningful safety. For example, this is a passing test on my refactoring branch:

//! Enforce Rust ownership pragmatics for the underlying libxml2 objects

extern crate libxml;

use libxml::parser::Parser;

#[test]
fn ownership_guards() {
  // Setup
  let parser = Parser::default();
  let doc_result = parser.parse_file("tests/resources/file01.xml");
  assert!(doc_result.is_ok());
  let doc = doc_result.unwrap();
  let root = doc.get_root_element();

  let mut first_a = root.get_first_element_child().unwrap();
  let first_b = root.get_first_element_child().unwrap();

  assert_eq!(
    first_a.get_attribute("attribute"),
    Some(String::from("value"))
  );
  assert_eq!(
    first_b.get_attribute("attribute"),
    Some(String::from("value"))
  );

  first_a.set_attribute("attribute", "newa");

  assert_eq!(
    first_a.get_attribute("attribute"),
    Some(String::from("newa"))
  );
 // This currently PASSES - which is WRONG - we need 
 //1) compile error when first_b is assigned
 // and
 //2) it should never be possible that an immutable libxml variable changes value
  assert_eq!(
    first_b.get_attribute("attribute"),
    Some(String::from("newa"))
  );

}

In Rust terms, this is terrible. We can call root.get_first_element_child() as many times as we want, getting a new Node object each call, that wraps over the same C pointer. It is enough to have one of them mutable, and issue a change to change the underlying values in the other objects, as in the test.

I think if we want to have Rust-compliant safety, we need to ensure:

Only one owned Rust object per C pointer
In other words, no Node/Namespace/Attribute "getter" result can be an owned object, and we would need the mutable and immutable getter methods - e.g. the get_first_child and get_first_child_mut variants returning &Node and &mut Node.
The only exception here is Document, as creating a new Document always allocates a fresh C datastructures. So we end up with document ownership + component borrowing.
To enable this, we need a "memory arena" in each Document, to which we can assign the nodes, namespaces, attributes. My first idea is to do per-type HashMaps keyed by the pointer value via to_hashable.
This makes Drop somewhat trivial, as there will be a single Drop per Node, triggered as the Document is dropped, which can check if they are bound to a document/fragment, and free the memory otherwise
For freshly created Node objects that will be inserted in documents, the situation is similar - if they get assigned, the Drop will be triggered after the document Drop, and if they end up unassigned, we get one Drop as the node goes out of scope and the memory needs to be freed.

This will require a 0.2 release since it is a major API overhaul, but it makes a lot of questions easy to answer - and more importantly gets the crate very close to feeling Rust native. Opinions welcome, I will get more examples fleshed out soon.

dginev commented 6 years ago

Playing around with this idea, and indeed the "memory arena" that will own all Nodes can not reside inside the documents, because that would render the Document API rather useless - any getter method could be creating a new Rust Node, if it has to wrap over a previously unseen node coming from C, but to store that into the memory arena one would need to make the getter call over a mutable reference of the Document -- that means calling several getters will no longer be allowed, since they would be considered mutable calls into Document. Dead end.

Instead, I will try the approach I have in the global mod, with a single, global, mutable memory arena, which will host all components. Luckily the pointers themselves are globally unique as they are unique addresses in physical memory, but I need to try my hand at implementing this to see if it actually works well - or it has other pitfalls.

I still think the overall refactor is worth it, as getting the Rust guarantees automatically makes the rest of the crate much more idiomatic to right, and we will have standard answers for all further API questions.

triptec commented 6 years ago

Commented on the wrong issue before..

Have you taken a look at how sxd-document (https://github.com/shepmaster/sxd-document/blob/master/src/dom.rs), Kuchiki (https://github.com/kuchiki-rs/kuchiki/blob/master/src/tree.rs). I believe we would be able to take some inspiration from them.

But also I actually think we should do those things in it's own crate where we also make the ergonomics better, but use this crate to interface with libxml2 and keep this crate close to how libxml2 actually works. Thoughts on that?

dginev commented 6 years ago

If you want to make an XML crate that just focuses on the ergonomics of the high level API - go for it! That is a big part of the motivation behind most of the actual xml crates written in Rust.

As you mention, the main idea for this one is to just have a meaningful Rust wrapper over libxml2, and the ergonomics can always be added separately on top.

triptec commented 6 years ago

But is adding a "arena" and keep checks and balances on the c data structure seems to me to be more than a meaningful wrapper. But it seems we disagree on that point.

dginev commented 6 years ago

How so? My main concern at the moment is that the wrapper is not "good" Rust, as it allows for undefined/broken behavior that is impossible in safe Rust code - such as multiple mutable references to the same data structure, or silently changing values of immutable data.

Of course it's near-impossible to make a libxml2 wrapper thread safe, but this violates safety locally to a single thread too. So I would really like to fix that / make it impossible. And that would be a faithful Rust wrapper - it wouldn't add any extra functionality, but would ensure Rust-like safety over libxml2.

As we discussed in the bindgen issue, the wrapper was never meant to be a simple enumeration of the C functions, but instead a usable interface for a Rust author. I am surprised we disagree on this point.

triptec commented 6 years ago

Ok, I guess I’m worried about possible performance loss and that I could drop down to a enumeration of the c functions in that case if speed was of the essence. But you have convinced me that it’s worth a try at least to get it safe. Have you had time to try the arena approach?

dginev commented 6 years ago

Yeah, I tried it for a bit, and it's quite hard to do in a way that doesn't feel terrible...

The simple way to manage lifetimes is to keep things like Nodes stored in their Document, when they are bound. The problem is, to call any getter with that approach you need a mutable reference to the Document, which makes writing real world code impossibly annoying.
The complex way is to have a separate global memory arena that is mutable, e.g. via a lazy_static mutex, but borrowing from that is in turn crazy hard, not to mention I have to lie to the compiler that Node is Sync and Send compliant (which it of course isn't)...

So for now I'm taking a step back and thinking of some nice compromise, since changing to a completely unsuable API is a no-go.

dginev commented 6 years ago

After trying a range of approaches, I think the best way forward is to actually give up for now and shelve this entirely.

I will add some extra warnings and documentation for the current state of the crate as we exhibit all of:

No Rust safety guarantees for mutability, as well as ownership
No memory safety guarantees - it is possible to leak memory via e.g. careless use of unlink, among others.

But the price of avoiding these caveats is a major refactor that seems to make the API inconvenient to the point of being useless. The best trade-off I managed to implement was using a document-owned node model, but essentially all tree methods end up requiring &mut self invocations, which makes it very difficult to freely write libxml code. It seems a lot more realistic to make a super crate that ensures this level of safety, or to rely on a native Rust XML parser instead of the C libxml2

So I do not think we can auto-free on drop, and will revert back to making a separately named method that does both of unlink/unbind+free - maybe "destroy_node"?

triptec commented 6 years ago

@dginev can I find you in a irc channel or slack or something?

dginev commented 6 years ago

@triptec not a bad idea, I can jump on the public Rust IRC server and we meet in a #rust-libxml channel?

dginev commented 6 years ago

This seems to have worked nicely: https://client00.chat.mibbit.com/?server=irc.mozilla.org&channel=%23rust-libxml

triptec commented 6 years ago

I'm in both =)

triptec commented 6 years ago

I though you meant rust but I'm in rust-libxml now

KWARC / rust-libxml

Logic for drop #25