How to model tree shaped data with im-rs properties?

jessegrosjean commented 5 years ago

I would like to model some tree shaped data (to model filesystem hierarchy). I would like it to supposed properties of im-rs, mainly really fast clones that share data with previous versions.

Currently I have implemented this using im::Vector as a backing store... and then using usize indexes into that vector to refer to child nodes, parent node, etc. This is working, but I wonder if it's the best approach. I don't know much about immutable data structures so I figured I would ask an expert...

If you are implementing a structure for tree shaped data in rust would you build off im::Vector or is there another approach that you would suggest?

Thanks!

bodil commented 5 years ago

(I'm closing this issue because it isn't anything actionable for im, but feel free to continue the conversation.)

If you're implementing a new data structure more or less from the ground up on top of im::Vector, you might want to just use the mechanism it relies on directly, rather than trusting it to make good decisions about the shape of your data structure rather than Vector.

It's pretty simple: Vector is implemented as if it were a mutable data structure, except that it uses Rc<Node> for pointers to child nodes instead of Box<Node>. Because Rc knows how many referents it has at any given time, it knows whether it needs to clone its contents when you need to change them - this happens automatically through Rc::make_mut, which you'd call to get a mutable reference to the child node, if you have an Rc pointer to it.

And that's really all there is to it, that's how you turn a data structure into a persistent data structure in Rust. Of course, it helps if the data structure you're using is designed for persistence - it's not magically going to make a Vec into an efficient persistent data structure, but it would make it persistent, and you'd still have cheap instant clone, you'd just have very expensive updated. If your data structure is tree shaped, on the other hand, it's probably going to work out really well.

jessegrosjean commented 5 years ago

Thanks for your response.

I "think" I know how to do a good immutable implementation if I use an Rc<Node> per node in my tree structure, but that strait forward approach also means a new allocation per item.

I think your Vector implementation goes to a lot of effort to limit per item allocations and instead groups the Vector into chunks of continuous memory and shapes those chunks into a tree structure.

I was wondering if there was some approach that was designed for tree shaped data and also limited per item allocations... but since you didn't immediately response with "no you should do it this way" I think I will continue just building on Vector for now. It seems to be working pretty well for me.

Thanks for the library and for answering my questions.

bodil commented 5 years ago

I suppose it depends on your performance requirements - my first concern is that if you're layering another data structure on top of Vector, you might get a lot of unnecessary data duplication because Vector clones at the level of a chunk of 64 items where custom nodes could control the amount of data duplication themselves. But if you're more concerned about the time impact of extra allocations, then yeah, it might be a good backing store in your case.

jessegrosjean commented 5 years ago

Does Vector recover memory of removed items when no clones are taken? So for example if I insert 120 items. Then remove all but one item... am I left with two 64 elements chunks or only one?

Also do you have thoughts on using a piece table as an alternative implementation for an immutable vector? I'm not suggesting that you change what you have, I'm just interested in general thoughts. I think this implementation approach where piece's are structured into a tree could work really well for my common case of loading a big list and then making a few random location mutations.

Like:

https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation

But make it more generic, so instead of organizing text use to organize list of items just like Vector does now. You would have an append only list of items and then a tree of ranges pointing into those items. I "think" everything would be more efficient then the current approach for my use pattern... but:

You would be able to recover memory from removed items
If many out of order mutations then it would fragment and not be as efficient

bodil commented 5 years ago

Mostly, yes, as soon as a chunk in the RRB tree goes empty, it's deallocated, but there are always four head/tail chunks kept around for optimisation purposes once a vector grows beyond its first chunk.

I don't know, a piece table optimises for very different things than an RRB tree, I think it would require a very specific editing pattern to be efficient as something that looks like a vector, though you can certainly implement it the way you describe. From what you describe of your own usage pattern, honestly, it sounds like an RRB vector might be the more efficient choice, as it's very good at constructing new data sets and at random lookups, especially adjacent lookups, but it's hard to say without knowing the details if that's what you need.

bodil / im-rs

How to model tree shaped data with im-rs properties? #99