fsprojects / FSharpx.Collections

FSharpx.Collections is a collection of datastructures for use with F# and C#.
http://fsprojects.github.io/FSharpx.Collections/
Apache License 2.0
247 stars 78 forks source link

RRB Trees: Efficient Immutable Vectors #72

Open rmunn opened 7 years ago

rmunn commented 7 years ago

I propose to add an implementation of Bagwell and Rompf's RRB-Tree Vectors (Relaxed Radix-Balanced Trees) to FSharpx.Collections. They are similar to PersistentVector, but allow for efficient slicing and concatenation. Slicing an RRB vector is effectively constant-time since it takes O(log32 N) time. Concatenating two RRB Vectors together is also O(log32 N) but with a large constant multiplier, as much as 1024 in the worst case, so it should really be considered O(log N) for all practical purposes.

Clojure implementation of RRB vectors: https://github.com/clojure/core.rrb-vector Scala implementation: https://github.com/nicolasstucki/scala-rrb-vector

Papers about RRB vectors:

rmunn commented 7 years ago

I've been working on an implementation of RRB Trees (I call them RRBVectors in my implementation) in a private Bitbucket repo for a while now, and it's getting close to completion. I've started cleaning it up, removing duplicate code, and fleshing out the API to get it ready to add to the FSharpx.Collections. Once it's fully ready (which might take another couple months since I only have sporadic free time to work on it), I'll submit it as a PR. There are a few implementation details, though, that are worth discussing before the code is ready because they would require a few changes to PersistentVector:

  1. Since RRBVectors are a tweak of the PersistentVector implementation, I've implemented RRBVector nodes as inheriting from PersistentVector nodes. This will allow O(1) conversion of any PersistentVector to an RRBVector, but the "cost" is that the PersistentVector class needs to acquire some internal properties for its root, shift, tail and tailOffset properties so that the Persistent->RRB conversion code can access them. Without said access, the RRBVector code could only convert a PersistentVector in O(N) time, losing pretty much all benefit.

  2. As I've been developing the RRBVector implementation, I've kept it in the FSharpx.Collections.Experimental namespace. However, that required me to add an InternalsVisibleTo in FSharpx.Collections, making its internals visible to FSharpx.Collections.Experimental in order to have access to the PersistentVector's root, tail, shift and tailOffset properties. In the long run, it will be better to have RRBVector in the FSharpx.Collections namespace, so I've been doing extensive unit testing with FsCheck (and Expecto) to make sure it has no flaws. I am now turning up flaws at the rate of about one per eight-hour stress test (Expecto is great!), so it's close to ready, but not quite yet. However, there will need to be a decision about whether the RRBVector data structure can go into FSharpx.Collections directly (in which case no additional InternalsVisibleTo attribute will be needed) or if it will need to spend some time in FSharpx.Collections.Experimental first to get hammered on by real code. If the decision is to keep it in FSharpx.Collections.Experimental for a little while, then the additional InternalsVisibleTo attribute will need to be added in FSharpx.Collections.

rmunn commented 7 years ago

I want to clean up the code some more before I submit a PR, but anyone interested in seeing my first pass at an implementation can start looking at https://github.com/rmunn/FSharpx.Collections/tree/rrb-vector. A couple of concerns I need to discuss with other people before merging this in:

But I'm ready for other people to look through my code so far and review it. The main code is found in https://github.com/rmunn/FSharpx.Collections/blob/rrb-vector/src/FSharpx.Collections.Experimental/RRBVector.fs and the main tests are found in https://github.com/rmunn/FSharpx.Collections/blob/rrb-vector/tests/ExpectoRRB/RRBVectorExpectoTest.fs.

gsomix commented 6 years ago

@rmunn Is there any way I can help you? Would be great to merge it someday. :)

rmunn commented 6 years ago

@gsomix Pinging me was a good reminder. :-) But also, if you can do anything to get #87 merged that would be a big help. I need master to build before I can make any more progress on getting the RRB Vector merged.

forki commented 6 years ago

merged #87

rmunn commented 6 years ago

Thanks @forki! I'll see what I can do about getting the RRB vector code into a mergeable state now.

gsomix commented 6 years ago

Just a friendly reminder. ;)

jackfoxy commented 6 years ago

@rmunn are you still interested in submitting a PR to Collections.Experimental? I'm hoping to get a 2.0 release out in the next few weeks.

rmunn commented 6 years ago

@jackfoxy I am still interested, but I've had practically zero time to work on coding the past couple of months. I do have the code into a better state than it was back in March, but I think it needs more work before it's mergeable. OTOH, the right answer to "I have zero time" is probably to submit a PR in its current state, and let other people do any remaining work that's lacking. The two biggest issues with the code as I have it right now are:

1) There's almost no XML documentation, and I think putting that in would be a good idea before a release.

2) The Expecto tests I wrote require FParsec as a dependency: I wrote a custom DSL for expressing the shape of a given RRB tree succinctly, and it's proven quite difficult to refactor my tests to get rid of the FParsec dependency. OTOH, it's only a build-time dependency and there's no FParsec dependency in the library code itself, so perhaps adding an FParsec build-time dependency to FSharpx.Collections is acceptable?

If an FParsec build-time dependency is acceptable, then my code is already mergeable and I can get a PR in the next few days.

jackfoxy commented 6 years ago

@rmunn I think this is all fine for submission to Collections.Experimental, which is after all "experimental".

rmunn commented 6 years ago

@jackfoxy I've had a little bit more time to work on RRBVector, and discovered that my code isn't quite ready to release yet: my most recent change (implementing transients so that things like RRBVector.ofSeq will be efficient) still has a couple of bugs that I have to hammer down. (Which I would never have found if it weren't for FsCheck: one bug requires you to create a vector with at least 1057 items in it via RRBVector.ofSeq, then pop the list down to 1024 items or fewer, then remove an item from somewhere in the front of the list, at which point it throws an exception. I would never have found that buggy sequence of operations without FsCheck's model-based testing). I don't want to release this code until I know it won't lose people's data or throw exceptions on perfectly valid operations. I also don't want to delay the FSharpx.Collections release waiting on me, so here's what I plan to do:

Please don't hold up the FSharpx.Collections release on my account, since it might be another month or two before I truly have the code ready to go. (Without going into details of my personal life, free time is a vanishingly rare commodity for me right now, and my estimate of "a PR in the next few days" was wildly optimistic). But for the sake of @gsomix and anyone else waiting to see the code, I'll definitely get it up in its own repo and start releasing NuGet packages of the work-in-progress.

jackfoxy commented 6 years ago

@rmunn No worries. At a minimum I want to document all the main Collections data structures and fix know issues in the main Collection before releasing, and I haven't touched it in a couple of weeks.

realvictorprm commented 5 years ago

Friendly ping here :)

rmunn commented 5 years ago

I'm getting rather close to having something releasable. Right now I'm polishing up my tests for 100% code coverage, to make sure I don't have any lurking bugs in this rather complex piece of code. (I thought I had that licked last month, but just last week I found and fixed more bugs). I'll post another comment in this thread when it's ready for alpha testing.

rmunn commented 4 years ago

After far more bugfixing than I had anticipated, I've finally got the last bug (that I know of) licked, so I'm ready to make my RRBVector implementation public! To enable me to more easily respond quickly to bug reports and feature requests, I've created a separate GitHub project named Ficus and created a v0.0.1 NuGet package for my implementation. Anyone interested in alpha-testing my 0.0.1 release, please submit bug reports and/or suggestions for API improvements in the Ficus issue tracker.

Once I'm confident that the code is not only bug-free, but also has an API that's useful to people, I intend to submit it as a PR to FSharpx.Collections. My plan moving forward is to use Ficus as a sort of "staging" repo for my work on the RRBVector implementation, and submit any major improvements to FSharpx.Collections. Basically, SemVer patch releases will stay in the Ficus project, but any releases where I bump the SemVer minor version number (and certainly any releases where I bump the major version number) I'll also submit as a PR to FSharpx.Collections.