Kotlin / kotlinx.collections.immutable

Immutable persistent collections for Kotlin

Apache License 2.0

1.15k stars 59 forks source link

Builders returning Mutable variants is questionable #4

Open andrewoma opened 8 years ago

andrewoma commented 8 years ago

e.g. For ImmutableLists, there's no implementation that I know of that supports random insertion of elements. By exposing such methods in builders it will lull users into believing the operations do not involve copying the entire list.

I'm not sure Builders are really required at all if the implementations are chosen carefully:

Scala's Vector has efficient append/prepend
Scala's Map/Set implementations also support efficient modifications

ilya-g commented 8 years ago

I didn't follow, how comes that choosing the implementations carefully could render builders unnecessary?

Did you mean that if we choose particular implementations we wouldn't be able to make some builder operations memory efficient?

GlenKPeterson commented 8 years ago

@andrewoma I always assumed that Scala's Vector supported random insertion, thinking it was based on the Phil Bagwell/Tiark Rompf paper: https://infoscience.epfl.ch/record/169879/files/RMTrees.pdf

I've been working on my own implementation of this paper. It is not ready for prime-time - there is a TO-DO list in the comments at the beginning. But I believe it matches the paper and what's implemented seems to mostly work (~90% test coverage). Specifically I just implemented random inserts: https://github.com/GlenKPeterson/UncleJim/blob/2016-05-22_RRB-Tree/src/main/java/org/organicdesign/fp/experimental/RrbTree1.java

@ilya-g There will always be some situations where Mutable collections or Unmodifiable collections with Builders are more efficient, but if the efficiency gain is small enough, most programmers can make do with just Persistent collections. To me, the success of Clojure proves this.

Instead of choosing between builders and mutable/immutable collections, I'd rather see programmers ask, "Which collection (Map/Set/List/SortedMap/SortedSet) has the most appropriate Big O characteristics for what I'm doing?" Answering that question right yields a much bigger performance gain than choosing between Mutable, Unmodifiable with a Builder, and Immutable/Persistent.

For the 1 in 100 case where the slight speed increase of a mutable collection would matter, you can tackle that later as a separate problem. Running meaningful performance tests and simulating realistic loads takes time. I'd like to save that work until after the system has been running for a while.

P.S. Looping is a similar unnecessary distraction. Not thinking about loops yields similar benefits to not-thinking about mutability: https://github.com/GlenKPeterson/UncleJim/wiki/Looping-vs.-Functional-Transformations

andrewoma commented 8 years ago

Sorry @ilya-g, I missed your questions.

I didn't follow, how comes that choosing the implementations carefully could render builders unnecessary?

If using equivalents to Scala collections, builders do not add any value for Maps, Sets, SortedSet and SortedMaps that I'm aware of - all the data structures support incremental operations.

Likewise, Vectors (Lists) support efficient appending and prepending.

There may however be some efficiency to be gained by not "boxing" the underlying structures while building.

Did you mean that if we choose particular implementations we wouldn't be able to make some builder operations memory efficient?

Yes, the implementations mentioned above already efficiently handle incremental building. Optimising small collections would be an excellent optimisation, but this can be done via the factory functions.

I guess my main point is that I think it's dubious to expose the Mutable interfaces as builder interfaces as they allow operations other than add. It gives the user the false impression that operations other than add will be optimised when in all likelihood they can't be.

andrewoma commented 8 years ago

@GlenKPeterson I'm pretty sure Scala's Vector doesn't support random insertion. From memory, it uses fixed arrays so it can calculate the exact location of an element without having to search.

ilya-g commented 8 years ago

@andrewoma,

If using equivalents to Scala collections, builders do not add any value for Maps, Sets, SortedSet and SortedMaps that I'm aware of - all the data structures support incremental operations.

The proposed immutable collections also provide incremental add, put, set, etc operations. Builders expose all these operations as operations of mutable interfaces, thus providing the way to use immutable collections as following: https://github.com/mono/roslyn/blob/f4ef9144d8357f7af6a72992333ef162b5f1d552/src/VisualStudio/Core/Def/Implementation/Library/ObjectBrowser/AbstractObjectBrowserLibraryManager_Search.cs#L49, and it indeed does add some value.

There may however be some efficiency to be gained by not "boxing" the underlying structures while building.

Yes, that's the property we're aiming to achieve.

It gives the user the false impression that operations other than add will be optimised when in all likelihood they can't be.

For example, java.util.ArrayList has add operation with O(1) complexity. Does it give the false impression about remove operation's O(n) complexity to the user?

ilya-g commented 8 years ago

There will always be some situations where Mutable collections or Unmodifiable collections with Builders are more efficient, but if the efficiency gain is small enough, most programmers can make do with just Persistent collections.

@GlenKPeterson, did I get you right, that you're convincing us not to provide unmodifiable collections with builders? — because we have no plans to do so, even if it's possible to create such an implementation for the proposed interfaces. Instead we're going to provide persistent collections with builders.

andrewoma commented 8 years ago

For example, java.util.ArrayList has add operation with O(1) complexity. Does it give the false impression about remove operation's O(n) complexity to the user?

@ilya-g Fair point. I associate Builder with adding but I guess it's fine to use it to mean "mutate at your own risk"

GlenKPeterson commented 8 years ago

@andrewoma My personal experiments with optimizing for small collections didn't work. Others have come to the same conclusions - (thanks @pniederw for the link!). Others may succeed where we have failed, but if there were low hanging fruit there, I think Rich Hickey would have found it before any of us.

@ilya-g I don't like the idea of builders for Persistant/Immutable collections. A PersistentVector or RRB-Tree's main strength is that it's easy to reason about when you use it. If you start adding builders and other moving parts, you make it harder to use. If it's 5x faster for being 50% harder to use, that's worthwhile. But if a builder is only 20% faster, that's not sufficient justification for the added complexity.

Some Context:

Builders for Unmodifiable collections are a great idea! Kotlin and Guava both provide this already.
For maximum speed in a single thread, an ArrayList (or if the length is fixed, a primitive array) is still unbeaten.
To share something mutable between threads without a messaging system, Java's ConcurrentQueue and ConcurrentLinkedDeque are the preferred data structures. The articles I've read about the spin-locks and other optimizations sound extremely impressive.

I'm against adding a builder purely as an optimization for any of those purposes.

As an optimization for building an immutable object, a builder must quickly/efficiently produce that immutable object. Rich Hickey already provided a Mutable version of his PersistentVector and it's about 5x faster! That's in UncleJim too, but I didn't expose it to the public (yet) because I think we can improve the tail or focus enough that a builder would be less than 20% faster than the persistent data structure. Here is the theory behind that statement: https://github.com/GlenKPeterson/UncleJim/wiki/Focus-Optimization-in-an-RRB-Tree-(or-tail-optimization-in-a-PersistentVector)

ilya-g commented 8 years ago

If it's 5x faster for being 50% harder to use

@GlenKPeterson, what do you expect would be harder with builders than without them?

I'm against adding a builder purely as an optimization for any of those purposes.

We'd like to have builders for two purposes (and neither of them is in your list):

to utilize existing API taking mutable collections;
if we succeed in implementing freezable data structures for immutable collections, builders would provide the way to batch modification operations without freezing the collection.

GlenKPeterson commented 8 years ago

Freezing

"if we succeed in implementing freezable data structures for immutable collections, builders would provide the way to batch modification operations without freezing the collection."

Yes. And here's an example to prove it: https://github.com/GlenKPeterson/UncleJim/commit/3e6a2190e7ade35777fb69417b34bdfc9a6b0071#diff-769a18dc57177ada01ccb15d1b672f2bL11

It was a roughly 5x performance improvement at all sizes of vector: https://github.com/GlenKPeterson/UncleJim/issues/4#issuecomment-220801534

But I'm waiting to expose the mutable vector because I think I can beat this performance gap with the RRB-Tree (or with a similar modification to the PersistentVector - link in previous comment). Maybe the Scala RRB-Tree already beats it? I have no idea. Anyway, that's weeks or a few months away at my current rate of progress.

I don't understand the PersistentHashMap well enough (especially when there are hashcode collisions) to know if a mutable builder could be faster. That would be an interesting and potentially fruitful area of research! I don't see how the PersistentSortedMap's binary tree structure could benefit from a builder. The Sets are implemented using the Maps (at least in Clojure and Java). So really, only the Vector has a known-speedy builder implementation at this point (known to me anyway). It's a popular collection, but still something to keep in mind.

Mutators are More to Think About

The above example cuts both ways. It proves your point that mutable builders for Immutable collections can be a lot faster (at least with the PersistentVector). But it also proves my point that it's more to think about because I forgot to make this optimization for probably 2 years before someone brought it to my attention! I could not have made such an error if the PersistentVector was almost as fast as the Mutable one. That's why I'm pursuing the idea that this performance gap could be closed, making the builder unnecessary.

No-one is taking java.util.ArrayList away, just providing alternatives for the 99% of the time when it's more important to avoid bugs and get your job done than it is to optimize the heck out of some small part of your code.

When I program in Clojure, I don't worry about builders and freezing collections. I don't worry about what's mutable and what's not. I just use collections with the assumption that everything is immutable. To me, that feels simpler. It lets me focus on the problem I'm trying to solve, rather than on how to use the collections.

Just yesterday, I used Kotlin's listOf(), then later wanted to just add one more thing to the list on the fly in certain situations. I did not want to cause a copy of the entire collection just to add an item. Nor did I want to expose a mutable version of that list to the rest of my program. Hence, I used vec() from UncleJim/Clojure instead and wrote: val modifiedVec = myVec.append(newItem). I'm guessing that the performance cost for using an Immutable/Persistent vector is 15% worse than using a Mutable one and 2x faster than using an Unmodifiable one (because of the copy).

Existing API

utilize existing API taking mutable collections

Have you considered making sub-interfaces of Java's Collections API's and deprecating the mutation methods? I'm quite happy with how that worked out (the blue boxes in this diagram): https://github.com/GlenKPeterson/UncleJim/blob/master/inheritanceHierarchy.pdf

I mean, I'd rather see OpenJDK provide these interfaces: http://programmers.stackexchange.com/questions/221762/why-doesnt-java-8-include-immutable-collections But until then, sub-interfaces with deprecated methods seems the best we can do.

ilya-g commented 8 years ago

Before we continue the discussion may I ensure we have a common understanding of what the proposed builders are and what they are not?

Builders are not mutators. They do not provide a way to mutate an immutable collection (otherwise it hardly can be called immutable). An immutable collection keeps its content unchanged under any circumstances.

Instead builders are an approach to build a new immutable collection instance from an existing one. Let me illustrate it with an example:

// you have an immutable collection
val list = immutableListOf(1, 2, 3)
// you can get modified collection by invoking 
// operations directly on immutable collection
var list2 = list.add(4)
// and again
list2 = list2.add(5)

// or you can use a builder
val listBuilder = list.builder()
// now listBuilder is a MutableList with the contents of list
listBuilder.add(4)
listBuilder.add(5)
val list3 = listBuilder.build()
// list3 is a new immutable collection with 5 elements and
// list is still unchanged holding its three elements

Given the above, are those arguments against builders still valid?

GlenKPeterson commented 8 years ago

I read your post twice and tried to concentrate on it each time. Yes. As far as I can tell, this is exactly what I pictured when I made my arguments.

Rich Hickey's solution was to do just what you describe because the 5x performance difference was worth exposing a mutable collection. Seems reasonable.

Maybe you can envision a new kind of Unmodifiable (not Immutable) collection that you can convert back and forth to a Mutable builder at will, that only copies the changed portions to the builder and yields the performance of the ArrayList (or better) when mutable, yet all Unmodifiable references remain safely unchanged. If using functional transforms and the mutable version is really fast, this could all but make Persistent/Immutable collections obsolete.

sigh

I guess I became a little over-excited about my idea (you know what I mean?) for making the Immutable collection fast enough that the builder isn't necessary. I haven't proven it yet, so it's all just talk at this point. I'll write back when I have an implementation that proves or disproves the theory.

I still think it's worth timing the difference between your builder and persistent collection implementations before deciding that the builder is worthwhile. Advances in the PersistentVector or RRB-Tree could close that gap. Maybe you'll be the first one to implement such an advance? But today, the gap is still there.

One other idea to throw out there is that the persistent List, HashMap, and HashSet rely on copying all the items from one array into a bigger array for every added item. In Java, the JVM takes time to zero-out the new array before overwriting all those zeros with the items from the old array, plus one new item. Looking at sun.misc.Unsafe, I think it may be possible to implement a <T> T[] copyArrayWithItemInserted(T[] oldArray, T newItem, int insertIndex) that does not waste time writing zeros to the new array. I think the potential is there for a 45% performance improvement in vector appends if an Unsafe version turns out to be almost twice as fast.

Generic arrays are just nasty to work with. Unsafe is a beast. The two together are going to be a challenge. That's probably the very last thing I would work with, but it's another potential route to The Fastest Persistent Collections on the JVM. I'd love nothing more than if someone (not me!) did this! A native implementation as part of java.util.Arrays would be even better! I think this bug is for that: https://bugs.openjdk.java.net/browse/JDK-8146828

ilya-g commented 8 years ago

I still think it's worth timing the difference between your builder and persistent collection implementations before deciding that the builder is worthwhile.

Why do you think builders and persistent collections are mutually exclusive options?

Currently builders are implemented on top of the persistent immutable collections (see the proposed prototypes). They do not add value in terms of performance, but they do add some value in terms of convenience, so that you can use builder where mutable collection is expected.

We're aiming to provide better performance for builders but without sacrificing persistence and immutability of the collections.

GlenKPeterson commented 8 years ago

use a builder where mutable collection is expected.

That link was really critical too. Thanks for taking the time to explain this to me politely. I'm not sure I would have understood it without our discussion, simply because I would not have expected it.

Hiding an immutable collection inside a mutable java.util.List wrapper is something I had never considered - an interesting idea. My (overly?) simple line of reasoning was: Immutable = slow and safe; Mutable = fast and dangerous. Therefore, this Mutable wrapper for an Immutable collection seems like it would be both slow and dangerous - the worst of both worlds.

I'm imagining you have a use case where:

You have an immutable collection
You want to share a mutable version
You don't want to see the updates made to the mutable version
You can't afford to make a mutable copy of your collection
It's OK for appends to be slow in the mutable version

The nearest situation I can imagine, I would either use a Mutable collection (because I'd want everyone to see all updates), or I'd make a mutable copy and pass that around (I'd rather have the one-time cost of the deep copy than have all mutations take place at Immutable speed).

Do you have a real-world example of when this would be useful? I think that's what I'm missing.

ilya-g commented 8 years ago

I've had a look again on UncleJim collections hierarchy recently, and now I can conclude that builders and transient collections have very similar use cases. The main difference between them is that builders unlike transients do not pretend they are immutable (ie. do not implement ImmutableCollection, but rather implement mutable one). This makes them hard to misuse and easy to reason about.

Interesting that I didn't see such hierarchy (Transient implementing Immutable) neither in the original clojure collections, nor in their another typed rewrite pure4j

GlenKPeterson commented 8 years ago

Yes, transient collections are essentially builders. You can prevent some errors by making the builder a type which is incompatible with the type of collection that it builds.

Mostly because of this conversation, UncleJim/Paguro version 2.0.7 exposes the mutable versions of the 3 immutable collections. They are helpful if you want to make your own hybrid collections or transforms based on Paguro. If you really want to make a loop and put items into a collection one at a time, builders/mutables would be good for that, but I wrote Paguro so that I don't have to write that kind of code.

// ImList:
MutableList mutable();

// ImUnsortedMap
MutableUnsortedMap mutable();

// ImUnsortedSet
MutableUnsortedSet mutable();

The mutable interfaces all have an immutable() method to send them back.

When I timed the pCollections vector, Paguro was about 1.4 to 7 times faster (faster as it grows). The mutable vector was an additional 5 times faster than that.

GlenKPeterson commented 6 years ago

Updates:

The TransientVector provides a 5-10x performance improvement over the PersistentVector, which makes building it roughly equal to java.util.ArrayList. That much of a performance gain is too significant to ignore.
My idea about tail optimization did not pan out.
Most of the JVM Persistent/Immutable collections I've looked at seem to use copy-on-write with Arrays, so many things would potentially become faster if copyArrayWithItemInserted() and copyArrayWithItemRemoved() were sped up. I implementd copyArrayWithItemInserted() using Unsafe as I suggested in a previous post, but no mater what I did, it was significantly slower. If it would even make a diffreence, I think such a method would have to be implemented natively, as part of the JVM or java.lang.System.
What I did before (Transient implementing Immutable) was evil, bad, and wrong. I've since made "Base" interfaces that the Mutable and Immutable collection interfaces both extend.
I've renamed UncleJim to Paguro and renamed a bunch of interfaces so that they don't conflict with Kotlin's stdlib intrerfaces in Version 3.1 which is only alpha because I'm deciding how to best tell people to convert their legacy code - my usual technique of a bash script wrecks havoc on any Kotlin code in the project.
Paguro's RRB Tree implementation is complete. It turns out to largely "convert back and forth to a Mutable builder at will" in a way that leaves any Immutable versions safely Immutable. I didn't know that was possible and it happened by accident when I was trying to keep things simple.
@andrewoma Paguro's RRB Tree supports random insertion of elements and both it's mutable and immutable versions are faster than java.util.ArrayList for large lists. This was sort of a happy accident of how I implemented it. I think most people who need this functionality would prefer to use a Map because a Map can have gaps between indices so you don't have to worry if you've exceeded the size of the List you've built so far. So I think the essence of your point still stands - if you want that functionality, a Map is probably a better choice of collection. But it's also a little undermined by a surprisingly efficient implementation in the corner case where the Vector would work too.

I appreciate that this discussion so politely and gently showed me the error of my ways. You were right and I was wrong on several counts.

If you're receptive, I might experiment with making my RRB Tree replace PList in your project. It's Apache-licensed, works, and log base 32.