dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.16k stars 4.71k forks source link

Immutable*<T> classes do not perform well #68449

Closed mrpmorris closed 2 years ago

mrpmorris commented 2 years ago

Looking through the source for ImmutableArray<T> it seems it is more of a read-only array rather than a proper immutable class.

The difference being that read-only is designed to remain static and unchanged, whereas immutable is designed to be unchanged but with the ability to very quickly make a copy of the object with slight variations.

Immutable<T> array copies every element on change, which doesn't perform well at all.

A simple benchmark comparing ImmutableArray<T> to the Lst class in LanguageExt, and to a simple (incomplete) class I wrote that stores the elements in a tree shows the inefficiency of the class. This benchmark was based on adding 100_000 integers.

image

Could the framework use some established "Persistent data structure" algorithms instead please?

https://en.wikipedia.org/wiki/Persistent_data_structure

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.

Issue Details
Looking through the source for `ImmutableArray` it seems it is more of a read-only array rather than a proper immutable class. The difference being that read-only is designed to remain static and unchanged, whereas immutable is designed to be unchanged but with the ability to very quickly make a copy of the object with slight variations. `Immutable` array copies every element on change, which doesn't perform well at all. A simple benchmark comparing `ImmutableArray` to the `Lst` class in `LanguageExt`, and to a simple (incomplete) class I wrote that stores the elements in a tree shows the inefficiency of the class. ![image](https://user-images.githubusercontent.com/3111981/164947470-2b575b37-0157-4192-b3fd-54862b171cef.png) Could the framework use some established "Persistent data structure" algorithms instead please? https://en.wikipedia.org/wiki/Persistent_data_structure
Author: mrpmorris
Assignees: -
Labels: `area-System.Collections`, `tenet-performance`, `untriaged`
Milestone: -
huoyaoyuan commented 2 years ago

ImmutableArray<T> is more like a workaround of type system. It's a slim wrapper of array that disallows writing. It's performance goal is to be as fast as array for read operations.

whereas immutable is designed to be unchanged but with the ability to very quickly make a copy of the object with slight variations.

This would require fundamental change of data structure. ImmutableList<T> is more friendly for such usage.

mrpmorris commented 2 years ago

Yes, it would need to be written better. At the moment it is written how I would expect a ReadOnly, not an Immutable.

There are specific data structures / algorithms for immutable classes and .net isn't using them. I am asking for a fundamental change of the inner workings, not the interface.

333fred commented 2 years ago

whereas immutable is designed to be unchanged but with the ability to very quickly make a copy of the object with slight variations.

No, this is very specifically not what ImmutableArray is designed for. That type is designed for scenarios where data is built up once with a builder then frozen, never to be added to again. The immutable types were first prototyped by the Roslyn compiler, which uses this pattern all over the place to build up semantic info then share it with confidence across multiple threads. The performance of read operations is much, much more important for this type than the performance of add operations. If you care about add operations and are willing to sacrifice read, use a different data structure, such as ImmutableList, which has a different set of priorities.

mrpmorris commented 2 years ago

Your comment was interesting and informative, thank you.

My comment was about how immutable algorithms (aka persistent data structures) were designed, which the .net class is following in name only. Roslyn could have used a ReadOnly collection if no derivatives were required.

I will try out ImmutableList to see how the performance compares. Although it makes little sense if one Immutable* class satisfies industry required standards and another does not.

PS: In case you'd like to consider changing it for future readers, your opening sentence can be misread as being a bit confrontational and condescending.

svick commented 2 years ago

@mrpmorris

immutable algorithms (aka persistent data structures)

Those are not the same thing. ImmutableArray<T> is an immutable, but not persistent list. ImmutableList<T> is an immutable and persistent list. The naming might be confusing, but both data structures are useful in practice.

Roslyn could have used a ReadOnly collection if no derivatives were required.

The problem is that a ReadOnlyCollection<T> does not guarantee thread-safety. It means that I can't modify it, but someone else could, so I can't rely on it not changing. That's why Roslyn needed truly immutable (but not necessarily persistent) data structures for its high-performance multithreaded code.

333fred commented 2 years ago

Although it makes little sense if one Immutable* class satisfies industry required standards and another does not.

I'm honestly not certain where these standards are coming from. For example, the Wikipedia article you linked describes persistent data structures as being "effectively immutable", where immutable links to this article, that simply describes immutable objects as "an object whose state cannot be modified after it is created." That is exactly what ImmutableArray is: an array that cannot be modified after creation.

Roslyn could have used a ReadOnly collection if no derivatives were required.

As Svick mentioned, read only collections are not the same as immutable collections. A read only collection could be modified by another thread that has a mutable view, shearing state in the middle of a loop. An immutable collection guarantees this cannot happen, as there is no such thing as a mutable view.

333fred commented 2 years ago

I'm closing this out, as the ImmutableArray class has the performance characteristics that it was designed to have, trading O(n) add efficiency for O(1) read efficiency.

mrpmorris commented 2 years ago

Why close it?

There are algorithms that give almost O(1) for add + remove + change + delete.

333fred commented 2 years ago

Almost O(1) is not O(1). As I said, read is the most important attribute of ImmutableArray. Any reduction from O(1) to O(log n) or some other non-constant will have significant detrimental impacts to the entire ecosystem, as it will significantly hurt the performance of the compiler, and by extension, every IDE or piece of tooling that uses it.

mrpmorris commented 2 years ago

If it's purpose is to be built once and then only be read from, why does it have an Add(T) method?

The Add(T) method is there to create a derivative. It's a very common Immutable pattern, one that has very well established algorithms, but .net doesn't use that pattern.

If Roslyn needs a write-once read-many array then it should definitely have one, perhaps called ReadOnlyArray<T> or something like that. It shouldn't be confused with Persistent Data Structures such as all the other Immutable*<T> which do use the established immutable/persistent patterns.

image

The problem is that we have

  1. Multiple Immutable*<T> classes
  2. They are built on Immutable/Persistent Structure principles
  3. They have Add/Remove etc methods on them

Then we have ImmutableArray<T> which shares the same name format, has the same derivation methods, but does not implement those principles.

As a consequence, I've been using (and recommending use of) ImmutableArray<T> for a while now, but only found out recently it has very poor performance compared to ImmutableList etc. There is nothing in the docs to indicate this is not a class that implements high performance immutable principles, every reason to assume it should, and the only way to know any different is for it to occur to someone to benchmark ImmutableArray<T> against ImmutableList<T> for no real reason other than that you happened to be looking in the source code and spotted this class with the same name does not follow the same principles.

I would recommend marking the derivation methods as [Obsolete] with a message saying to use ImmutableList<T> instead.

tfenise commented 2 years ago

There is nothing in the docs to indicate this is not a class that implements high performance immutable principles

https://docs.microsoft.com/en-us/dotnet/api/system.collections.immutable.immutablearray-1

There are different scenarios best for ImmutableArray and others best for ImmutableList.

Reasons to use immutable array: Updating the data is rare or the number of elements is quite small (less than 16 items) You need to be able to iterate over the data in performance critical sections You have many instances of immutable collections and you can't afford keeping the data in trees

Reasons to use immutable list: Updating the data is common or the number of elements isn't expected to be small Updating the collection is more performance critical than iterating the contents

The following table summarizes the performance characteristics of ImmutableArray

mrpmorris commented 2 years ago

I stand corrected on that my incorrect observation regarding the docs. I was looking at the class descriptions (EDIT: For the non-generic class names) which gave the impression they do the same thing, I didn't see the nuances further down in the Generic Class pages those entries linked to

I believe all of my points stand though.

333fred commented 2 years ago

As I mentioned earlier @mrpmorris, I'm not certain where your definition of immutable data structures is coming from. All that is required to be an immutable data structure is that the data can never change. A ReadOnlyX does not provide the same guarantee, and it can't be used to replace an immutable data structure.

mrpmorris commented 2 years ago

It's further down the page = https://en.wikipedia.org/wiki/Persistent_data_structure#Applications_of_persistent_data_structures

I think the problem is that you are talking in a mix of .net implementation jargon and plain English

ReadOnly: Is read-only by the consumer but not the producer, so not strictly read-only (.net definition) Immutable: Cannot be changed (English definition)

By strict definition, immutable classes just need not have Add/Set methods etc and they'd be immutable.

However, like Scala / Clojure, .net has derivative methods in them (Add/Remove etc). Scala / Clojure implement well established algorithms for ensuring creating derivatives is performant (Persistent Data Structures).

So, when we see the word "Immutable" we correctly infer "Cannot be changed", but when we see these derivative methods added to an immutable class (Add / Remove / etc) it is absolutely right that the consumer should assume those methods are highly performant using well established algorithms.

Which brings us to this scenario. Roslyn needed a class that was an array that it could guarantee would not be altered.

1: You couldn't call it ReadOnlyArray because it wouldn't behave in the same way as other ReadOnly classes 2: You called it ImmutableArray because that made sense

But, the error was that this class was given derivation methods! Any programmer should reasonably expect those methods to be performant due to the fact that this is not only how things are done in other languages, but also because other classes in .net with the same prefixes in their names do also.

This class would be fine called ImmutableArray if you couldn't create derivations (e.g. a FrozenArray or ArraySnapshot) - making it clear its purpose is create-once + read-many.

Adding these derivative methods has really muddied the waters.

As I said, I only spotted the difference because I was searching through the source code. Due to all of these factors it didn't even occur to me to ask the question:

This single class with the same prefix in its name, in the same assembly, in the same namespace, with the same derivate creation methods: Is it the the only one that doesn't behave in the same way as the others?

If those derivative factory methods didn't exist or were obsolete then it would be 100% obvious that this is not like the other Immutable* classes, but is in fact a frozen view of an array that even the producer cannot alter (not a suggested solution).

As it is, it is very easy for people to use the wrong class and never realise it.

I don't know what the actual solution to this is by the way, All I know is that I've been falling into this trap myself for a few months now (at least) and it needs to be easier to not fail.

tfenise commented 2 years ago

The title of https://en.wikipedia.org/wiki/Persistent_data_structure is "Persistent data structure", not "Immutable data structure".

ImmutableArray.Add method being O(n) is not terribly bad. It is still acceptable or even preferred in some cases for example when the array contains few elements. Even insertion sort has its place.

The naming of Immutable* or the inclusion of derivative methods might be confusing, but the names consist of only a few words and cannot convey the details anyway. That is what documentations for. Besides, the name ImmutableArray already hints that it may be backed by a plain .net array, and the coexistence of both ImmutableArray and ImmutableList already hints that one may need to do some research (in particular read the documentations) before deciding whether to use ImmutableArray or ImmutableList. In addition, ImmutableArray<T> is a struct, which also hints that this type may be somehow different.

You could propose a change to make it less confusing, but obsoleting ImmutableArray.Add is not a solution. ImmutableArray.Add is fine, as I said.

333fred commented 2 years ago

It's further down the page = https://en.wikipedia.org/wiki/Persistent_data_structure#Applications_of_persistent_data_structures

I don't see how this section specifies that immutability requires persistent data structures. It talks about how they can be used for immutable data structures, but doesn't mention anything about the requirements.

it is absolutely right that the consumer should assume those methods are highly performant using well established algorithms.

Performance cuts both ways, right? I might expect that ImmutableList's read performance is highly performant, but it isn't. It's O(log n). That isn't acceptable for my use case: I very rarely add, but I read all the time. Having a data structure that is optimized for that scenario means that it's perfectly fine to have an O(n) add method as long as I get O(1) read performance. As long as the performance of these methods is documented (which they are), it seems perfectly reasonable to me. Every data structure has its tradeoffs, and when selecting a data structure for an application it's important to fully research the performance characteristics of all potential structures and understand them before committing. For example, for the required members feature I very specifically looked for a data structure that uses persistent data, as I wanted to build up immutable dictionaries across type hierarchies, and memory space efficiency is important to me in this scenario.

mrpmorris commented 2 years ago

I've explained. It's the combination of immutable + methods that create variations that makes them Persistent Data Structures.

You seem to be missing the point that it is too easy to accidentally choose the wrong class for the various reasons I've outlined. Can you concentrate on that?

How can it be made obvious to a coder that the array implementation is slow for creating variations but fast for reads, as opposed to all the other immutable classes, which are designed to have slower reads in exchange for higher performance when creating variations?

tfenise commented 2 years ago

Unfortunately I don't think the situation, if it is really so confusing as you described, can be improved. The documentations are already clear. A breaking change like renaming or removing methods that create variations is probably unacceptable. Even putting a warning on the methods would be too much.

theodorzoulias commented 2 years ago

@mrpmorris given that the ImmutableList<T> and the ImmutableArray<T> are both useful collections with highly desirable characteristics for specific scenarios, and that their name does not reflect these characteristics and allows developers to fall unwillingly in the pit of failure by not choosing the right collection for the job, like you did unfortunately, what name would you give to these two collections if it was in your hand to choose their name?

333fred commented 2 years ago

You seem to be missing the point that it is too easy to accidentally choose the wrong class for the various reasons I've outlined. Can you concentrate on that?

I'm not sure what you expect anyone on the runtime to change though? The performance characteristics are documented, and the names of the types/methods available are certainly not going to change (irrespective of whether or not I think they are just fine as they are). To me, the difference seems rather obvious: an ImmutableArray will behave like an array, but immutable. Arrays have well-known performance characteristics, where indexing is O(1) and resizing is O(n). Meanwhile, ImmutableList is a list, meaning that I actually have to look at the docs to determine what its performance characteristics are, as I can't point to any specific "list" as the de-facto list implementation that ImmutableList should naturally follow (is it backed by an array, or a linked list, or some other data structure like an AVL tree?). I can certainly accept that perhaps I'm too familiar with these types to see that difference, but I also don't see what anyone would be expected to do about it.

333fred commented 2 years ago

I've explained. It's the combination of immutable + methods that create variations that makes them Persistent Data Structures.

I don't understand how the presence of an Add method required it to be a persistent data structure though. As I've said, the ImmutableArray type is designed with a specific set of perf goals. Just because Add is expected to be rarely used doesn't it mean it will never be used, and if functionality is useful, it should be included.

mrpmorris commented 2 years ago

@theodorzoulias I don't think it matters, nothing is going to change.

@333fred The presence of Add doesn't "require" it to be a persistent data structure, it's presence on an immutable collection makes it one.

image

If you can't think of how to improve this situation that others like myself will fall into, should you leave the ticket open for some discussion? Or maybe it requires a new ticket that focuses on the inference rather than the implementation?

Or perhaps I should just accept things the way they are and let other people fall into the same trap?

tfenise commented 2 years ago

As I see, the definition of a persistent data structure is just a data structure that always preserves the previous version when modified. It does not make performance guarantees like a modification should be faster than O(n) anyway.

mrpmorris commented 2 years ago

Yes, that is exactly it, and there are well-established patterns for doing so in a very efficient way. But that has become a moot point now anyway because I've discovered this is a case where variation creation time has been deliberately sacrificed to improve read performance.

theodorzoulias commented 2 years ago

I don't think it matters, nothing is going to change.

@mrpmorris surely the ship has sailed regarding the names, the functionality and the performance characteristics of these two collections. But you might be able to help at preventing similar accidents from happening again in the future.

mrpmorris commented 2 years ago

I don't think the ship has sailed. Major versions can introduce breaking changes, especially if the only breakage is a compiler warning.

Release 7: Mark as obsolete Release 8 or later: Remove (if possible)

I think what would be an intuitive idea would be to name the methods CloneAndAdd etc. The reasons being

  1. The fact that it's not the same name as the other immutable collections makes the user ask why it is named differently and prompts them to find out.
  2. The prefix CloneAnd hopefully makes it obvious it is performing a full copy.

In fact, I'd actually go further. I'd add the new methods to ImmutableArray.Builder instead, because

  1. With the absence of Add etc the user cannot possibly assume it is doing what the other classes are doing.
  2. Being on the Builder class instead makes it more obvious that you are creating something entirely new.

In the meantime, I'll try to remember to submit a PR for the docs so the ImmutableList and ImmutableArray (non generic) class descriptions make it obvious that the generic classes they create are for different purposes. I myself was looking at the docs for the non generic classes.

Should this be reopened now for consideration, or would it be better to open a new ticket with different details and a link to this discussion?

huoyaoyuan commented 2 years ago

Release 7: Mark as obsolete Release 8 or later: Remove

For binary compatibility to consume libraries from older version, .NET BCL never removed members that weren't marked as experimental.

mrpmorris commented 2 years ago

Then 7 only (edited comment), thanks.

tfenise commented 2 years ago

Those developer who know to use ImmutableArray correctly would have to change their code for the obsoleting.

huoyaoyuan commented 2 years ago

They do not even have a change to modify the code. The roslyn compiler uses ImmutableArray<T>.Add a lot of times, and targets the oldest supported framework for better availability. Any interface of existing types can't change, because the analyzers loaded by the compilers etc can and should target oldest framework too.

The users are familiar to the API shape and performance implications of ImmutableArray. They are not designed for any article on wikipedia or some textbook.

mrpmorris commented 2 years ago

@tfenise Yes, they change their code or add an ignore. @huoyaoyuan It's not about matching Wikipedia. Please read the discussion thoroughly, it's about being far too easy to fall into failure instead of success.

tfenise commented 2 years ago

Then one needs to figure out whether those who use it correctly are the majority, or those who fall into the trap are the majority.

huoyaoyuan commented 2 years ago

I do not see it's so easy to misuse it. There is a very common example for the pattern used by ImmutableArray: string. All the string modification methods are costful and returns a new string. In fact, the data structure and performance characteristics are nearly the same between them. The API shape that returns a new instance on modification implies that pattern.

huoyaoyuan commented 2 years ago

I think the problem is that you are talking in a mix of .net implementation jargon and plain English

Naming jargons do exist, and it's practically impossible to change them. For example, const in C++ is readonly in C#, and const in C# is constexpr in C++.

huoyaoyuan commented 2 years ago

Some of my understanding of the names:

ImmutableArray: No one can modify the content of it. In contrast, the IReadOnlyXXX interfaces only disallows modification by current consumer, but it may be modified elsewhere. The Array suffix implies it behaves (and performs) like an array.

Add method with a returning instance: The content of the returned instance with be the result of addition. The best name would be Append, to match LINQ.

I would really be OK if Add renamed to Append, but generally there weren't "rename a member to a better name" happen in .NET. It's better to ask the API review team for their considerations.

mrpmorris commented 2 years ago

There are more methods to consider, Remove, Set, etc.

It would be best to mark them as obsolete and add those to the Builder class instead, to make it 100% clear you are building a new one and not using a well established Persistent Data Structure approach that is easily inferred by the fact the class interfaces match that pattern, and the other classes implement that pattern.

Isn't this the correct place to propose such a thing? If so, can you tell me where I should be posting instead?

theodorzoulias commented 2 years ago

I think what would be an intuitive idea would be to name the methods CloneAndAdd etc.

@mrpmorris changing the name of the method Add to CloneAndAdd could help at addressing a different type of error. Some developers might ignore the return value of Add because they think that this method mutates the collection, which is a stupid mistake to make (how is it possible to mutate an immutable collection?), but nevertheless it could easily happen in conditions of stress, sleep/caffeine deprivation etc. But this change wouldn't address the error of assuming that the ImmutableArray<T>.CloneAndAdd method performs better that O(N), which is the error that we are discussing here.

Btw I would like to admit that I sympathize with your frustration. Myself I was shocked when I realized that using the get indexer of the ImmutableList<T> is not an O(1) operation!

mrpmorris commented 2 years ago

@theodorzoulias CloneAndAdd makes it clear the item is being fully cloned. Perhaps "CopyAndAdd" would be clearer? I think it would.

tfenise commented 2 years ago

Naming ImmutableArray.Add and ImmutableList.Add the same has an advantage, though. It makes switching from ImmutableList to ImmutableArray, or vice versa, easier. If one has been using ImmutableList but then figures out that ImmutableArray would be better in their situation, they only need to change ImmutableList to ImmutableArray without needing to change Add to CloneAndAdd. One thing I don't like is that Array.Length and List<T>.Count have different names.

mrpmorris commented 2 years ago

The point is, there should never be a need to switch because it should be so obvious from the start that ImmutableArray should nearly always be avoided when you need derivates of the value.

How quick it is to change once you realise you've got the wrong one isn't important.

tfenise commented 2 years ago

ImmutableArray should nearly always be avoided when you need derivates of the value.

I don't think so. It is fine when the collection is expected to be very small. This is not a rare corner case.

theodorzoulias commented 2 years ago

there should never be a need to switch because it should be so obvious from the start that ImmutableArray should nearly always be avoided when you need derivates of the value.

@mrpmorris here is a counterexample , where I started with an ImmutableList<T> for storing the subscribers of a pub-sub system, and later I switched to an ImmutableArray<T>. I am expecting that the number of subscribers will be small (typically only one), adding/removing subscribers will be rare, and enumerating the subscribers will be frequent. The ImmutableArray<T> is clearly preferable in this scenario.

mrpmorris commented 2 years ago

@theodorzoulias Yes, it is preferable. I am not disputing that. I am now utterly convinced there is a use for ImmutbleArray in its current form.