dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.66k forks source link

Add List<T>.MoveToArray() method releasing ownership of inner buffer #27789

Open qbit86 opened 5 years ago

qbit86 commented 5 years ago

It would be convenient to have an efficient “array builder” for scenarios which require no redundant allocations and copies. Possible implementation is:

/// <summary>
/// Extracts the internal array and replaces it with a zero length array.
/// </summary>
/// <returns>An <see cref="T:T[]"/> that is <see cref="Capacity"/> in length.</returns>
/// <remarks>
/// This array is <see cref="Capacity"/> in length,
/// so the caller should save actual <see cref="Count"/> before calling this method.
/// </remarks>
public T[] MoveToArray()
{
    T[] result = _items;
    _size = 0;
    _items = s_emptyArray;
    _version++;
    return result;
}

Usage:

int count = list.Count;
var array = list.MoveToArray();
Debug.Assert(count <= array.Length);
Debug.Assert(list.Count == 0);

It is already implemented in Misnomer.Rist, but this API looks quite useful to be included in standard BCL.

danmoseley commented 5 years ago

Could you give examples of when this is useful? Ie when you need to progressively create an array, you don't know the length ahead of time, and the expanding behavior of List (doubling) is good enough? I can see how it's potential useful I just can't think of when I've needed this.

qbit86 commented 5 years ago

@danmosemsft

when you need to progressively create an array, you don't know the length ahead of time, and the expanding behavior of List (doubling) is good enough

It's an intersection of scenarios when you do need List<T> and need to interact with APIs that consume arrays or their segments/spans. I believe a lot of current List<T>.ToArray() usages can immediately benefit from replacing with List<T>.MoveToArray() (no allocating or copying).

var localList = new List<string>();
// Populating, filtering, sorting, grouping...
localList.Add("Camille");
localList.Add("Annie");
localList.Add("Sara");
localList.Add("Katrin");
localList.Add("Kari");

string[] array = localList.ToArray(); // Want to move, not to copy.
string grouping = string.Join(", ", array, 1, 3);
Console.WriteLine(grouping); // Annie, Sara, Katrin
drieseng commented 5 years ago

It would - for example - save quite some allocations in Microsoft.AspNetCore.Routing. Here are a few examples:

It would be great if this method were also be added to IList\.

qbit86 commented 5 years ago

@drieseng

It would be great if this method were also be added to IList<T>.

It's order of magnitude harder since it's breaking change. Such extending the scope of this issue probably requires separate thread for discussion.

Clockwork-Muse commented 5 years ago

Hrm.

The problem, I think, is that it's unusual to consume a list this way. If your API takes a List (as with an IEnumerable or similar), most callers are going to expect that their list stays the same

@drieseng 's examples are all instances of using List as temporary filtering storage management (one wonders if Linq would work for some of that, but there's some extra overhead there, and wouldn't work in all cases). In other words, you might as well have an ArrayBuilder type (for some reason I thought we had one....?), over fitting this onto list.

However, at the point where you'd find something like this useful... would it be better/more useful long term to have a MemoryBuilder/SpanBuilder type?

svick commented 5 years ago

@Clockwork-Muse

you might as well have an ArrayBuilder type (for some reason I thought we had one....?)

CoreFx does have one, but it's internal.

However, at the point where you'd find something like this useful... would it be better/more useful long term to have a MemoryBuilder/SpanBuilder type?

I would definitely prefer something that returned Memory<T> because I think the proposed MoveToArray is too easy to misuse: there is no indication that the returned array is only partially filled and that to use it correctly, you have to retrieve Count before calling MoveToArray. Memory<T> solves that.

danmoseley commented 5 years ago

I'm missing a piece here. Presumably your consuming API expects the array to be the correct size, but in the general case the List's array will have extra space. You have to set Capacity to the value of Count to cause List to trim the array, and at that point you've made the allocation/copy you want to avoid, except in the rare case it happens to be the correct size already. Can you clarify?

@svick you mention you have to retrieve Count before calling MoveToArray - could you clarify that, retrieving Count does not trim the array.

vcsjones commented 5 years ago

could you clarify that, retrieving Count does not trim the array.

I think the issue is that the API proposal also resets the internal array back to an initial value. This method essentially clears out the list. So if you want to know exactly how many items are in the array (since the array's capacity is often bigger than the actual Count) you need to get the real count before it gets cleared out by this.

I'm leaning toward something simpler like an ArrayBuilder type.

svick commented 5 years ago

@danmosemsft Calling Count, doesn't trim anything, but MoveToArray does reset the List's Count.

This means that e.g. if I did want to create Memory<T> from the API proposed here, return new Memory<T>(list.MoveToArray(), 0, list.Count); would not work. It calls Count after MoveToArray(), so it would always return an empty Memory<T>. Instead, you would have to write:

var count = list.Count;
return new Memory<T>(list.MoveToArray(), 0, count);

That's one reason why I'm saying the API proposed here would be too easy to use incorrectly.

danmoseley commented 5 years ago

@svick, I see, but that is a rare API that takes an array + the count to consume from it. Most .NET API of course just take the array therefore could not benefit from this efficient path, if I understand right.

Your point about incorrect use is good too.

Clockwork-Muse commented 5 years ago

...I was going to point out earlier that, if we did this, we should be returning ArraySegment so that you didn't have to retrieve the count separately. I just was more curious if a different builder type would be better.

qbit86 commented 5 years ago

@Clockwork-Muse

If your API takes a List

...Then it violates Framework Design Guidelines :)

most callers are going to expect that their list stays the same

It's not unusual to see methods like Populate(SomeCollection<T> collection) which modifiy passed collection.

@danmosemsft

Presumably your consuming API expects the array to be the correct size

BCL APIs that consume arrays usually have overloads taking offset and count not relying on array length, like String.Join() in example above.

but in the general case the List's array will have extra space

I mentioned that in <remarks> and usage section in topmost message.

Can you clarify?

You need to save actual size before releasing ownership over underlying array. For example:

int count = list.Count;
int capacity = list.Capacity;
string[] array = list.MoveToArray();
Debug.Assert(capacity == array.Length);
Debug.Assert(count <= array.Length);
ReadOnlySpan<string> span = array.AsSpan(0, count);
qbit86 commented 5 years ago

@svick

I think the proposed MoveToArray is too easy to misuse: there is no indication that the returned array is only partially filled and that to use it correctly

Probably it's naming issue. I was inspired by ImmutableArray<T>.Builder.MoveToImmutable(). But can call it Release() like in std::unique_ptr<T>.

I would definitely prefer something that returned Memory<T>

It's too abstract; and not that much existing APIs consume Memory<T>, but a lot of do consume arrays or their segments.

What I'm proposing is not adding some new high level feature, but rather revealing long time existing but hidden functionality. Missing natural core mechanics of concrete class, that should be available since .NET 2.0. It is not about IList<T> and Memory<T>, but rather List<T> and T[]. Data structures, not abstractions. It is barely more destructive or dangerous than list.Clear(); list.TrimExcess();. Think of Release() as some advanced Clear() with optional reuse of abandoned buffer.

danmoseley commented 5 years ago

cc @stephentoub for thoughts.

Clockwork-Muse commented 5 years ago

@qbit86 - Hey, I'd look askance at a library taking a List as a parameter too. What I was really getting at is that you aren't actually interested in List, what you really want is a growable collection with array-like semantics.

As for Populate(...)... most of those are going to be explicit about modifying the parameter handed to them. A List passed to a method that calls this are logically read-only parameters, so the fact they get cleared would be really strange.

Which is why I think I'd prefer a non-List builder type.

svick commented 5 years ago

@qbit86

Probably it's naming issue.

It's not a naming issue for me. The issue is that you're returning the whole array, when what you logically want to return is only the valid part of the array.

It's too abstract; and not that much existing APIs consume Memory, but a lot of do consume arrays or their segments.

That's a good point. It's possible to get array from Memory<T>, but it's not exactly obvious. ArraySegment<T> is probably a better choice:

Think of Release() as some advanced Clear() with optional reuse of abandoned buffer.

If that's the case, then I think the returned array should be zeroed-out. But then that wouldn't work for your string.Join example.

qbit86 commented 5 years ago

@Clockwork-Muse

What I was really getting at is that you aren't actually interested in List, what you really want is a growable collection with array-like semantics.

It would be nice to have a clean minimalistic ArrayBuilder<T> too. But in reality List<T> is default and the most widely adopted collection over existing codebases. There are already a lot of lists here and there with all those enumerable.ToList() calls, in-place sorting and filtering. Adding yet another layer for creating ArrayBuilder<T> from existing instance of concrete List<T> then moving it to array... I don't know. I suppose that revealing single missing functionality at existing List<T> is useful regardless of having dedicated ArrayBuilder<T>. It simply belongs to List<T>. Sorry for my confused English :)

As for Populate(...)... most of those are going to be explicit about modifying the parameter handed to them. A List passed to a method that calls this are logically read-only parameters, so the fact they get cleared would be really strange.

Please take a look at this example:

internal readonly struct Foo
{
    internal ArraySegment<string> Segment { get; }

    // Private! Use factory method instead.
    private Foo(string[] array, int count)
    {
        Segment = new ArraySegment<string>(array, 0, count);
    }

    // TODO: Choose method name to be explicit about modifying the parameter.
    internal static Foo TransferOwnership(List<string> list)
    {
        if (list == null)
            throw new ArgumentNullException(nameof(list));

        int count = list.Count;
        return new Foo(list.MoveToArray(), count);
    }
}

Despite of passing a List<T> considered bad practise, here we've got useful guarantee: caller does not hold reference to underlying array after transfering ownership. So they cannot modify it and violate Foo's invariants. This move semantics is enforced, so no need in allocation for defensive copy.

(I'm a little biased against allocations since I'm from gamedev and Unity's GC is awful.)

qbit86 commented 5 years ago

@svick

ArraySegment<T> is probably a better choice

It is probably a better choice for dedicated ArrayBuilder<T>. Even better is to have new ArrayPrefix<T> instead because we don't need a triple (array, offset, count), only a pair (array, count) as far as offset is 0.

But for List<T> I'd prefer more simple, primitive approach. Expressed in most basic terms close to implementation.

If that's the case, then I think the returned array should be zeroed-out.

Of course not, I didn't mean that, sorry for confusion.

(Another issue with ArraySegment<T> is that it is a little bit crippled in netstandard2.0, while not so in netcoreapp2.0.)

stephentoub commented 5 years ago

I'd be skeptical about adding this to List<T>: it's effectively exposing internal implementation detail, and I'm not actually convinced it's what's really wanted/needed. If you're interested in avoiding allocations, then there's a good chance you also want to avoid the allocation for the List<T> itself. I'm also not yet convinced by the arguments about List<T> being ubiquitous; in all of the cases I've seen where this would be useful, you're building up an array using only minimal functionality, e.g. Add, and it's rare that you're doing it based on a list that someone else constructed and then handed off to you (further, in such cases, ownership semantics can get murky and I'd be concerned about destroying the list). If this is something that's needed publicly, I'd rather see it done as a struct-based ArrayBuilder<T>, e.g. like what we use in LINQ: https://github.com/dotnet/corefx/blob/master/src/Common/src/System/Collections/Generic/ArrayBuilder.cs

Clockwork-Muse commented 5 years ago

There are already a lot of lists here and there with all those enumerable.ToList() calls, in-place sorting and filtering. Adding yet another layer for creating ArrayBuilder from existing instance of concrete List then moving it to array

...uh, if you're doing ToList(), why not ToArray() instead? If somebody else's code is handing you a List, that's going to be in violation of those same guidelines you posted earlier (as opposed to IList), and even then, as @stephentoub says, ownership can be a problem (we're not Rust, here).

And while you could, potentially, splat a List into an ArrayBuilder, the "real" use case is avoiding someEnumerable.someLinqMethods.ToArray(), and more often just filtering an existing array, like in @drieseng 's earlier examples.

Speaking of those examples, looking through them again, I just noticed that one of them, possibly (not having read through all the code), would break some part of ASP .NET. So there's that...

Please take look at this example:

Um.
That doesn't necessarily save you from defensive copies, it entirely depends on who/how the List was originally created, and if they were done with it. Which is why, if I was concerned about allocations/indirection and was passing around arrays instead of IEnumerable, I'd do some form of an ArrayBuilder - basic code is trivial and quick to write. For that matter, if I was feeling fancy (and the scope was amenable), I could make it use an ArrayPool, which is something a List-based call can't really rely on.

...that's assuming I wasn't looking at Memory and stackalloc-ing stuff...

qbit86 commented 5 years ago

@Clockwork-Muse

...uh, if you're doing ToList(), why not ToArray() instead?

.ToArray() in general makes one more array allocation and copy than .ToList() (but one less allocation for List<T> itself) — at least last time I've checked this. They are both aggregating items consuming some enumerable, but when it ends, .ToList() remains with that last buffer probably with some extra capacity, while .ToArray() has to return the array with exact length.

...uh, if you're doing ToList(), why not ToArray() instead?

I've encountered different scenarios, like initialize list with .ToList() from some enumerable or .AddRange(enumerable), then populating with some more items, then reordering an so on. All those common things for which List<T> was designed to.

we're not Rust, here

Actually, we are :) At least since introduction of Span<T> here. The main semantic difference between taking span parameter or taking arraySegment parameter is that first guarantees that callee does not hold reference to passed array. See section Buffer Pooling for details. If it is not Rust's borrowing ownership, then what else is it? :)

qbit86 commented 5 years ago

@Clockwork-Muse

For that matter, if I was feeling fancy (and the scope was amenable), I could make it use an ArrayPool, which is something a List-based call can't really rely on.

This is exactly what I've done in Rist package.

qbit86 commented 5 years ago

@stephentoub

I'd rather see it done as a struct-based ArrayBuilder<T>, e.g. like what we use in LINQ

it's effectively exposing internal implementation detail

But it is ArrayBuilder<T> who exposes details in unsafe way via Buffer property. While List<T>.MoveToArray() perfectly preserves encapsulation of its instance (abandoned array is independent from List<T> instance anymore).

I agree, some form of new public ArrayBuilder<T> struct would be nice too, but List<T>.MoveToArray() just looks more convenient and natural.

drieseng commented 5 years ago

I'd prefer a public ArrayBuilder\ over List\.MoveToArray(). The reason indeed being that the latter is effectively only something you'd call on a List\ instance that you constructed yourself (in order to end up with an array of T). Even though it's not a good fit for a builder pattern, it would be great if ArrayBuilder\ could also expose a bool Contains(T value) method.

qbit86 commented 5 years ago

@svick

The issue is that you're returning the whole array, when what you logically want to return is only the valid part of the array.

There are already existing such APIs in BCL, for example MemoryStream.GetBuffer().

Comparing to this API, List<T>.ReleaseBuffer() is more safe. It does not provide access to internals of instance, because these internals immediately stop belonging to instance after the call. So mentioned safety considerations are rather about misusing array buffers and numerous existing APIs that consume array buffers — not safety of List<T> itself.

I beleive ReleaseBuffer() is strictly less dangerous than standard GetBuffer(), and it's not List<T>'s responsibility to wrap it in additional layers — inappropriate level of abstraction.

@drieseng

I'd prefer a public ArrayBuilder<T> over List<T>.MoveToArray()

I'd prefer both :) ArrayBuilder<T> providing higher level ArraySegment<T> (or even better ArrayPrefix<T>), and List<T>.ReleaseBuffer() enhancing “default” collection with missing mechanics.

stephentoub commented 5 years ago

It does not provide access to internals of instance

But it makes strong assumptions about those internals. What if, for example, in the future we "optimized" List by having it store an array of arrays of elements rather than an array of elements, in order to avoid copying every time it grows. Then this MoveToArray method doesn't make sense, highlighting that it is depending on implementation details.

qbit86 commented 5 years ago

@stephentoub

But it makes strong assumptions about those internals.

Isn't it explicitly stated that List<T> utilises growing array under the hood?

“It implements the List<T> generic interface by using an array whose size is dynamically increased as required.”

“Implements a variable-size List that uses an array of objects to store the elements. A List has a capacity, which is the allocated length of the internal array. As elements are added to a List, the capacity of the List is automatically increased as required by reallocating the internal array.”

What if, for example, in the future we "optimized" List by having it store an array of arrays

I suppose such “optimization” would have more dramatical effects than adding mising ReleaseBuffer(), so it would definitely require separate ChunkedList<T> class.

stephentoub commented 5 years ago

I'm not saying we would make such a change; it would hurt other aspects of performance. I'm saying: a) This bolts on something to a hugely widely used type it wasn't intended for b) It exposes internal implementation details c) For the use case, it's both significant overkill and at the same time not ideal.

stephentoub commented 5 years ago

Isn't it explicitly stated that List utilises growing array under the hood?

In remarks in docs. Where in the API itself?

GSPP commented 5 years ago

I have seen many requests to get access to the internal list buffer over time. I myself had a few cases where that would have been convenient. Of course, it can be very unsafe to expose a live buffer. I think this proposal is a very elegant solution for the safety aspect because it removes the buffer from the list. I actually find this to be a beautiful solution.

The semantics of MoveToArray are almost identical to Clear. There are not additional safety or complexity issues added by this proposal.

I think the API surface of this can be made very clean with the proposed ArrayPrefix<T> type. It could even look like myList.MoveToArray().ToSpan(). The ToSpan would discard the unused tail of the array.

We would need to document that List<T> is backed by a single array. This is the only implementation choice that makes sense in practice. In particular, indexer performance must be very fast. This forces a single array design. Also, compatibility concerns have long locked in this particular design for all time.

IList<T> cannot be changed. Also, some IList<T>-like collections are not implemented by a single backing array. There could be a new IBackingArrayAccess<T> but the use case for that likely is too insignificant.

A builder type (e.g. ArrayBuilder<T>) has been proposed as an alternative. There were concerns that MoveToArray is not suitable for the intended use of List<T>. I don't see why. What is the intended use case of List<T> even? Let's just make this type as useful as possible. It is a core workhorse of the framework. Let's not mess it up, either, but this proposal would not do that.

I could see the same proposal working for other types such as Queue<T> and Stack<T>. They are similarly locked in to the array based design. They might just as well expose their buffers in this newly proposed safe way.

qbit86 commented 5 years ago

@stephentoub

In remarks in docs. Where in the API itself?

I supposed that stating in official documentation is enough 🤷 Like in case when documentation says that collection is sorted, but there cannot be any evidence in API.

For the use case, it's both significant overkill and at the same time not ideal.

The use case is to avoid premature pessimisation of using list.ToArray() where list.ReleaseBuffer() is enough. It's sufficient for existing codebases which are already using List<T> as default collection. This cannot be achieved over current List<T> API.

It's not a replacement for proposed ArrayBuilder<T> struct. (Besides, to be honest, I don't see how to make ArrayBuilder<T> both mutable value type and ownership-safe at the same time, considering use-after-free.)

stephentoub commented 5 years ago

I don't see how to make ArrayBuilder both mutable value type and ownership-safe at the same type, considering use-after-free.

Convention. And the 90% use case as far as I'm aware doesn't involve passing it around, just building it up in a single method. For others, you pass it around by ref and the user is responsible for using it correctly. Yes, there's the potential for bugs. That's often the case when trying to optimize things. It's the trade-off one makes for minimizing allocations as much as possible.

stephentoub commented 5 years ago

there cannot be any evidence in API

Other than the name of the class? :smile:

NN--- commented 5 years ago

@GSPP IMO the real problem is the naming of classes. Unlike Java, in .NET names are not following any convention. For instance, it is stated in documentation that List is generic version of ArrayList. It is clear that ArrayList just uses array inside from the name but not in List. Same for Stack and Queue and other collections. They should have been called ArrayStack, ArrayQueue.

qbit86 commented 5 years ago

@NN---

Unlike Java, in .NET names are not following any convention.

Actually they do follow, but rules are slightly more complicated. I remember them from FDG book, but failed to quickly find proof link in online version.

The recommendation is that among standard implementations of some abstraction IFoo (like SomeFoo, AnotherFoo, SpecializedFoo) implementer should choose one to be “default”, and name it just Foo. So List<T> is such shorthand for ArrayList<T>, Dictionary<K, V> is HashTableDictionary<K, V>, and so on.

gfoidl commented 5 years ago

It is clear that ArrayList just uses array inside from the name but not in List.

That List<T> uses an array is an implementation detail, that could change (in theory). So it makes sense that Array is not part of the name.

NN--- commented 5 years ago

I bet List is not called ArrayList just to have difference for developers and to be inline with Stack, Queue. https://msdn.microsoft.com/en-us/library/ms379564(v=vs.80).aspx The class List<T> is analogous to the non-generic ArrayList.

qbit86 commented 5 years ago

@NN---

Unlike Java, in .NET names are not following any convention. For instance, it is stated in documentation that List is generic version of ArrayList. It is clear that ArrayList just uses array inside from the name but not in List.

The convention is: ✓ DO ensure that the names differ only by the "I" prefix on the interface name when you are defining a class–interface pair where the class is a standard implementation of the interface.

So the name was not chosen to keep the class uncertain abstraction rather than concrete data structure with predictable characteristics.

qbit86 commented 5 years ago

@stephentoub Please take a look at MemoryStream.TryGetBuffer signature:

public virtual bool TryGetBuffer(out ArraySegment<byte> buffer);

What if proposed method will have the same signature?

public bool TryReleaseBuffer(out ArraySegment<T> buffer) { ... }

Now we have flexibility in changing List<T> implementation (it will start returning false if not feasible), and it fits well existing APIs which consume Span<T> or triple (array, offset, count). And it is not something completely unfamiliar to users, because FCL already has similar method MemoryStream.TryGetBuffer.

Clockwork-Muse commented 5 years ago

@qbit86 - the problem is that MemoryStream allows you to pass it the backing array store in some of its constructors, which isn't something you can do with List.

I assume this is because of the intended use-case for MemoryStream, and you might want to create a strict, size-limited stream from memory you explicitly created, possibly a pooled array. So getting the array back out, that you put in, is really kind of expected.

With List, though, you're not in control of the growth of the backing array, or in how it's allocated, so letting that escape is problematic. And if we did change the underlying implementation, making a TryReleaseBuffer method blanket-return false would break everybody's code (or sporadically, like if List started using chunked arrays above a certain number of elements, which would be worse). At which point they'd be looking for a dedicated ArrayBuilder type, pretty much.