dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Add IList<T> extension methods providing in-place update operations #76375

Open eiriktsarpalis opened 1 year ago

eiriktsarpalis commented 1 year ago

Background and motivation

The List<T> type provides a number of methods for bulk in-place updates, such as RemoveAll, RemoveRange, Reverse, and Sort. The same cannot be said of IList<T>, where common in-place operations are cumbersome to achieve. Consider, for instance, the following System.Text.Json example used in our own docs repo:

https://github.com/dotnet/docs/blob/6c6942bcb4594dc063db293e5fff33c6bea0b331/docs/standard/serialization/system-text-json/snippets/custom-contracts/IgnoreType.cs#L26-L38

A possible alternative to the above would be to roll your own extension method, but it would be even better if we shipped extension methods like that out of the box.
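
For reference, the kind of hand-rolled helper that discussion points at looks roughly like the following (a minimal sketch; the class name and shape are illustrative, not the proposed API):

public static class MyListExtensions
{
    public static void RemoveAll<T>(this IList<T> list, Predicate<T> match)
    {
        // Walk backwards so RemoveAt doesn't disturb the indices we still need to visit.
        for (int i = list.Count - 1; i >= 0; i--)
        {
            if (match(list[i]))
            {
                list.RemoveAt(i);
            }
        }
    }
}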

API Proposal

namespace System.Collections.Generic;

public static partial class CollectionExtensions
{
    public static void AddRange<T>(this IList<T> list, IEnumerable<T> collection);
    public static void InsertRange<T>(this IList<T> list, int index, IEnumerable<T> collection);
    public static void RemoveAll<T>(this IList<T> list, Predicate<T> predicate);
    public static void RemoveRange<T>(this IList<T> list, int index, int count);
    public static void Reverse<T>(this IList<T> list);

    public static void Sort<T>(this IList<T> list);
    public static void Sort<T>(this IList<T> list, Comparison<T> comparison);
    public static void Sort<T>(this IList<T> list, IComparer<T> comparer);
    public static void Sort<T>(this IList<T> list, int index, int count, IComparer<T> comparer);
}

API Usage

Removing properties of a given type in System.Text.Json can now be expressed as follows:

    public void ModifyTypeInfo(JsonTypeInfo ti)
    {
        if (ti.Kind != JsonTypeInfoKind.Object)
            return;

        ti.Properties.RemoveAll(prop => _ignoredTypes.Contains(prop.PropertyType));
    }

Alternative Designs

* Should we consider exposing these as DIMs?
* Should the Sort implementation be stable? Or match List<T>/Array.Sort semantics?
* Should we include BinarySearch extension methods?

Risks

No response

Related to https://github.com/dotnet/core/issues/2199

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-collections. See info in area-owners.md if you want to be subscribed.

stephentoub commented 1 year ago

Should the Sort implementation be stable? Or match List<T>/Array.Sort semantics?

If we add such extension methods, they should have the same semantics as List<T>. Otherwise, changing code from:

IList<T> list = new List<T>() { ... };
list.Sort(...);

to:

List<T> list = new List<T>() { ... };
list.Sort(...);

is a very subtle and unexpected source of bugs.

If we update List<T>/Array.Sort to be stable, then such extensions should follow suit as well.

eiriktsarpalis commented 1 year ago

If we update List<T>/Array.Sort to be stable

Assuming it doesn't regress perf we could do that; to my knowledge there is no stable in-place sorting implementation available.

eiriktsarpalis commented 1 year ago

And if not, we might consider using a different name, e.g. StableSort.

stephentoub commented 1 year ago

Assuming it doesn't regress perf we could do that; to my knowledge there is no stable in-place sorting implementation available.

There's been a long discussion of this in https://github.com/dotnet/runtime/pull/68436.

jkotas commented 1 year ago

How are the Sort methods expected to be implemented? How efficient is the implementation going to be, and is it going to require yet another copy of the whole Sort algorithm?

In-place sorting of an IList<T> will require a virtual method call for each element get and set. I expect that it is often going to be faster to convert the IList<T> to an array first and then sort the array.

eiriktsarpalis commented 1 year ago

In-place sorting of an IList<T> will require a virtual method call for each element get and set. I expect that it is often going to be faster to convert the IList<T> to an array first and then sort the array.

Possibly. We might consider using an implementation that sorts using a temporary buffer, i.e.

public static void Sort<T>(this IList<T> list, IComparer<T>? comparer = null)
{
    // Copy to a temporary buffer, sort that, then copy the sorted elements back in place.
    T[] array = new T[list.Count];
    list.CopyTo(array, 0);

    Array.Sort(array, comparer);

    for (int i = 0; i < array.Length; i++)
        list[i] = array[i];
}

This might reduce the number of virtual calls, but it does more allocation and copying. Benchmarking it against a bespoke IList<T> sorting algorithm would help determine a possible approach.

Clockwork-Muse commented 1 year ago

... for that matter, checking if it is an actual List<T> would probably be reasonable, given how commonly it's used.

eiriktsarpalis commented 1 year ago

Checking for List<T>/arrays would definitely be added to the implementation if we do follow that approach.
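
A rough sketch of what those checks could look like (illustrative only, not a committed implementation):

public static void Sort<T>(this IList<T> list, IComparer<T>? comparer = null)
{
    switch (list)
    {
        case List<T> concreteList:
            concreteList.Sort(comparer); // fast path: no interface dispatch per element
            break;

        case T[] array:
            Array.Sort(array, comparer);
            break;

        default:
            // Fallback: copy out, sort the buffer, copy back
            // (or a bespoke IList<T> sorting algorithm).
            T[] buffer = new T[list.Count];
            list.CopyTo(buffer, 0);
            Array.Sort(buffer, comparer);
            for (int i = 0; i < buffer.Length; i++)
                list[i] = buffer[i];
            break;
    }
}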

terrajobst commented 1 year ago

Video

namespace System.Collections.Generic;

public static partial class CollectionExtensions
{
    public static void InsertRange<T>(this IList<T> list, int index, IEnumerable<T> collection);
    public static void RemoveAll<T>(this IList<T> list, Predicate<T> predicate);
    public static void RemoveRange<T>(this IList<T> list, int index, int count);
    public static void Reverse<T>(this IList<T> list);

    public static void Sort<T>(this IList<T> list);
    public static void Sort<T>(this IList<T> list, Comparison<T> comparison);
    public static void Sort<T>(this IList<T> list, IComparer<T> comparer);
    public static void Sort<T>(this IList<T> list, int index, int count, IComparer<T> comparer);
}

Joe4evr commented 1 year ago

One thing that came to mind as I was watching the API review: introducing Reverse might be source-breaking in certain existing cases, since it could either tie overload resolution with LINQ's Reverse or, if it was introduced as a DIM on IList<T>, win resolution and break LINQ method chains.

At least, I think that could happen, please correct me if I'm wrong.

stephentoub commented 1 year ago

Introducing Reverse might be source-breaking in certain existing cases

Yes, although that's the case pretty much any time we add any instance method or extension method. For instance methods, if someone else had named an extension method the same thing, the newly-added instance method would start winning. And for extension methods, if someone else also has an extension method with the same shape, it would introduce ambiguities that break compilation.
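
For the extension-vs-extension case, for example (hypothetical user code; the BCL extension does not exist yet):

IList<int> values = new List<int> { 1, 2, 3 };
// Today this binds to MyListExtensions.Reverse below (its IList<T> parameter is more
// specific than Enumerable.Reverse's IEnumerable<T>). If an identical
// CollectionExtensions.Reverse shipped and were in scope, the call would become
// ambiguous (CS0121) until the user disambiguates.
values.Reverse();

static class MyListExtensions
{
    public static void Reverse<T>(this IList<T> list)
    {
        for (int i = 0, j = list.Count - 1; i < j; i++, j--)
            (list[i], list[j]) = (list[j], list[i]);
    }
}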

krwq commented 1 year ago

Possibly they could live in a different namespace to avoid this problem (e.g. System.Collections.Generic.Extensions). As a side note, what does DIM stand for? (Dimension doesn't seem to make sense in this context.)

stephentoub commented 1 year ago

what does DIM stand for

Default Interface Methods... basically adding virtual methods to an interface.
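
A tiny illustrative example (the interface and class here are made up, not BCL types):

public interface IShape
{
    double Area();

    // Default interface method: implementers get this for free and may override it.
    string Describe() => $"Shape with area {Area()}";
}

public class Square : IShape
{
    public double Side { get; init; }
    public double Area() => Side * Side;

    // No Describe() needed; the interface's default body is used when called
    // through the interface:
    //   IShape s = new Square { Side = 2 };
    //   s.Describe(); // "Shape with area 4"
}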

GSPP commented 1 year ago

Collection capabilities could be added with interfaces such as:

public interface ISortable<T> // No base type
{
    void Sort(...);
}

And CollectionExtensions.Sort<T>(IList<T> list, ...) would then try to cast and delegate to an efficient implementation. Note that this interface does not derive from IList<T>. This design does not require crafting elaborate type hierarchies.

Conceptual sample implementation:

public static void Sort<T>(this IList<T> list, ...)
{
    if (list is ISortable<T> sortable) sortable.Sort(...);
    else SortIList(list); //Generalized algorithm for any IList<T>
}

This idea could be applied to other algorithms such as binary search as well. Others that come to mind: ToList, ToArray, the mutation operations for ranges from the opening post, copying to and from spans, linear search, ... All of these can then break through the abstraction and operate on the internal representation more efficiently.

This has some overhead, but it results in reasonable behavior and performance in most cases. It is hard for users to really go wrong with this. Invocations on small data sets will feel some overhead, but performance at scale will be near optimal, both asymptotically and practically.

This should also play nice with overload resolution. The compiler will automatically prefer ways of calling that have less indirection in them. It also should be a highly compatible design. It's easy to upgrade types and add new algorithms.

KrzysztofCwalina commented 11 months ago

IList<T> extension AddRange is very important for the Azure SDK. We expose 90%+ of model collection properties as IList<T> and very often our users complain they have no convenient and efficient* way to add ranges.

* I would love it if the implementation attempted to cast to List<T> and used List<T>.AddRange in the fast-path branch.
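
Something along these lines, presumably (a sketch, not the proposed implementation):

public static void AddRange<T>(this IList<T> list, IEnumerable<T> collection)
{
    if (list is List<T> concreteList)
    {
        // Fast path: List<T>.AddRange can pre-size and bulk-copy internally.
        concreteList.AddRange(collection);
        return;
    }

    foreach (T item in collection)
    {
        list.Add(item);
    }
}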

stephentoub commented 11 months ago

@CyrusNajmabadi, @cston, if we add an AddRange extension method on IList<T> or a DIM IList<T>.AddRange method, what implications will that have on collection expressions?

cston commented 11 months ago

what implications will that have on collection expressions?

When constructing a collection for a collection expression targeting IList<T> or a type that implements IList<T>, the compiler may use an AddRange extension method or instance method.

Currently, the compiler is not using AddRange methods when constructing collections from collection expressions, but we expect to use AddRange in the future, particularly when spreads are involved:

class MyList<T> : IList<T> { ... }

IEnumerable<int> x = [1, 2, 3];
MyList<int> y = [..x]; // y = new MyList<int>(); ((IList<int>)y).AddRange(x);

terrajobst commented 11 months ago

IList<T> extension AddRange is very important for the Azure SDK. We expose 90%+ of model collection properties as IList<T>

What is the type of the underlying instance? According to our guidelines, we don't expect libraries to expose their data as IList<T> but as some kind of class that can expose additional APIs.

terrajobst commented 11 months ago

@stephentoub @cston why does it matter if IList<T> would have an extension method? I'd assume the compiler would have to decide which type to instantiate (which I'd assume is List<T>), and what would matter for optimizations is which APIs the concrete type exposes, rather than the target type, no?

stephentoub commented 11 months ago

@terrajobst, it's about populating the destination type. If you have:

class Foo : IList<T>
{
    public Foo() { }
    public void Add(T item) { ... }
    ... // IList implementation
}

and someone writes

IEnumerable<T> data = ...;
Foo foo = [..data];

does the C# compiler generate:

Foo foo = new Foo();
foreach (T item in data) foo.Add(item);

or:

Foo foo = new Foo();
CollectionExtensions.AddRange(foo, data);

or...

terrajobst commented 11 months ago

Right, so what matters is whether Foo (an instantiable type) has AddRange, not some interface abstraction (such as IList<T>) does, right?

stephentoub commented 11 months ago

Right, so what matters is whether Foo (an instantiable type) has AddRange, not some interface abstraction (such as IList<T>) does, right?

I don't understand. Here Foo implements IList<T> and does not expose an AddRange. My question to Chuck was exactly whether the compiler will use an extension AddRange on such a type in that case.

But there's also the case where the target type is an interface, e.g. this is perfectly legal:

IList<T> list = [..data];

and the question there applies as well: would the compiler use an extension AddRange?

Chuck's answer was that the planned answer to both questions is "yes".

terrajobst commented 11 months ago

Ah, got it. That makes sense.

I'd hope that the compiler would prefer an AddRange() method on the concrete type over an AddRange() extension method though, just as it generally does when both exist.

KrzysztofCwalina commented 11 months ago

What is the type of the underlying instance? According to our guidelines, we don't expect libraries to expose their data as IList<T> but as some kind of class that can expose additional APIs.

In most cases it's List<T>, but sometimes a custom internal type. BTW, the reason we do it is that we want to be able to switch the implementation from List<T> to a collection that can observe changes made to it. And this is the reason we don't follow the guideline, which I have to say I was not even aware of, nor do I understand why we would have it.

terrajobst commented 11 months ago

See these, specifically

✔️ DO use Collection<T> or a subclass of Collection<T> for properties or return values representing read/write collections.
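
Roughly speaking, what Collection<T> buys you is that a subclass can observe every mutation through its protected hooks and add public members later; a minimal sketch (the subclass and its names are made up for illustration):

using System.Collections.ObjectModel;

public class PropertyList : Collection<string>
{
    public bool IsDirty { get; private set; }

    protected override void InsertItem(int index, string item)
    {
        base.InsertItem(index, item);
        IsDirty = true;
    }

    protected override void SetItem(int index, string item)
    {
        base.SetItem(index, item);
        IsDirty = true;
    }

    protected override void RemoveItem(int index)
    {
        base.RemoveItem(index);
        IsDirty = true;
    }

    protected override void ClearItems()
    {
        base.ClearItems();
        IsDirty = true;
    }
}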

KrzysztofCwalina commented 11 months ago

Ah, yeah. That one. The biggest regret of my life :-)

terrajobst commented 11 months ago

Seems to cut both ways though. Oftentimes where we shipped a collection property as IList<Something>, there were issues along the lines of "man, I wish I could add a method or property to this thing". Another problem we saw is that people (sometimes inadvertently) downcast the interface to the concrete type, which made it hard for us to change the concrete type later, but this can be mitigated by using an internal type that can't be downcast.

CyrusNajmabadi commented 11 months ago

Pulling in @jnm2 @cston @RikkiGibson and @captainsafia for the collection-expr side of this.

I have thoughts (and my initial thinking is that this is actually a good thing for collection-exprs), but I'm heads down on a fire, so I'll write things out later :)

CyrusNajmabadi commented 11 months ago

OK. So the topic of AddRange has come up for when C#/Roslyn sees something like this:

// Where SomeDestCollection does *not* have the CollectionBuilderAttribute on it.
SomeDestCollection x = [x, y, z, .. w, .. v];

The default emit for this will effectively be:

SomeDestCollection __result = new();
__result.EnsureCapacity(3 + __w.Length + __v.Length); // if EnsureCapacity exists and `w/v` are statically known to expose lengths.
__result.Add(__x);
__result.Add(__y);
__result.Add(__z);
foreach (var __t in __w)
  __result.Add(__t);
foreach (var __t in __v)
  __result.Add(__t);

However, if we do find available AddRange methods, we are permitted to use them, though the exact circumstances under which the compiler will do so are still up in the air and need more thinking about when they would be appropriate. But, if they were available, the compiler could potentially generate something like the following:

SomeDestCollection __result = new();
__result.EnsureCapacity(3 + __w.Length + __v.Length); // if EnsureCapacity exists and `w/v` are statically known to expose lengths.
__result.AddRange(new InlineArray3<> { __x, __y, __z }.AsSpan()); // If there's an AddRange that takes spans.
__result.AddRange(__w); // if there are AddRange's that support the instances being spread.
__result.AddRange(__v);

However, we do have some general concerns about this. Consider a real world example:

HashSet<int> hashSet = ...;
List<int> list = [1, 2, 3, .. hashSet];

With this proposal, we would now see an AddRange on List<T> that could take that hashSet. However, would that actually be a good thing to call? Instead of a no-alloc approach to adding all the elements (by enumerating hashSet in the caller), this would call into the IEnumerable codepath, which would allocate the IEnumerator and would be doing virtual calls on the iteration.

As such, while we're still only in the preliminary thinking stage, we're leaning toward only using these helpers when the exact types match (so we're not losing information, or the ability to enumerate without allocation), or perhaps when we know the input types and we're calling some well-known BCL AddRange method.

For example, if we knew the above AddRange was fine when you passed in a List<T> (because it type-checked for that type under the covers) and we knew we were passing a List<T>, then we might consider it OK. The LDM has 100% approved that the compiler can take what it knows about the BCL and optimize heavily given the latest shipping BCL impl. We will, however, likely be more conservative with user patterns, since we don't know whether calling them might actually have a negative impact on final performance.

jnm2 commented 11 months ago

I'd be hesitant to let the compiler call an AddRange<T>(IList<T>) extension method since, in all cases I've thought of, it's pure overhead compared to what the compiler could be doing instead. That overhead includes:

However, it is less IL emitted at the collection expression site, and I'm not sure if that's interesting as a code size benefit.

Neme12 commented 5 months ago

Instead of extension methods, maybe these could be sealed methods directly on the interfaces? That would add the opportunity to later make them defaulted/virtual. Or would that be a breaking change?

For .NET 8, we should investigate the usage of DIMs in core types to see whether it's viable.

They could be non-DIM on the interface, if we're afraid of DIMs on core types.

It would also mean they don't show up on concrete implementations, thus alleviating the compiler concern.

Neme12 commented 5 months ago

It seems the proposal is mostly about convenience, not performance.

I usually define these extension methods so that they could be specialized for common types like List<T>, thus potentially having better perf than inserting manually one by one, depending on the implementation.

Neme12 commented 5 months ago

Often times where we shipped a collection property as IList<Something> there were issues along the lines of "man I wish I could add a method or property to this thing".

I mean that's what DIMs are for 😄

I think it's really time for the DIMs. They are already used all over the place for core primitive types like Int32 due to generic math. They already broke C++/CLI and they already fixed it. This can't be more breaking or more dangerous than adding interfaces with static abstracts, DIMs and operators to core types like Int32. The band-aid was already ripped off.

Neme12 commented 5 months ago

I wouldn't add all of these overloads as DIMs though:

    public static void Sort<T>(this IList<T> list);
    public static void Sort<T>(this IList<T> list, Comparison<T> comparison);
    public static void Sort<T>(this IList<T> list, IComparer<T> comparer);
    public static void Sort<T>(this IList<T> list, int index, int count, IComparer<T> comparer);

I think only one of them should be a DIM and virtual, probably the last one. The rest should be non-virtual (sealed) and delegate to that one.
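
Roughly this shape, sketched on a stand-in interface since IList<T> itself can't be redeclared here (ISortableList<T> and its members are hypothetical):

public interface ISortableList<T>
{
    int Count { get; }
    T this[int index] { get; set; }

    // Only the most general overload is virtual (a DIM with a default body)...
    void Sort(int index, int count, IComparer<T>? comparer)
    {
        // Default implementation: copy the range out, sort the buffer, copy it back.
        T[] buffer = new T[count];
        for (int i = 0; i < count; i++)
            buffer[i] = this[index + i];

        Array.Sort(buffer, comparer);

        for (int i = 0; i < count; i++)
            this[index + i] = buffer[i];
    }

    // ...and the remaining overloads are sealed (non-virtual) and delegate to it.
    sealed void Sort() => Sort(0, Count, null);
    sealed void Sort(Comparison<T> comparison) => Sort(0, Count, Comparer<T>.Create(comparison));
    sealed void Sort(IComparer<T>? comparer) => Sort(0, Count, comparer);
}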

Neme12 commented 5 months ago

I don't agree that we shouldn't add the default implementation because it would be slow. It would be no slower than what people can already do today on top of the interface manually (and what they do do), or what they could possibly ever do on top of the interface. If we had a second interface like IBulkOperationsList<T>, then anyone who wanted to e.g. InsertRange would have to check whether the list is an IBulkOperationsList<T> and, if it is, use InsertRange, otherwise do it manually on top of Insert. At that point, you might as well have a single implementation on top of Insert without the check, for the code to be simpler. This would defeat the entire purpose of the API: it should do this check for you and forward to the best possible implementation.

EDIT:

At that point, you might as well have a single implementation on top of Insert without the check, for the code to be simpler.

Or you'd add your own extension method that does the appropriate thing. At which point, we might as well have the DIM, since this would be equivalent to that.

Actually, even the default DIM would be much better than people doing their own custom logic or their own extension methods like today, because they will probably implement it naively by repeatedly doing Insert or Remove from the middle when it's possible to implement it in a better way by copying and removing from the end.
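
For RemoveAll, for instance, the difference looks roughly like this (both sketches are illustrative):

// Naive: each RemoveAt from the middle shifts every later element (O(n) per removal).
public static void RemoveAllNaive<T>(this IList<T> list, Predicate<T> match)
{
    for (int i = list.Count - 1; i >= 0; i--)
        if (match(list[i]))
            list.RemoveAt(i);
}

// Better: compact the survivors toward the front in one pass, then trim the tail
// from the end so the removals never have to shift elements.
public static void RemoveAllCompacting<T>(this IList<T> list, Predicate<T> match)
{
    int target = 0;
    for (int i = 0; i < list.Count; i++)
    {
        if (!match(list[i]))
        {
            if (target != i)
                list[target] = list[i];
            target++;
        }
    }

    for (int i = list.Count - 1; i >= target; i--)
        list.RemoveAt(i);
}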

kronic commented 3 months ago

Maybe it's worth adding more methods:

public void Move(int oldIndex, int newIndex);
public void MoveRange(int fromIndex, int toIndex, int count);
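
A naive IList<T>-based Move could presumably be implemented along these lines (a sketch only; MoveRange omitted):

public static void Move<T>(this IList<T> list, int oldIndex, int newIndex)
{
    if (oldIndex == newIndex)
        return;

    // Same remove-then-reinsert semantics as ObservableCollection<T>.Move.
    T item = list[oldIndex];
    list.RemoveAt(oldIndex);
    list.Insert(newIndex, item);
}
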
julealgon commented 3 months ago

See these, specifically

✔️ DO use Collection<T> or a subclass of Collection<T> for properties or return values representing read/write collections.

@terrajobst is there more context on why this specific guideline was created? I've seen it before but I'll be honest in saying I never follow it, instead preferring to return either ICollection or IList or some other interface like the read-only or immutable variants.

I'm aware there is some overhead with the fact it's an interface (with the virtual calls and such), but does the advice still hold as strongly as it did way back when it was defined at the "beginnings" of .NET, considering all the advancements made to performance? Were there more concerns other than performance itself?


Collection capabilities could be added with interfaces such as:

public interface ISortable<T> // No base type
{
    void Sort(...);
}

I noticed nobody commented on this proposal from @GSPP. It was one of the first things that came to my mind when reading this proposal too. Is this not a good approach for some reason?

Maybe there would need to be a distinction between "sortable in-place" and the standard concept of being sortable, which is achievable just by implementing IComparable, though.