Closed epeshk closed 1 year ago
Tagging subscribers to this area: @dotnet/area-system-linq See info in area-owners.md if you want to be subscribed.
Author: | epeshk |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Linq` |
Milestone: | - |
MinBy
and MaxBy
also existed in MoreLinq before adding to BCL
@epeshk yep, true. I am just pointing out that if someone wants this functionality today, it is offered by a respectable third-party library. No need to wait until the advent of .NET 8 (assuming that this API proposal will be approved).
I'm marking this as Future for now because we don't have this on our plans for now. If anyone is willing to contribute implementation I will change it to 8.0 which should give API review higher priority
I want to contribute this feature and start working on it in https://github.com/epeshk/runtime/tree/countBy Should I open PR now, before API review?
Please hold until the issue is marked api-approved. Thanks!
Please also note return type and nullability concerns. It is probably good to discuss and remove bad options before review.
I think good choices for the return type are (ordered by my preferences):
ICounts<TKey, int> : IEnumerable<(TKey Key, int Count)>
IEnumerable<(TKey Key, int Count)>
IReadOnlyDictionary<TKey, int>
IEnumerable<KeyValuePair<TKey, int>>
This method also may or may not accept nulls as TKey
. An implementation without nulls would be simpler, but we have to think about it in terms of usability, and e.g. similar ToLookup
accept nulls.
For source-level compatibility with MoreLinq and SuperLinq, it would be helpful to keep return type as either ICounts<TKey, int>
, IEnumerable<(TKey Key, int Count)>
, or IEnumerable<KeyValuePair<TKey, int>>
. If this is done, then these libraries can simply disable their version of this operator starting in .net8 and consumers can be comfortable using the same code through multiple versions of .net/SuperLinq/MoreLinq.
Also, both MoreLinq and SuperLinq allow null
values for TKey
, if that impacts decision making at all.
Seeing as OrderBy
and OrderByDescending
had overloads to remove their predicate in the event that you were selecting the input as the key, should that be applied here?
x.CountBy();
x.CountBy(y => y.Key);
x.LongCountBy();
x.LongCountBy(y => y.Key);
Also, I feel like prefixing the methods with Long
removes their discoverability in intelisense. and in documentation.
@KieranDevvs Those versions of OrderBy
also remove the By
suffix, but Enumerable.Count()
and Enumerable.LongCount()
already exist and do something different than the CountBy
proposed here.
And since this is a fairly niche method, I don't think it's important for it to have a more succinct way to write .CountBy(x => x)
.
@krwq @eiriktsarpalis Since @epeshk offered to contribute implementation, should this be moved back to the 8.0 milestone, to speed up API review?
Any chance to get it in for the .NET 8 release?
@WeihanLi very unlikely at this point.
Name suggestions:
Counts()
/ CountsBy(keySelector)
GroupCount()
/ GroupCountBy(keySelector)
A potential alternative is having an AggregateBy
method:
public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TKey, TSource, TAccumulate> func,
Func<TSource, TKey> keySelector);
which lets you express CountBy
like so
source.AggregateBy(0, (count, _, _) => ++count, keySelector);
A potential alternative is having an
AggregateBy
method:public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>( this IEnumerable<TSource> source, TAccumulate seed, Func<TAccumulate, TKey, TSource, TAccumulate> func, Func<TSource, TKey> keySelector);
which lets you express
CountBy
like sosource.AggregateBy(0, (count, _, _) => ++count, keySelector);
I would suggest that the seed
may be determined by the key, so I propose:
public static IEnumerable<(TKey key, TAccumulate result)> AggregateBy<TSource, TKey, TAccumulate>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
Func<TKey, TAccumulate> seedSelector,
Func<TAccumulate, TKey, TSource, TAccumulate> accumulator)
which lets you express CountBy
like so
source.AggregateBy(_ => 0, (count, _, _) => ++count, keySelector);
namespace System.Linq;
public static partial class Enumerable
{
public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
public static partial class Queryable
{
public static IQueryable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
this IQueryable<TSource> source,
Expression<Func<TSource, TKey>> keySelector,
IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
Bringing back for API review since the question of eager vs. delayed evaluation was brought up, see
https://github.com/dotnet/runtime/pull/91507#discussion_r1314846138
Proposed alternative API with eager semantics:
namespace System.Linq;
public static partial class Enumerable
{
public static IReadOnlyDictionary<TKey, int> CountBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
public static partial class Queryable
{
public static IQueryable<IReadOnlyDictionary<TKey, int>> CountBy<TSource, TKey>(
this IQueryable<TSource> source,
Expression<Func<TSource, TKey>> keySelector,
IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
After a long discussion we ended up back at the previously-approved shape, with a delayed evaluation.
namespace System.Linq;
public static partial class Enumerable
{
public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
public static partial class Queryable
{
public static IQueryable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
this IQueryable<TSource> source,
Expression<Func<TSource, TKey>> keySelector,
IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
For Max
/MaxBy
, Order
/OrderBy
, etc. the only difference is the predicate parameter, but they still do very similar work. You are trying to introduce a CountBy
which does something related, but still very different than Count
. A better name would be GroupCountBy
which would allow you to also have a GroupCount
method that would count duplicates.
This was discussed, however CountBy
is a fairly standard name in methods of this sort across ecosystems.
Draft implementation: alternative design 2
Background and motivation
Currently, it is not so easy to count occurrences of
IEnumerable<T>
elements with existing LINQ methods in C#.Motivation
GroupBy
for counting because it materializesGrouping
objects with a list of values related to each key..Count()
will not enumerate the sequence again.Other language examples
F#:
Python:
Kotlin:
Benchmark
Benchmark results:
Benchmark for case without repetitions:
Seq = Enumerable.Range(0, SequenceLength).ToArray();
API Proposal
API Usage
Find any most frequent element:
Find duplicates:
Alternative Designs
1. With separated result type, inspired by
ILookup
2. Same as previous, but without
TKey : notnull
constraintFind duplicates:
3.
The extension method for
IEnumerable<IGrouping<TKey, TValue>>
with internal knowledge aboutGroupingEnumerator
allowing to Count values without materializing lists in memory.Questions to decide:
Should we use
IEnumerable<KeyValuePair<TKey, TValue>>
as a return type or specific type instead for better naming, likeIEnumerable<(TKey Key, TCount Count)>
:Find duplicates:
Should we return
IEnumerable<>
for simplicity or go withIDictioinary<>/IReadOnlyDictionary<>/ICounts<>
to avoid copying? Specific return type could improve performance for ~x1.5 for cases when user needs a lookup. F# returns sequence, Kotlin — map, Python — special map-like typeShould
TKey
allownull
? For example,ToLookup
allownull
andDictionary
not.Risks
Looks like there is no C# method with the same name in
dotnet
organization projects. https://github.com/search?q=org%3Adotnet+CountBy&type=code