dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

[API Proposal]: Linq CountBy method #77716

Closed epeshk closed 1 year ago

epeshk commented 1 year ago

Draft implementation: alternative design 2

Background and motivation

Currently, it is not so easy to count occurrences of IEnumerable<T> elements with existing LINQ methods in C#.

var arr = new[]{1, 2, 2, 3, 3, 3};

Dictionary<int, int> counts = arr
  .GroupBy(x => x)
  .ToDictionary(x => x.Key, x => x.Count());

Motivation

Other language examples

F#:

let arr = [|1; 2; 2; 3; 3; 3|]
let counts: seq<int * int> = arr |> Seq.countBy id

Python:


from collections import Counter

arr = [1, 2, 2, 3, 3, 3]
counts: Counter = Counter(arr)

Kotlin:

val arr = arrayOf(1, 2, 2, 3, 3, 3)
val counts: Map<Int, Int> = arr.groupingBy { it }.eachCount()

Benchmark

Benchmark results:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19044.2130/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
Method Mean Error StdDev Gen0 Gen1 Allocated
Grouping 1,031.7 us 18.61 us 17.41 us 60.5469 39.0625 1017.17 KB
LookupDict 18.919 ms 0.3766 ms 0.6790 ms 1343.7500 1312.5000 937.5000 10.22 MB
Lookup 16.927 ms 0.3320 ms 0.4078 ms 625.0000 593.7500 218.7500 7.44 MB
Counting 673.5 us 5.99 us 5.31 us - - 9.52 KB
CountingCast 635.3 us 2.42 us 2.26 us - - 7.25 KB

Benchmark for case without repetitions: Seq = Enumerable.Range(0, SequenceLength).ToArray();

Method Mean Error StdDev Gen0 Gen1 Gen2 Allocated
Grouping 18.902 ms 0.3639 ms 0.4469 ms 1343.7500 1312.5000 937.5000 10.22 MB
LookupDict 22.138 ms 0.4110 ms 0.4037 ms 1468.7500 1437.5000 937.5000 16.15 MB
Lookup 14.548 ms 0.0455 ms 0.0380 ms 968.7500 953.1250 468.7500 10.39 MB
Counting 3.844 ms 0.0307 ms 0.0287 ms 1007.8125 992.1875 992.1875 4.21 MB
CountingCast 2.211 ms 0.0271 ms 0.0240 ms 652.3438 636.7188 636.7188 2.77 MB

API Proposal

namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;

    public static IEnumerable<KeyValuePair<TKey, long>> LongCountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

API Usage

var array = new []{1, 2, 2, 3, 3, 3};

var counts = array.CountBy(x => x);

foreach (var (key, count) in counts)
    Console.WriteLine($"{key}: {count}");

Find any most frequent element:

enumerable
    .CountBy(x => x)
    .MaxBy(x => x.Value)
    .Key;

Find duplicates:

enumerable
    .CountBy(x => x)
    .Where(x => x.Value > 1)
    .Select(x => x.Key);

Alternative Designs

1. With separated result type, inspired by ILookup

namespace System.Linq;
public interface ICounts<TKey, TCount> : IEnumerable<(TKey Key, TCount Count)> where TKey : notnull
{
    int KeyCount { get; }

    TCount this[TKey key] { get; }

    bool Contains(TKey key);
}

public static partial class Enumerable
{
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) where TKey : notnull;
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer) where TKey : notnull;
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) where TKey : notnull;
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer) where TKey : notnull;
}

2. Same as previous, but without TKey : notnull constraint

namespace System.Linq;
public interface ICounts<TKey, TCount> : IEnumerable<(TKey Key, TCount Count)>
{
    int KeyCount { get; }

    TCount this[TKey key] { get; }

    bool Contains(TKey key);
}

public static partial class Enumerable
{
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer);
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer);
}
Find duplicates:
ICounts<TKey, int> counts = enumerable.CountBy(x => x);
IEnumerable<TKey> duplicates = counts
    .Where(x => x.Count > 1)
    .Select(x => x.Key);

3.

The extension method for IEnumerable<IGrouping<TKey, TValue>> with internal knowledge about GroupingEnumerator allowing to Count values without materializing lists in memory.

array.GroupBy(x => x).Counts();

Questions to decide:

Risks

Looks like there is no C# method with the same name in dotnet organization projects. https://github.com/search?q=org%3Adotnet+CountBy&type=code

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-linq See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation Currently, it is not so easy to count occurrences of `IEnumerable` elements with existing LINQ methods in C#. ```csharp var arr = new[]{1, 2, 2, 3, 3, 3}; var counts = arr .GroupBy(x => x) .ToDictionary(x => x.Key, x => x.Count()); ``` ### Motivation - It is not efficient to use `GroupBy` for counting because it materializes `Grouping` objects with a list of values related to each key. - It is not intuitive that `.Count()` will not enumerate the sequence again. - Other popular languages provide simpler API for this ### Other language examples F#: ```fsharp let arr = [|1; 2; 2; 3; 3; 3|] let counts = arr |> Seq.countBy id ``` Python: ```py from collections import Counter arr = [1, 2, 2, 3, 3, 3] counts = Counter(arr) ``` Kotlin: ```kt val arr = arrayOf(1, 2, 2, 3, 3, 3) val counts = arr.groupingBy { it }.eachCount() ``` ### Benchmark [Benchmark](https://gist.github.com/epeshk/4afa1ecab272c76e25e2b0e6f29da232) results: ``` BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19044.2130/21H2/November2021Update) AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores .NET SDK=7.0.100-rc.2.22477.23 [Host] : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2 DefaultJob : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2 | Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated | |------------- |-----------:|---------:|---------:|--------:|--------:|-----------:| | Grouping | 1,031.7 us | 18.61 us | 17.41 us | 60.5469 | 39.0625 | 1017.17 KB | | Counting | 673.5 us | 5.99 us | 5.31 us | - | - | 9.52 KB | | CountingCast | 635.3 us | 2.42 us | 2.26 us | - | - | 7.25 KB | ``` ### API Proposal ```csharp namespace System.Collections.Generic; public static partial class Enumerable { public static IEnumerable> CountBy( this IEnumerable source, Func keySelector, IEqualityComparer? keyComparer = null) where TKey : notnull } ``` ### API Usage ```csharp var array = new []{1, 2, 2, 3, 3, 3}; var counts = array.CountBy(x => x); foreach (var (key, count) in counts) Console.WriteLine($"{key}: {count}"); ``` ### Alternative Designs // 1 ```csharp namespace System.Collections.Generic; public static partial class Enumerable { public static IEnumerable> LongCountBy(this IEnumerable source, Func keySelector, IEqualityComparer? keyComparer = null) where TKey : notnull } ``` // 2 The extension method for `IEnumerable>` with internal knowledge about `GroupingEnumerator` allowing to Count values without materializing lists in memory. ```csharp array.GroupBy(x => x).Counts(); ``` ### Risks Looks like there is no C# method with the same name in `dotnet` organization projects. https://github.com/search?q=org%3Adotnet+CountBy&type=code
Author: epeshk
Assignees: -
Labels: `api-suggestion`, `area-System.Linq`
Milestone: -
theodorzoulias commented 1 year ago

The CountBy LINQ operator already exists in the MoreLinq package, offering the same functionality with an identical API.

public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer);
epeshk commented 1 year ago

MinBy and MaxBy also existed in MoreLinq before adding to BCL

theodorzoulias commented 1 year ago

@epeshk yep, true. I am just pointing out that if someone wants this functionality today, it is offered by a respectable third-party library. No need to wait until the advent of .NET 8 (assuming that this API proposal will be approved).

krwq commented 1 year ago

I'm marking this as Future for now because we don't have this on our plans for now. If anyone is willing to contribute implementation I will change it to 8.0 which should give API review higher priority

epeshk commented 1 year ago

I want to contribute this feature and start working on it in https://github.com/epeshk/runtime/tree/countBy Should I open PR now, before API review?

eiriktsarpalis commented 1 year ago

Please hold until the issue is marked api-approved. Thanks!

epeshk commented 1 year ago

Please also note return type and nullability concerns. It is probably good to discuss and remove bad options before review.

I think good choices for the return type are (ordered by my preferences):

This method also may or may not accept nulls as TKey. An implementation without nulls would be simpler, but we have to think about it in terms of usability, and e.g. similar ToLookup accept nulls.

viceroypenguin commented 1 year ago

For source-level compatibility with MoreLinq and SuperLinq, it would be helpful to keep return type as either ICounts<TKey, int>, IEnumerable<(TKey Key, int Count)>, or IEnumerable<KeyValuePair<TKey, int>>. If this is done, then these libraries can simply disable their version of this operator starting in .net8 and consumers can be comfortable using the same code through multiple versions of .net/SuperLinq/MoreLinq.

viceroypenguin commented 1 year ago

Also, both MoreLinq and SuperLinq allow null values for TKey, if that impacts decision making at all.

KieranDevvs commented 1 year ago

Seeing as OrderBy and OrderByDescending had overloads to remove their predicate in the event that you were selecting the input as the key, should that be applied here?

x.CountBy();
x.CountBy(y => y.Key);
x.LongCountBy();
x.LongCountBy(y => y.Key);

Also, I feel like prefixing the methods with Long removes their discoverability in intelisense. and in documentation.

svick commented 1 year ago

@KieranDevvs Those versions of OrderBy also remove the By suffix, but Enumerable.Count() and Enumerable.LongCount() already exist and do something different than the CountBy proposed here.

And since this is a fairly niche method, I don't think it's important for it to have a more succinct way to write .CountBy(x => x).

@krwq @eiriktsarpalis Since @epeshk offered to contribute implementation, should this be moved back to the 8.0 milestone, to speed up API review?

WeihanLi commented 1 year ago

Any chance to get it in for the .NET 8 release?

eiriktsarpalis commented 1 year ago

@WeihanLi very unlikely at this point.

MizardX commented 1 year ago

Name suggestions:

eiriktsarpalis commented 1 year ago

A potential alternative is having an AggregateBy method:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source, 
    TAccumulate seed, 
    Func<TAccumulate, TKey, TSource, TAccumulate> func, 
    Func<TSource, TKey> keySelector);

which lets you express CountBy like so

source.AggregateBy(0, (count, _, _) => ++count, keySelector);
viceroypenguin commented 1 year ago

A potential alternative is having an AggregateBy method:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source, 
    TAccumulate seed, 
    Func<TAccumulate, TKey, TSource, TAccumulate> func, 
    Func<TSource, TKey> keySelector);

which lets you express CountBy like so

source.AggregateBy(0, (count, _, _) => ++count, keySelector);

I would suggest that the seed may be determined by the key, so I propose:

public static IEnumerable<(TKey key, TAccumulate result)> AggregateBy<TSource, TKey, TAccumulate>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, TAccumulate> seedSelector,
    Func<TAccumulate, TKey, TSource, TAccumulate> accumulator)

which lets you express CountBy like so

source.AggregateBy(_ => 0, (count, _, _) => ++count, keySelector);
bartonjs commented 1 year ago

Video

namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
eiriktsarpalis commented 1 year ago

Bringing back for API review since the question of eager vs. delayed evaluation was brought up, see

https://github.com/dotnet/runtime/pull/91507#discussion_r1314846138

Proposed alternative API with eager semantics:

namespace System.Linq;

public static partial class Enumerable
{
    public static IReadOnlyDictionary<TKey, int> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<IReadOnlyDictionary<TKey, int>> CountBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
bartonjs commented 1 year ago

After a long discussion we ended up back at the previously-approved shape, with a delayed evaluation.

namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}
MizardX commented 1 year ago

For Max/MaxBy, Order/OrderBy, etc. the only difference is the predicate parameter, but they still do very similar work. You are trying to introduce a CountBy which does something related, but still very different than Count. A better name would be GroupCountBy which would allow you to also have a GroupCount method that would count duplicates.

eiriktsarpalis commented 1 year ago

This was discussed, however CountBy is a fairly standard name in methods of this sort across ecosystems.