dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.45k stars 4.76k forks source link

[API Proposal]: The Except and ExceptBy methods #110121

Closed Mr0N closed 3 days ago

Mr0N commented 3 days ago

Background and motivation

I propose adding an overload to the Except and ExceptBy methods that allows for removing elements from the provided array without removing duplicates.

API Proposal

IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    var set = new HashSet<TSource>(second);

    foreach (TSource element in first)
    {

        if (!set.Contains(element))
        {
            yield return element;
        }
    }
}

API Usage

using System.ComponentModel;
using System.Linq;

var range = new int[] { 1 };
var rangeEx = new int[] { 1, 2, 3, 4, 5 ,2,5};

var result = rangeEx.ExceptBy(range, a => a).ToArray();
Write(result);
var write = rangeEx.Except(range);
Write(write.ToArray());

var iter = ExceptIterator(rangeEx, range);
Write(iter.ToArray());
Console.ReadKey();

void Write(params int[] values)
{
    Console.WriteLine($"Array:{string.Join(',',values)}");
}
IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    var set = new HashSet<TSource>(second);

    foreach (TSource element in first)
    {

        if (!set.Contains(element))
        {
            yield return element;
        }
    }
}

Array:2,3,4,5 Array:2,3,4,5 Array:2,3,4,5,2,5

Alternative Designs

No response

Risks

No response

dotnet-policy-service[bot] commented 3 days ago

Tagging subscribers to this area: @dotnet/area-system-linq See info in area-owners.md if you want to be subscribed.

Clockwork-Muse commented 3 days ago

nit: Except (along with Union and Intersect) are terms that come from set math, so any method wouldn't/shouldn't be named this way.

For the rare cases where something like this is needed, most people can probably use the following extension method:

IEnumerable<TSource> WhereNotIn<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    var set = new HashSet<TSource>(second);

    return first.Where(e => !set.Contains(e));
}
eiriktsarpalis commented 3 days ago

Duplicate of #107338, https://github.com/dotnet/runtime/issues/75066, and https://github.com/dotnet/runtime/issues/102112.

Clockwork-Muse commented 3 days ago

@eiriktsarpalis - This isn't complaining about the existing behavior, so not a strict duplicate?

Mr0N commented 3 days ago

I’m not suggesting changing the logic of the existing method, but rather adding an overload where you can choose whether to remove duplicates. For example, it could be implemented like this:

public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, bool removeDuplicate = true)

`

eiriktsarpalis commented 3 days ago

Adding boolean flags that fundamentally change the semantics of a function isn't considered good practice, and is definitely not something we've done before in LINQ. For better or for worse Intersect and Except reflect the semantics of SQL, and it would be more appropriate to consider separate method groups for that functionality. @Clockwork-Muse's use of WhereIn and WhereNotIn seems more appropriate in that regard, but I think these would need to be filed as a distinct proposal.

Mr0N commented 3 days ago

Adding boolean flags that fundamentally change the semantics of a function isn't considered good practice, and is definitely not something we've done before in LINQ. For better or for worse Intersect and Except reflect the semantics of SQL, and it would be more appropriate to consider separate method groups for that functionality. @Clockwork-Muse's use of WhereIn and WhereNotIn seems more appropriate in that regard, but I think these would need to be filed as a distinct proposal.

Well, yes, the SQL method is more understandable, but for C#, it's not immediately clear how to use this method. In SQL, you can use a LEFT JOIN to quickly find the difference between two tables, but in C#, it's quite difficult to achieve, especially in terms of performance.

It would be great if there were some method that could work between two collections and leverage all possible optimizations to speed up the process, similar to what SQL offers.

Mr0N commented 3 days ago

Without losing extra data

Clockwork-Muse commented 2 days ago

but in C#, it's quite difficult to achieve, especially in terms of performance

It would be great if there were some method that could work between two collections and leverage all possible optimizations to speed up the process, similar to what SQL offers.

This is mostly because RDBMSs put a lot more work into their dynamic optimizers than would be reasonable for C#. If you're dealing with a large enough dataset to actually affect program performance, stick it into an actual database (especially because chances are you're going to want to do more things with it).


In SQL, you can use a LEFT JOIN to quickly find the difference between two tables

... I really wish (the iSeries version of) DB2's EXCEPTION JOIN would be added to the standard....

Mr0N commented 1 day ago

It would be good to implement such an algorithm in C#, and then use it to separate one collection from another. It seems like a complex algorithm at first glance.

https://en.wikipedia.org/wiki/Sort-merge_join

chat gpt:

Merge Join (Merging Join)

The Merge Join is an algorithm used for efficiently executing a JOIN operation, especially when both tables are already sorted by the column(s) involved in the join. Its key advantage lies in its linear traversal of rows, making it highly performant for large, sorted datasets.


How Does Merge Join Work?

Input Requirements:

Data Comparison:

  1. The algorithm starts with the first row of each table.
  2. It compares the join keys:
    • If the key in one table is smaller, the pointer for that table advances to the next row.
    • If the keys match, a result row is produced, and both pointers advance.
    • This continues until all rows are processed.

Example

Table A:

ID | Name -- | -- 1 | John 2 | Alice 3 | Bob

This behavior is crucial for accurate join results but may increase the size of the output significantly.


Advantages of Merge Join

  1. Efficient for Sorted Data: Works linearly, O(n+m), when data is pre-sorted.
  2. Scalable for Large Datasets: Ideal for large tables with indexes on join columns.
  3. Low Memory Overhead: Requires minimal additional memory compared to Hash Join.

Limitations

  1. Sorting Overhead: If the input data isn’t sorted, the pre-sorting step can be expensive, O(nlogn).
  2. Equality-Only Conditions: Only supports joins based on equality (=). For conditions like >, <, or !=, other join algorithms are needed.