HashSet.GetOrAdd(T) - Githubissues

jnm2 commented 7 years ago

There are many times when there is a custom equality comparison being done but I need access to the original instance.

What I'd like:

var mergedItems = new HashSet<Item>();

foreach (var item in items)
{
    mergedItems.GetOrAdd(item).RelatedInfo.Add(x);
}

public sealed class Item : IEquatable<Item>
{
    public DateTime Date { get; }
    public int Amount { get; }
    public List<int> RelatedInfo { get; }

    public bool Equals(Item other) => other != null && other.Date == Date && other.Amount == Amount;

    public override bool Equals(object obj) => obj is Item item && Equals(item);

    public override int GetHashCode() => // ...
}

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason. A dictionary is also wasteful because it has to do two hash lookups, one for TryGetValue and one for Add:

var mergedItems = new Dictionary<Item, Item>();

foreach (var item in items)
{
    if (!mergedItems.TryGetValue(item, out var otherItem))
        mergedItems.Add(item, otherItem = item);
    otherItem.RelatedInfo.Add(x);
}

This is no better; it's a lot more to maintain, unnecessarily coupled to the definition of equality, it's still duplicate storage of the key, and it's still duplicating the number of hash lookups:

var mergedItems = new Dictionary<(DateTime, int), Item>();

foreach (var item in items)
{
    if (!mergedItems.TryGetValue((item.Date, item.Amount), out var otherItem))
        mergedItems.Add((item.Date, item.Amount), otherItem = item);
    otherItem.RelatedInfo.Add(x);
}

KeyedCollection is the worst of all because in addition to the dictionary, it keeps a list I'll never use.

Proposed API

namespace System.Collections.Generic;

public class HashSet<T>
{
+   /// <summary>
+   /// Searches the set for a given element and returns the equal element it finds, if any;
+   /// otherwise, adds the element to the set.
+   /// </summary>
+   /// <param name="item">The element to search for, or to add to the set.</param>
+   /// <returns>
+   /// The equal element that the search found, or the given element that was added to the set.
+   /// </returns>
+   public T GetOrAdd(T item);
}

jnm2 commented 7 years ago

Similar to https://github.com/dotnet/corefx/issues/11291 but not exactly a duplicate, since I don't want a generic parameter for the key or any other distinctions between keys and values.

svick commented 7 years ago

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.

You're not, if you're on .Net Core 2.0, you can use HashSet<T>.TryGetValue() (https://github.com/dotnet/corefx/issues/15691), though it still requires two lookups.

On one hand, the addition of TryGetValue indicates that retrieving a value from a HashSet based on a equal, but distinct value makes sense. On the other hand, it means adding GetOrAdd is less important: it improves performance and convenience, but doesn't add new capabilities.

jnm2 commented 7 years ago

Sweet! That was the thread I couldn't find! So implementing GetOrAdd should be a piece of cake. @TylerBrinkley? :D

danmoseley commented 7 years ago

@jnm2 there is a code size cost to adding new API to commonly instantiated generic types which means there is a higher bar for adding them. Consensus we achieved is here https://github.com/dotnet/corefx/issues/17275#issuecomment-291636863 Given this one could be achieved by an extension method, that might be the most palatable approach.,

jnm2 commented 7 years ago

The extension method still has a downside, which is the duplicate hashtable lookup. In that sense, this is a fundamental operation. Too bad it can't be the other way around, other instance methods like Contains as extension methods implemented in terms of the fundamentals.

TylerBrinkley commented 7 years ago

I think this one falls under the first criteria @terrajobst set for #17275 which is

CONSIDER making rare methods on generic types extension methods to avoid forcing the code gen to instantiate all these methods

While there is a performance hit of 2 lookups I don't think this method would be used all that much.

kellypleahy commented 7 years ago

If we're going to do this, can we also have a GetOrAdd that takes the factory function? This is something that seems also quite valuable? Also, can we put these on Dictionary<TKey,TValue> (as extension methods) if they don't already exist too? Is there already a ticket for Dictionary<TKey,TValue>?

danmoseley commented 6 years ago

@jnm2 can you clarify how this is not satisfied by dotnet/runtime#20073 ? I'm missing it.

jnm2 commented 6 years ago

@danmosemsft dotnet/runtime#20073 doesn't provide any way to do GetOrAdd without a duplicate hash lookup, that I can see. It enables us to do this, but in the 'add' scenario, there is duplicate work being done in the calls to TryGetValue and Add:

public static T GetOrAdd<T>(this HashSet<T> hashSet, T equalValue)
{
    if (hashSet.TryGetValue(equalValue, out var actualValue))
    {
        return actualValue;
    }

    hashSet.Add(equalValue);
    return equalValue;
}

@TylerBrinkley makes the point above that the performance hit of these duplicate lookups may not be important enough to add the instance method (requiring a bit more JIT time per generic instantiation).

There is indeed precedent for preferring extension methods at the cost of duplicate hash lookups:

https://github.com/dotnet/corefx/blob/b9ce86fbec92349c203a4c33805cffb8ca24c10f/src/System.Collections/src/System/Collections/Generic/CollectionExtensions.cs#L25-L39

However, see the opposite point being made by @jamesqo in the Dictionary.GetOrAdd issue: https://github.com/dotnet/corefx/issues/2877#issuecomment-319244808

danmoseley commented 6 years ago

Ah yes i forgot we made an extension method. Given that decision it sounds like this issue can be closed? The cost of growing commonly used generic types is the code size. Dictionary is one of the most sensitive, a small change can grow hello world size significantly.

jnm2 commented 6 years ago

@danmosemsft dotnet/runtime#15059 is about Dictionary.GetOrAdd; this one is about HashSet.GetOrAdd. As far as I'm aware there is no extension method for GetOrAdd for either one.

kellypleahy commented 6 years ago

@danmosemsft I'm a bit confused here. Does this really have such a big impact on code size? I think I'd be more worried personally about the performance impact of the second lookup than a tiny bit more code for each generic instantiation, but I guess that's just me (and I guess it really depends on whether people will actually use this instance method on hot code paths). Am I understanding properly that the impact would be an additional JITed function and an additional vtable entry (one each per generic instantiation, not per instance of objects), and the cost to perform the JIT on first use of said function?

danmoseley commented 6 years ago

@jnm2 my bad, there's of course two asks here, one was "a custom equality comparison being done but I need access to the original instance." which seems now satisfied by HashSet<T>.TryGetValue (as already observed above) and the other is to be able to "GetOrAdd", which as you say is not currently satisfied, although on Dictionary<K,V> there is TryAdd it doesn't return the value.

As for the cost of adding to the type itself (in this case, necessary to avoid dual lookups) it is not the JIT time but the code duplication. Dictionary<K,V> is instantiated many times in a typical app - once for reference types, then once for (I guess - not my area of expertise) each sized value type, and that can add up significantly. So several proposed additions to the type were either abandoned or moved to extension methods (which is unfortunate if like in this case it would be less efficient). I would expect this to be a lesser concern for HashSet since it's presumably instantiated for fewer types.

@jkotas can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?

jkotas commented 6 years ago

can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?

I have been looking at the code size vs. benefit. I do not think I have said a universal number on how much it is ok to add.

danmoseley commented 6 years ago

I meant, if I add a new 10-20 line method to Dictionary, as a rule of thumb how much does that increase the size of a typical app. I believe you gave a number but I cannot find it.

eiriktsarpalis commented 4 years ago

I feel like we should close this. Given we're going down the CollectionsMarshal route for #15059, we should probably consider a similar approach for HashSet. cc @GrabYourPitchforks

eiriktsarpalis commented 1 year ago

For the record, using CollectionsMarshal for HashSet was proven to be not viable because you can't do ref returns on keys:

https://github.com/dotnet/runtime/issues/82840#issuecomment-1450463397

That being said, consensus is that a GetOrAdd method for HashSet is not particularly valuable compared to just using HashSet.Add except for the rare scenario where you need to access the actual instance stored in the HashSet:

https://github.com/dotnet/runtime/issues/82875#issuecomment-1452036878

theodorzoulias commented 6 months ago

I think that the proposal has this specific form:

/// <summary>
/// Searches the set for a given element and returns the equal element it finds, if any;
/// otherwise, adds the element to the set.
/// </summary>
/// <param name="item">The element to search for, or to add to the set.</param>
/// <returns>
/// The equal element that the search found, or the given element that was added to the set.
/// </returns>
public T GetOrAdd(T item);

@jnm2 is it possible to include it in your original proposal (first comment)?

jnm2 commented 6 months ago

@theodorzoulias Since this issue is closed, I'm not sure it will make any difference.

theodorzoulias commented 6 months ago

@jnm2 this proposal is about an API that some people would consider fundamental (myself included). In order to give this API a proper chance, the proposal should be as formal as possible. Which means that the signature of the proposed API should be placed in a prominent place inside the proposal. The developers who are coming to this page should be able to see at a first glance what is the API they are voting for.

eiriktsarpalis commented 6 months ago

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.

A dictionary is also wasteful because it has to do two hash lookups, one for TryGetValue and one for Add:

Does the introduction of CollectionsMarshal.GetValueRefOrAddDefault for dictionaries change the calculus here at all? The proposed GetOrAdd method forces you to create a new T regardless of whether one already exists whereas GetValueRefOrAddDefault does it on-demand with a single lookup. I'll also say that it avoids having to define equality semantics for a type in a way that is clearly only meaningful when used in the context of lookups.

theodorzoulias commented 6 months ago

The proposed GetOrAdd method forces you to create a new T regardless of whether one already exists

This is not a problem if the T is a struct, but if it's a class then you have a point. Allocating a superfluous T object might be more costly than doing two lookups with the TryGetValue+Add. To solve this problem two more GetOrAdd overloads might be needed:

public T GetOrAdd(T item, Func<T> itemFactory);
public T GetOrAdd<TArg>(T item, Func<TArg, T> itemFactory, TArg factoryArgument);

The item will be used only for searching the set, and it can be a cashed (reusable) dummy object with only the "key" state initialized (assuming that the "key" state is mutable). In case the search fails, a new item will be generated by invoking the itemFactory delegate, and then added in the set.

eiriktsarpalis commented 6 months ago

In which case two more GetOrAdd overloads might be needed:

The problem as usual with popular generic types like HashSet<T> is that more methods contribute to larger static footprints. As a general rule we don't add them unless they address a fundamental issue users are facing. Frankly, I don't see a particularly compelling use case here.

theodorzoulias commented 6 months ago

Sure. On the other hand it would be nice if we had an idea about how much the static footprint is increased by each additional method in the HashSet<T> collection. Without having a clue about what the increment is, we can't really respond to this argument.

eiriktsarpalis commented 6 months ago

A rule of thumb is multiplying the size of the added method by the number of generic instantiations (modulo reference types that reuse the same instantiation). It can be measured precisely, for example using a tool like sizoscope in Native AOT artifacts.

theodorzoulias commented 6 months ago

The number of generic instantiations should be a lot smaller for a HashSet<T> than a Dictionary<K,V>. The former has only one generic parameter, while the later has two, and (I assume) there is one generic instantiation for each combination of TKey and TValue. Something that confuses me when talking about generic instantiations is, are we talking about the generic instances of the HashSet<T> in my own program only, or we must also include the generic instances that the .NET runtime creates internally for its own inner workings?

I am not very familiar with the profiling tools that come with .NET, so I hope that someone who does could use them and give us a concrete number about the size cost of each additional method in the HashSet<T> collection.

jkotas commented 6 months ago

are we talking about the generic instances of the HashSet in my own program only, or we must also include the generic instances that the .NET runtime creates internally for its own inner workings?

All instantiations used by the app, the app dependencies and the runtime itself for its inner working.

theodorzoulias commented 6 months ago

@jkotas does the .NET runtime uses the HashSet<T> a lot? If yes, why so? Shouldn't the developers of the runtime use highly optimized internal collections (without null checking, version checking etc), instead of the public collections? This way the evolution of the public collections would not be impeded, by increasing needlessly their generic instantiations.

jkotas commented 6 months ago

The core runtime is not the problem. The problem are all libraries out there, including libraries that happen to live in the dotnet/runtime repo. We try to use the public types, including collections, in the implementation of dotnet/runtime libraries as much as possible. The public collection types have good tradeoff between usability, performance and security, and they are the right choice in vast majority of cases.

Here are some data about popularity of different collection types: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#list

theodorzoulias commented 6 months ago

@jkotas thanks for the link. According to the diagram, the Dictionary<K,V> is about 3.5 times more popular than the HashSet<T>. This diagram shows the number of references though, not the number of distinct generic instantiations, and I am not sure that we can make an estimation about the later based on the former. Especially if we take into account that the two collections differ in the number of generic arguments. My wild guess is that the Dictionary<K,V> might have 10 times, or more, generic instantiations than the HashSet<T>.

We try to use the public types, including collections, in the implementation of dotnet/runtime libraries as much as possible.

And aren't you worried about the negative impact this decision has on to ability of the public collections to evolve? The more you use them, the less you can improve them. It's like a bad romance story where Romeo says to Giulietta "The more I love you, the more ugly and incomplete I want you to be." I can understand that there can be many reasons to prefer using the public collections over internal ones (less code duplication, faster discovery of bugs etc), but the negative effects are becoming worse and worse. Each new API that is approved makes all other API proposals less likely to be approved.

The HashSet<T> is equipped with quite a lot of specialized methods that most likely are not used too often. Like the IsProperSubsetOf, IsProperSupersetOf, SymmetricExceptWith, UnionWith etc. Could these methods be moved in an auxiliary internal class, so that they are not instantiated generically unless there is code that actually uses them? This way we could make some space for some new useful methods to be added. The GetOrAdd is such a method, because it allows to do efficiently a fundamental operation on the collection. I think that most people would agree that in an ideal world, all fundamental operations in hashed collections would involve a single hashing of the key.

jkotas commented 6 months ago

And aren't you worried about the negative impact this decision has on to ability of the public collection to evolve?

I do not think that dotnet/runtime libraries are particularly special. I think that it is an anti-pattern to do one thing in dotnet/runtime libraries and tell other library developers to do a different thing.

Each new API that is approved makes all other API proposals less likely to be approved.

This is true in general, even outside this issue. Every type in the BCL can only bear finite level of complexity. Each new API on a given type eats the complexity budget that makes the additional API less likely to be approved.

theodorzoulias commented 6 months ago

Every type in the BCL can only bear finite level of complexity. Each new API on a given type eats the complexity budget that makes the additional API less likely to be approved.

Sure. Each new API makes all other existing APIs less discoverable, increases the mental burden for the developer, increases the size of the documentation, elongates the vertical scrollbar in the intellisence etc. These are valid reasons to be conservative when it comes with approving new APIs. But the size of the static footprint of applications sounds like a technicality that shouldn't affect the decision process. And frankly it's quite depressing that it is, and it comes so often on the table, and it impacts the decisions so greatly.

jnm2 commented 6 months ago

@theodorzoulias I edited it in.

CyrusNajmabadi commented 6 months ago

But the size of the static footprint of applications sounds like a technicality

API design is a technical process. :)

If performance and technical issues are not a concern, the then suggestion to use a dictionary here is entirely valid. But just as a dictionary adds costs some people aren't willing to bear, the runtime similarly has even more intense performance concerns it is impacted by when considering solutions here.

eiriktsarpalis commented 6 months ago

But the size of the static footprint of applications sounds like a technicality

The obvious solution would be to add the method on CollectionsMarshal, which was created precisely to address the size concern. Even so, I don't see a particularly strong use case for this method for the reasons mentioned earlier.

theodorzoulias commented 6 months ago

@eiriktsarpalis having to do two hashings and two lookups for an operation that is known from other collections to be singular and atomic, is a pretty good indicator that something is missing from the public API of the HashSet<T> collection.

The GetOrAdd method is part of the ConcurrentDictionary<K,V> collection, and it's very useful there. It's not part of the Dictionary<K,V>, although it would be very useful as well, because of the static size concerns, but at least it can be implemented efficiently via the CollectionsMarshal route. It's not part of the HashSet<T>, because it would be useful in far fewer scenarios, but its presence would be very much appreciated in these specialized scenarios. It can't be implemented efficiently, because there is no CollectionsMarshal support for the HashSet<T>. Do you think that we should revive the idea of adding CollectionsMarshal support for the HashSet<T>, although it would be trivial for the developers to misuse this API and shoot themselves in the foot? Personally I am OK with this idea, because I think that if a developer is brave enough to venture the obscure System.Runtime.InteropServices namespace, they are probably also careful enough to use the sensitive APIs correctly.

Would you consider unlocking and reopening the #82840 issue?

eiriktsarpalis commented 6 months ago

Do you think that we should https://github.com/dotnet/runtime/issues/82840 of adding CollectionsMarshal support for the HashSet, although it would be trivial for the developers to misuse this API and shoot themselves in the foot?

It's not an issue of the API being potentially misused (which is a known concern for most CollectionsMarshal APIs) but that it is fundamentally flawed: the returned ref T value represents the key of the hash table, so making any nontrivial mutations to it is bound to corrupt the collection.

eiriktsarpalis commented 6 months ago

Let me try to rephrase what I believe is problematic with this group of proposals:

When modelling a HashSet<T> where T has a certain definition of equality it is good practice to regard t1 : T and t2 : T as interchangeable (modulo mutability concerns) whenever t1.Equals(t2). This invariant is observed in all definitions of equality that the framework ships regardless of whether it uses reference or structural equality. Because T is the key of HashSet<T> you can't actually look it up unless you already have a T. But if you already have the T then

There is no need to bother with delayed initialization factories such as the ones declared here and
There is no need to retrieve the underlying stored T if it exists because the two should be interchangeable. This implies that bool Contains(T item) and bool Add(T item) are the only APIs needed to interrogate the hash set.

In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.

theodorzoulias commented 6 months ago

2. There is no need to retrieve the underlying stored T if it exists because the two should be interchangeable. This implies that bool Contains(T item) and bool Add(T item) are the only APIs needed to interrogate the hash set.

Alas the HashSet<T> is already equipped with the method TryGetValue:

public bool TryGetValue (T equalValue, out T actualValue);

The mere existence of this API approves creative uses of the collection, where two Ts that are equal for the purpose of searching them are not equal for other purposes. The API proposed here, GetOrAdd, builds on this precedence.

It's not an issue of the API being potentially misused (which is a known concern for most CollectionsMarshal APIs) but that it is fundamentally flawed: the returned ref T value represents the key of the hash table, so making any nontrivial mutations to it is bound to corrupt the collection.

We are saying the same thing. The point is, can the CollectionsMarshal APIs proposed in #82840 be used effectively for implementing the GetOrAdd, AddOrUpdate, TryUpdate or other advanced APIs that some developers need for their specialized purposes? I think that they do. If the #82840 is reopened I'm probably going to replace the unfortunate "API Usage" that I posted initially with an efficient GetOrAdd implementation, and we can continue the discussion there.

eiriktsarpalis commented 6 months ago

The mere existence of this API approves creative uses of the collection

I don't think the presence of an API that somehow got shipped in the past refutes the point I'm trying to make. Mistakes happen from time to time, it shouldn't be carte blanche for us to follow up with more mistakes.

If the https://github.com/dotnet/runtime/issues/82840 is reopened I'm probably going to replace the unfortunate "API Usage" that I posted initially with an efficient GetOrAdd implementation, and we can continue the discussion there.

Which GetOrAdd implementation are you referring to?

theodorzoulias commented 6 months ago

I don't think the presence of an API that somehow got shipped in the past refutes the point I'm trying to make. Mistakes happen from time to time, it shouldn't be carte blanche for us to follow up with more mistakes.

Of course it does, unless you can prove that approving the TryGetValue was a mistake. That API was proposed in the #20073, and the approval procedure included 11 participants. Proving that all these people participated in a mistake is not a trivial task! 🙂

Which GetOrAdd implementation are you referring to?

The one that I'll try to come up with, if the #82840 is reopened. Currently it exists nowhere. Would you prefer to post it here instead?

theodorzoulias commented 6 months ago

@eiriktsarpalis the moment that purity went out of the window was when the HashSet<T> obtained the ability to compare its contents for equality using a non-default comparer, in other words when the constructor HashSet<T>(IEqualityComparer<T>))) was added to the collection. This constructor was added in .NET Framework 3.5, which is the same release that introduced the HashSet<T> collection. So the collection was never pure. The developers were able to use non-standard equality definitions from day one. The notion "two Ts that are equal by one definition are equal by all definitions" never existed for the HashSet<T>.

kellypleahy commented 6 months ago

Let me try to rephrase what I believe is problematic with this group of proposals:

When modelling a HashSet<T> where T has a certain definition of equality it is good practice to regard t1 : T and t2 : T as interchangeable (modulo mutability concerns) whenever t1.Equals(t2). This invariant is observed in all definitions of equality that the framework ships regardless of whether it uses reference or structural equality. Because T is the key of HashSet<T> you can't actually look it up unless you already have a T. But if you already have the T then

There is no need to bother with delayed initialization factories such as the ones declared here and

There is no need to retrieve the underlying stored T if it exists because the two should be interchangeable. This implies that bool Contains(T item) and bool Add(T item) are the only APIs needed to interrogate the hash set.

In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.

I think this is a very narrow view of the HashSet value. I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory. This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance. That said, I'm not sure I understand why TryGetValue doesn't do what is needed.

eiriktsarpalis commented 6 months ago

Proving that all these people participated in a mistake is not a trivial task! 🙂

I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂

The notion "two Ts that are equal by one definition are equal by all definitions" never existed for the HashSet<T>.

I never claimed that types have a unique notion of equality, what I did say is that under a given definition of equality two equal values should be considered equivalent. IEqualityComparer<T> is a tool that lets you override built-in equality, but it shouldn't invalidate that principle (in practice this can be abused).

I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory.

Same here, but the mere act of deduplication implies that you're happy to discard references up to that notion of equality, which suggests that these references are equivalent in your domain.

This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance.

How would a GetOrAdd method improve performance in a deduplication method?

theodorzoulias commented 6 months ago

Proving that all these people participated in a mistake is not a trivial task! 🙂

I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂

@eiriktsarpalis since you are arguing against the proposed GetOrAdd API from a theoretical/philosophical/academic standpoint, and since this API is building on the precedence of the existing TryGetValue API, I think that it would be intellectually consistent to argue against both APIs. Either both are mistakes, or none. If you think that the TryGetValue was approved incorrectly, it's still possible to mitigate this mistake by obsoleting the API, hiding it from the tools, updating the documentation with warnings against using it etc.

I think that the GetOrAdd could be implemented with the help of the CollectionsMarshal like this:

public static T GetOrAdd<T>(this HashSet<T> source, T item)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    // In case the item is not found, it has been added to the set and assigned to the itemRef.
    return itemRef;
}

Although the CollectionsMarshal.GetItemRefOrAdd (as proposed in #82840) is unsafe, the above GetOrAdd implementation is not. The caller of the GetOrAdd cannot misuse it in a way that would corrupt the HashSet<T>. The same cannot be said for the factory-based GetOrAdd:

public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        itemRef = item;
    }
    return itemRef;
}

This one transfers the responsibility of correctness to the caller. It's up to the itemFactory to return a T equal to the original T. Checking for equality in release mode would defeat the whole purpose of this API, which is to have comparable performance with a single call to TryGetValue or a single call to Add.

kellypleahy commented 6 months ago

Although the CollectionsMarshal.GetItemRefOrAdd (as proposed in #82840) is unsafe, the above GetOrAdd implementation is not. The caller of the GetOrAdd cannot misuse it in a way that would corrupt the HashSet<T>. The same cannot be said for the factory-based GetOrAdd:
public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        Debug.Assert(source.Comparer.GetHashCode(item) == source.Comparer.GetHashCode(itemRef));
        itemRef = item;
    }
    return itemRef;
}
This one transfers the responsibility of correctness to the caller. It's up to the itemFactory to return a T equal to the original T. Checking for equality in release mode would defeat the whole purpose of this API, which is to avoid hashing the T element more than once.

I cannot for the life of me understand why the Factory version of this GetOrAdd for HashSet would be useful. If you've already constructed the T, why do you need a factory to help you create ANOTHER one? I totally understand the value of GetOrAdd(item) though, as it serves the purpose I mentioned earlier.

@eiriktsarpalis - the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

theodorzoulias commented 6 months ago

@kellypleahy the purpose of the factory-based GetOrAdd is to avoid an allocation in case the T is reference type. For context see this previous comment.

TylerBrinkley commented 6 months ago

I think the difference here with TryGetValue vs GetOrAdd is that TryGetValue unlocked functionality that was not possible before but this is only for performance and convenience reasons which doesn't seem as compelling given the static footprint costs.

kellypleahy commented 6 months ago

@kellypleahy the purpose of the factory-based GetOrAdd is to avoid an allocation in case the T is reference type. For context see this previous comment.

Hmm... that one seems really bad to me, but I guess that's just my opinion. I'm 100% +1 on GetOrAdd(item). Much more dubious on GetOrAdd<TArg>(item, Func<TArg, T> factory, TArg factoryArg). I can only imagine that being confusing at best. If you really need to "initialize" the object that you passed to item, I think it should be reasonably easy for you to know that the item you passed was the one added (maybe object.ReferenceEquals() or something). I do see the use case you are talking about, but I feel like it would be mostly just confusing to callers and handles a VERY rare case and may not even provide a performance boost at all over the double lookups/insert (I'm not sure I have the brainpower to confirm or deny a benefit at the moment).

theodorzoulias commented 6 months ago

@TylerBrinkley in case you have any idea about what is the static footprint cost of adding a method in the HashSet<T> collection, please share it because I would really love to know. Otherwise, how are you able to make a judgment? On the one hand we have performance and convenience, and on the other hand we have a mystery cost. That's hardly enough data for judging anything.

dotnet / runtime

HashSet.GetOrAdd(T) #21603

Proposed API