Closed jnm2 closed 3 years ago
Similar to https://github.com/dotnet/corefx/issues/11291 but not exactly a duplicate, since I don't want a generic parameter for the key or any other distinctions between keys and values.
Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.
You're not, if you're on .Net Core 2.0, you can use HashSet<T>.TryGetValue()
(https://github.com/dotnet/corefx/issues/15691), though it still requires two lookups.
On one hand, the addition of TryGetValue
indicates that retrieving a value from a HashSet
based on a equal, but distinct value makes sense. On the other hand, it means adding GetOrAdd
is less important: it improves performance and convenience, but doesn't add new capabilities.
Sweet! That was the thread I couldn't find! So implementing GetOrAdd should be a piece of cake. @TylerBrinkley? :D
@jnm2 there is a code size cost to adding new API to commonly instantiated generic types which means there is a higher bar for adding them. Consensus we achieved is here https://github.com/dotnet/corefx/issues/17275#issuecomment-291636863 Given this one could be achieved by an extension method, that might be the most palatable approach.,
The extension method still has a downside, which is the duplicate hashtable lookup. In that sense, this is a fundamental operation. Too bad it can't be the other way around, other instance methods like Contains as extension methods implemented in terms of the fundamentals.
I think this one falls under the first criteria @terrajobst set for #17275 which is
- CONSIDER making rare methods on generic types extension methods to avoid forcing the code gen to instantiate all these methods
While there is a performance hit of 2 lookups I don't think this method would be used all that much.
If we're going to do this, can we also have a GetOrAdd
that takes the factory function? This is something that seems also quite valuable? Also, can we put these on Dictionary<TKey,TValue>
(as extension methods) if they don't already exist too? Is there already a ticket for Dictionary<TKey,TValue>
?
@jnm2 can you clarify how this is not satisfied by dotnet/runtime#20073 ? I'm missing it.
@danmosemsft dotnet/runtime#20073 doesn't provide any way to do GetOrAdd
without a duplicate hash lookup, that I can see. It enables us to do this, but in the 'add' scenario, there is duplicate work being done in the calls to TryGetValue
and Add
:
public static T GetOrAdd<T>(this HashSet<T> hashSet, T equalValue)
{
if (hashSet.TryGetValue(equalValue, out var actualValue))
{
return actualValue;
}
hashSet.Add(equalValue);
return equalValue;
}
@TylerBrinkley makes the point above that the performance hit of these duplicate lookups may not be important enough to add the instance method (requiring a bit more JIT time per generic instantiation).
There is indeed precedent for preferring extension methods at the cost of duplicate hash lookups:
However, see the opposite point being made by @jamesqo in the Dictionary.GetOrAdd issue: https://github.com/dotnet/corefx/issues/2877#issuecomment-319244808
Ah yes i forgot we made an extension method. Given that decision it sounds like this issue can be closed? The cost of growing commonly used generic types is the code size. Dictionary is one of the most sensitive, a small change can grow hello world size significantly.
@danmosemsft dotnet/runtime#15059 is about Dictionary.GetOrAdd; this one is about HashSet.GetOrAdd. As far as I'm aware there is no extension method for GetOrAdd for either one.
@danmosemsft I'm a bit confused here. Does this really have such a big impact on code size? I think I'd be more worried personally about the performance impact of the second lookup than a tiny bit more code for each generic instantiation, but I guess that's just me (and I guess it really depends on whether people will actually use this instance method on hot code paths). Am I understanding properly that the impact would be an additional JITed function and an additional vtable entry (one each per generic instantiation, not per instance of objects), and the cost to perform the JIT on first use of said function?
@jnm2 my bad, there's of course two asks here, one was "a custom equality comparison being done but I need access to the original instance." which seems now satisfied by HashSet<T>.TryGetValue
(as already observed above) and the other is to be able to "GetOrAdd", which as you say is not currently satisfied, although on Dictionary<K,V>
there is TryAdd
it doesn't return the value.
As for the cost of adding to the type itself (in this case, necessary to avoid dual lookups) it is not the JIT time but the code duplication. Dictionary<K,V>
is instantiated many times in a typical app - once for reference types, then once for (I guess - not my area of expertise) each sized value type, and that can add up significantly. So several proposed additions to the type were either abandoned or moved to extension methods (which is unfortunate if like in this case it would be less efficient). I would expect this to be a lesser concern for HashSet since it's presumably instantiated for fewer types.
@jkotas can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?
can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?
I have been looking at the code size vs. benefit. I do not think I have said a universal number on how much it is ok to add.
I meant, if I add a new 10-20 line method to Dictionary, as a rule of thumb how much does that increase the size of a typical app. I believe you gave a number but I cannot find it.
I feel like we should close this. Given we're going down the CollectionsMarshal route for #15059, we should probably consider a similar approach for HashSet. cc @GrabYourPitchforks
For the record, using CollectionsMarshal for HashSet was proven to be not viable because you can't do ref returns on keys:
https://github.com/dotnet/runtime/issues/82840#issuecomment-1450463397
That being said, consensus is that a GetOrAdd
method for HashSet
is not particularly valuable compared to just using HashSet.Add
except for the rare scenario where you need to access the actual instance stored in the HashSet:
https://github.com/dotnet/runtime/issues/82875#issuecomment-1452036878
I think that the proposal has this specific form:
/// <summary>
/// Searches the set for a given element and returns the equal element it finds, if any;
/// otherwise, adds the element to the set.
/// </summary>
/// <param name="item">The element to search for, or to add to the set.</param>
/// <returns>
/// The equal element that the search found, or the given element that was added to the set.
/// </returns>
public T GetOrAdd(T item);
@jnm2 is it possible to include it in your original proposal (first comment)?
@theodorzoulias Since this issue is closed, I'm not sure it will make any difference.
@jnm2 this proposal is about an API that some people would consider fundamental (myself included). In order to give this API a proper chance, the proposal should be as formal as possible. Which means that the signature of the proposed API should be placed in a prominent place inside the proposal. The developers who are coming to this page should be able to see at a first glance what is the API they are voting for.
Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.
A dictionary is also wasteful because it has to do two hash lookups, one for TryGetValue and one for Add:
Does the introduction of CollectionsMarshal.GetValueRefOrAddDefault
for dictionaries change the calculus here at all? The proposed GetOrAdd
method forces you to create a new T
regardless of whether one already exists whereas GetValueRefOrAddDefault
does it on-demand with a single lookup. I'll also say that it avoids having to define equality semantics for a type in a way that is clearly only meaningful when used in the context of lookups.
The proposed
GetOrAdd
method forces you to create a newT
regardless of whether one already exists
This is not a problem if the T
is a struct, but if it's a class
then you have a point. Allocating a superfluous T
object might be more costly than doing two lookups with the TryGetValue
+Add
. To solve this problem two more GetOrAdd
overloads might be needed:
public T GetOrAdd(T item, Func<T> itemFactory);
public T GetOrAdd<TArg>(T item, Func<TArg, T> itemFactory, TArg factoryArgument);
The item
will be used only for searching the set, and it can be a cashed (reusable) dummy object with only the "key" state initialized (assuming that the "key" state is mutable). In case the search fails, a new item will be generated by invoking the itemFactory
delegate, and then added in the set.
In which case two more GetOrAdd overloads might be needed:
The problem as usual with popular generic types like HashSet<T>
is that more methods contribute to larger static footprints. As a general rule we don't add them unless they address a fundamental issue users are facing. Frankly, I don't see a particularly compelling use case here.
Sure. On the other hand it would be nice if we had an idea about how much the static footprint is increased by each additional method in the HashSet<T>
collection. Without having a clue about what the increment is, we can't really respond to this argument.
A rule of thumb is multiplying the size of the added method by the number of generic instantiations (modulo reference types that reuse the same instantiation). It can be measured precisely, for example using a tool like sizoscope in Native AOT artifacts.
The number of generic instantiations should be a lot smaller for a HashSet<T>
than a Dictionary<K,V>
. The former has only one generic parameter, while the later has two, and (I assume) there is one generic instantiation for each combination of TKey
and TValue
. Something that confuses me when talking about generic instantiations is, are we talking about the generic instances of the HashSet<T>
in my own program only, or we must also include the generic instances that the .NET runtime creates internally for its own inner workings?
I am not very familiar with the profiling tools that come with .NET, so I hope that someone who does could use them and give us a concrete number about the size cost of each additional method in the HashSet<T>
collection.
are we talking about the generic instances of the HashSet
in my own program only, or we must also include the generic instances that the .NET runtime creates internally for its own inner workings?
All instantiations used by the app, the app dependencies and the runtime itself for its inner working.
@jkotas does the .NET runtime uses the HashSet<T>
a lot? If yes, why so? Shouldn't the developers of the runtime use highly optimized internal collections (without null checking, version checking etc), instead of the public collections? This way the evolution of the public collections would not be impeded, by increasing needlessly their generic instantiations.
The core runtime is not the problem. The problem are all libraries out there, including libraries that happen to live in the dotnet/runtime repo. We try to use the public types, including collections, in the implementation of dotnet/runtime libraries as much as possible. The public collection types have good tradeoff between usability, performance and security, and they are the right choice in vast majority of cases.
Here are some data about popularity of different collection types: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#list
@jkotas thanks for the link. According to the diagram, the Dictionary<K,V>
is about 3.5 times more popular than the HashSet<T>
. This diagram shows the number of references though, not the number of distinct generic instantiations, and I am not sure that we can make an estimation about the later based on the former. Especially if we take into account that the two collections differ in the number of generic arguments. My wild guess is that the Dictionary<K,V>
might have 10 times, or more, generic instantiations than the HashSet<T>
.
We try to use the public types, including collections, in the implementation of dotnet/runtime libraries as much as possible.
And aren't you worried about the negative impact this decision has on to ability of the public collections to evolve? The more you use them, the less you can improve them. It's like a bad romance story where Romeo says to Giulietta "The more I love you, the more ugly and incomplete I want you to be." I can understand that there can be many reasons to prefer using the public collections over internal ones (less code duplication, faster discovery of bugs etc), but the negative effects are becoming worse and worse. Each new API that is approved makes all other API proposals less likely to be approved.
The HashSet<T>
is equipped with quite a lot of specialized methods that most likely are not used too often. Like the IsProperSubsetOf
, IsProperSupersetOf
, SymmetricExceptWith
, UnionWith
etc. Could these methods be moved in an auxiliary internal class, so that they are not instantiated generically unless there is code that actually uses them? This way we could make some space for some new useful methods to be added. The GetOrAdd
is such a method, because it allows to do efficiently a fundamental operation on the collection. I think that most people would agree that in an ideal world, all fundamental operations in hashed collections would involve a single hashing of the key.
And aren't you worried about the negative impact this decision has on to ability of the public collection to evolve?
I do not think that dotnet/runtime libraries are particularly special. I think that it is an anti-pattern to do one thing in dotnet/runtime libraries and tell other library developers to do a different thing.
Each new API that is approved makes all other API proposals less likely to be approved.
This is true in general, even outside this issue. Every type in the BCL can only bear finite level of complexity. Each new API on a given type eats the complexity budget that makes the additional API less likely to be approved.
Every type in the BCL can only bear finite level of complexity. Each new API on a given type eats the complexity budget that makes the additional API less likely to be approved.
Sure. Each new API makes all other existing APIs less discoverable, increases the mental burden for the developer, increases the size of the documentation, elongates the vertical scrollbar in the intellisence etc. These are valid reasons to be conservative when it comes with approving new APIs. But the size of the static footprint of applications sounds like a technicality that shouldn't affect the decision process. And frankly it's quite depressing that it is, and it comes so often on the table, and it impacts the decisions so greatly.
@theodorzoulias I edited it in.
But the size of the static footprint of applications sounds like a technicality
API design is a technical process. :)
If performance and technical issues are not a concern, the then suggestion to use a dictionary here is entirely valid. But just as a dictionary adds costs some people aren't willing to bear, the runtime similarly has even more intense performance concerns it is impacted by when considering solutions here.
But the size of the static footprint of applications sounds like a technicality
The obvious solution would be to add the method on CollectionsMarshal
, which was created precisely to address the size concern. Even so, I don't see a particularly strong use case for this method for the reasons mentioned earlier.
@eiriktsarpalis having to do two hashings and two lookups for an operation that is known from other collections to be singular and atomic, is a pretty good indicator that something is missing from the public API of the HashSet<T>
collection.
The GetOrAdd
method is part of the ConcurrentDictionary<K,V>
collection, and it's very useful there. It's not part of the Dictionary<K,V>
, although it would be very useful as well, because of the static size concerns, but at least it can be implemented efficiently via the CollectionsMarshal
route. It's not part of the HashSet<T>
, because it would be useful in far fewer scenarios, but its presence would be very much appreciated in these specialized scenarios. It can't be implemented efficiently, because there is no CollectionsMarshal
support for the HashSet<T>
. Do you think that we should revive the idea of adding CollectionsMarshal
support for the HashSet<T>
, although it would be trivial for the developers to misuse this API and shoot themselves in the foot? Personally I am OK with this idea, because I think that if a developer is brave enough to venture the obscure System.Runtime.InteropServices
namespace, they are probably also careful enough to use the sensitive APIs correctly.
Would you consider unlocking and reopening the #82840 issue?
Do you think that we should https://github.com/dotnet/runtime/issues/82840 of adding CollectionsMarshal support for the HashSet
, although it would be trivial for the developers to misuse this API and shoot themselves in the foot?
It's not an issue of the API being potentially misused (which is a known concern for most CollectionsMarshal
APIs) but that it is fundamentally flawed: the returned ref T
value represents the key of the hash table, so making any nontrivial mutations to it is bound to corrupt the collection.
Let me try to rephrase what I believe is problematic with this group of proposals:
When modelling a HashSet<T>
where T
has a certain definition of equality it is good practice to regard t1 : T
and t2 : T
as interchangeable (modulo mutability concerns) whenever t1.Equals(t2)
. This invariant is observed in all definitions of equality that the framework ships regardless of whether it uses reference or structural equality. Because T
is the key of HashSet<T>
you can't actually look it up unless you already have a T
. But if you already have the T
then
T
if it exists because the two should be interchangeable. This implies that bool Contains(T item)
and bool Add(T item)
are the only APIs needed to interrogate the hash set.In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.
2. There is no need to retrieve the underlying stored
T
if it exists because the two should be interchangeable. This implies thatbool Contains(T item)
andbool Add(T item)
are the only APIs needed to interrogate the hash set.
Alas the HashSet<T>
is already equipped with the method TryGetValue
:
public bool TryGetValue (T equalValue, out T actualValue);
The mere existence of this API approves creative uses of the collection, where two T
s that are equal for the purpose of searching them are not equal for other purposes. The API proposed here, GetOrAdd
, builds on this precedence.
It's not an issue of the API being potentially misused (which is a known concern for most
CollectionsMarshal
APIs) but that it is fundamentally flawed: the returnedref T
value represents the key of the hash table, so making any nontrivial mutations to it is bound to corrupt the collection.
We are saying the same thing. The point is, can the CollectionsMarshal
APIs proposed in #82840 be used effectively for implementing the GetOrAdd
, AddOrUpdate
, TryUpdate
or other advanced APIs that some developers need for their specialized purposes? I think that they do. If the #82840 is reopened I'm probably going to replace the unfortunate "API Usage" that I posted initially with an efficient GetOrAdd
implementation, and we can continue the discussion there.
The mere existence of this API approves creative uses of the collection
I don't think the presence of an API that somehow got shipped in the past refutes the point I'm trying to make. Mistakes happen from time to time, it shouldn't be carte blanche for us to follow up with more mistakes.
If the https://github.com/dotnet/runtime/issues/82840 is reopened I'm probably going to replace the unfortunate "API Usage" that I posted initially with an efficient GetOrAdd implementation, and we can continue the discussion there.
Which GetOrAdd
implementation are you referring to?
I don't think the presence of an API that somehow got shipped in the past refutes the point I'm trying to make. Mistakes happen from time to time, it shouldn't be carte blanche for us to follow up with more mistakes.
Of course it does, unless you can prove that approving the TryGetValue
was a mistake. That API was proposed in the #20073, and the approval procedure included 11 participants. Proving that all these people participated in a mistake is not a trivial task! 🙂
Which
GetOrAdd
implementation are you referring to?
The one that I'll try to come up with, if the #82840 is reopened. Currently it exists nowhere. Would you prefer to post it here instead?
@eiriktsarpalis the moment that purity went out of the window was when the HashSet<T>
obtained the ability to compare its contents for equality using a non-default comparer, in other words when the constructor HashSet<T>(IEqualityComparer<T>)
)) was added to the collection. This constructor was added in .NET Framework 3.5, which is the same release that introduced the HashSet<T>
collection. So the collection was never pure. The developers were able to use non-standard equality definitions from day one. The notion "two T
s that are equal by one definition are equal by all definitions" never existed for the HashSet<T>
.
Let me try to rephrase what I believe is problematic with this group of proposals:
When modelling a
HashSet<T>
whereT
has a certain definition of equality it is good practice to regardt1 : T
andt2 : T
as interchangeable (modulo mutability concerns) whenevert1.Equals(t2)
. This invariant is observed in all definitions of equality that the framework ships regardless of whether it uses reference or structural equality. BecauseT
is the key ofHashSet<T>
you can't actually look it up unless you already have aT
. But if you already have theT
then
- There is no need to bother with delayed initialization factories such as the ones declared here and
- There is no need to retrieve the underlying stored
T
if it exists because the two should be interchangeable. This implies thatbool Contains(T item)
andbool Add(T item)
are the only APIs needed to interrogate the hash set.In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.
I think this is a very narrow view of the HashSet
Proving that all these people participated in a mistake is not a trivial task! 🙂
I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂
The notion "two
T
s that are equal by one definition are equal by all definitions" never existed for theHashSet<T>
.
I never claimed that types have a unique notion of equality, what I did say is that under a given definition of equality two equal values should be considered equivalent. IEqualityComparer<T>
is a tool that lets you override built-in equality, but it shouldn't invalidate that principle (in practice this can be abused).
I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory.
Same here, but the mere act of deduplication implies that you're happy to discard references up to that notion of equality, which suggests that these references are equivalent in your domain.
This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance.
How would a GetOrAdd
method improve performance in a deduplication method?
Proving that all these people participated in a mistake is not a trivial task! 🙂
I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂
@eiriktsarpalis since you are arguing against the proposed GetOrAdd
API from a theoretical/philosophical/academic standpoint, and since this API is building on the precedence of the existing TryGetValue
API, I think that it would be intellectually consistent to argue against both APIs. Either both are mistakes, or none. If you think that the TryGetValue
was approved incorrectly, it's still possible to mitigate this mistake by obsoleting the API, hiding it from the tools, updating the documentation with warnings against using it etc.
I think that the GetOrAdd
could be implemented with the help of the CollectionsMarshal
like this:
public static T GetOrAdd<T>(this HashSet<T> source, T item)
{
ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
// In case the item is not found, it has been added to the set and assigned to the itemRef.
return itemRef;
}
Although the CollectionsMarshal.GetItemRefOrAdd
(as proposed in #82840) is unsafe, the above GetOrAdd
implementation is not. The caller of the GetOrAdd
cannot misuse it in a way that would corrupt the HashSet<T>
. The same cannot be said for the factory-based GetOrAdd
:
public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
if (!exists)
{
item = itemFactory(factoryArgument);
Debug.Assert(source.Comparer.Equals(item, itemRef));
itemRef = item;
}
return itemRef;
}
This one transfers the responsibility of correctness to the caller. It's up to the itemFactory
to return a T
equal to the original T
. Checking for equality in release mode would defeat the whole purpose of this API, which is to have comparable performance with a single call to TryGetValue
or a single call to Add
.
Although the
CollectionsMarshal.GetItemRefOrAdd
(as proposed in #82840) is unsafe, the aboveGetOrAdd
implementation is not. The caller of theGetOrAdd
cannot misuse it in a way that would corrupt theHashSet<T>
. The same cannot be said for the factory-basedGetOrAdd
:public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument) { ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists); if (!exists) { item = itemFactory(factoryArgument); Debug.Assert(source.Comparer.Equals(item, itemRef)); Debug.Assert(source.Comparer.GetHashCode(item) == source.Comparer.GetHashCode(itemRef)); itemRef = item; } return itemRef; }
This one transfers the responsibility of correctness to the caller. It's up to the
itemFactory
to return aT
equal to the originalT
. Checking for equality in release mode would defeat the whole purpose of this API, which is to avoid hashing theT
element more than once.
I cannot for the life of me understand why the Factory version of this GetOrAdd for HashSet
@eiriktsarpalis - the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.
@kellypleahy the purpose of the factory-based GetOrAdd
is to avoid an allocation in case the T
is reference type. For context see this previous comment.
I think the difference here with TryGetValue
vs GetOrAdd
is that TryGetValue
unlocked functionality that was not possible before but this is only for performance and convenience reasons which doesn't seem as compelling given the static footprint costs.
@kellypleahy the purpose of the factory-based
GetOrAdd
is to avoid an allocation in case theT
is reference type. For context see this previous comment.
Hmm... that one seems really bad to me, but I guess that's just my opinion. I'm 100% +1 on GetOrAdd(item)
. Much more dubious on GetOrAdd<TArg>(item, Func<TArg, T> factory, TArg factoryArg)
. I can only imagine that being confusing at best. If you really need to "initialize" the object that you passed to item, I think it should be reasonably easy for you to know that the item you passed was the one added (maybe object.ReferenceEquals() or something). I do see the use case you are talking about, but I feel like it would be mostly just confusing to callers and handles a VERY rare case and may not even provide a performance boost at all over the double lookups/insert (I'm not sure I have the brainpower to confirm or deny a benefit at the moment).
@TylerBrinkley in case you have any idea about what is the static footprint cost of adding a method in the HashSet<T>
collection, please share it because I would really love to know. Otherwise, how are you able to make a judgment? On the one hand we have performance and convenience, and on the other hand we have a mystery cost. That's hardly enough data for judging anything.
There are many times when there is a custom equality comparison being done but I need access to the original instance.
What I'd like:
Today I'm forced to use a Dictionary which duplicates storage of the key for no reason. A dictionary is also wasteful because it has to do two hash lookups, one for
TryGetValue
and one forAdd
:This is no better; it's a lot more to maintain, unnecessarily coupled to the definition of equality, it's still duplicate storage of the key, and it's still duplicating the number of hash lookups:
KeyedCollection is the worst of all because in addition to the dictionary, it keeps a list I'll never use.
Proposed API