dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.45k stars 4.76k forks source link

[API Proposal]: Add overloaded methods for GetOrAdd, adding out parameters indicating if the key was added to the dictionary #79934

Open MarkCiliaVincenti opened 1 year ago

MarkCiliaVincenti commented 1 year ago

Background and motivation

Sometimes you need to run code only if there's a Get or an Add to the ConcurrentDictionary.

Consider this example:

while (!TryGetValue(key, out releaser) || !releaser.TryIncrement())
{
    if (TryAdd(key, releaserToAdd))
    {
        return releaserToAdd;
    }
}

The problem with this code is that you can't use GetOrAdd because in this example one only wants to do the increment on the get, not the add. Therefore, one has to use both TryGetValue and TryAdd, which is also suboptimal because the key.GetHashCode() is run again for the TryAdd (which wouldn't be the case if one used GetOrAdd).

With this change in place, the above code could be rewritten as such and work more efficiently because it wouldn't call GetHashCode() every time it tries to add:

while (true)
{
    releaser = GetOrAdd(key, releaserToAdd, out bool added);
    if (releaser != default)
    {
        if (!added && !releaser.TryIncrement())
        {
            continue;
        }
        return releaser;
    }
}

I have already tried to make a pull request at https://github.com/dotnet/runtime/pull/79931 but this has been closed as it didn't follow the API review process, but I am linking to it so you can easily see the small code changes.

API Proposal

/// <summary>
/// Adds a key/value pair to the <see cref="ConcurrentDictionary{TKey,TValue}"/>
/// if the key does not already exist.
/// </summary>
/// <param name="key">The key of the element to add.</param>
/// <param name="valueFactory">The function used to generate a value for the key</param>
/// <param name="added">When this method retuns, <paramref name="added"/> is set to true
/// when the key was added to the dictionary, otherwise false.</param>
/// <exception cref="ArgumentNullException"><paramref name="key"/> is a null reference
/// (Nothing in Visual Basic).</exception>
/// <exception cref="ArgumentNullException"><paramref name="valueFactory"/> is a null reference
/// (Nothing in Visual Basic).</exception>
/// <exception cref="OverflowException">The dictionary contains too many
/// elements.</exception>
/// <returns>The value for the key.  This will be either the existing value for the key if the
/// key is already in the dictionary, or the new value for the key as returned by valueFactory
/// if the key was not in the dictionary.</returns>
public TValue GetOrAdd(TKey key, Func<TKey, TValue> valueFactory, out bool added)
{
    if (key is null)
    {
        ThrowHelper.ThrowKeyNullException();
    }

    if (valueFactory is null)
    {
        ThrowHelper.ThrowArgumentNullException(nameof(valueFactory));
    }

    IEqualityComparer<TKey>? comparer = _comparer;
    int hashcode = comparer is null ? key.GetHashCode() : comparer.GetHashCode(key);

    added = false;

    if (!TryGetValueInternal(key, hashcode, out TValue? resultingValue))
    {
        added = TryAddInternal(key, hashcode, valueFactory(key), updateIfExists: false, acquireLock: true, out resultingValue);
    }

    return resultingValue;
}

API Usage

while (true)
{
    releaser = GetOrAdd(key, releaserToAdd, out bool added);
    if (releaser != default)
    {
        if (!added && !releaser.TryIncrement())
        {
            continue;
        }
        return releaser;
    }
}

Alternative Designs

No response

Risks

No response

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

svick commented 1 year ago

Can't you determine whether the value was added by comparing it with the returned value?

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation Sometimes you need to run code only if there's a Get or an Add to the ConcurrentDictionary. Consider this example: ```` while (!TryGetValue(key, out releaser) || !releaser.TryIncrement()) { if (TryAdd(key, releaserToAdd)) { return releaserToAdd; } } ```` The problem with this code is that you can't use `GetOrAdd` because in this example one only wants to do the increment on the get, not the add. Therefore, one has to use both `TryGetValue` and `TryAdd`, which is also suboptimal because the `key.GetHashCode()` is run again for the `TryAdd` (which wouldn't be the case if one used `GetOrAdd`). With this change in place, the above code could be rewritten as such and work more efficiently because it wouldn't call `GetHashCode()` every time it tries to add: ```` while (true) { releaser = GetOrAdd(key, releaserToAdd, out bool added); if (releaser != default) { if (!added && !releaser.TryIncrement()) { continue; } return releaser; } } ```` I have already tried to make a pull request at https://github.com/dotnet/runtime/pull/79931 but this has been closed as it didn't follow the API review process, but I am linking to it so you can easily see the small code changes. ### API Proposal ```` /// /// Adds a key/value pair to the /// if the key does not already exist. /// /// The key of the element to add. /// The function used to generate a value for the key /// When this method retuns, is set to true /// when the key was added to the dictionary, otherwise false. /// is a null reference /// (Nothing in Visual Basic). /// is a null reference /// (Nothing in Visual Basic). /// The dictionary contains too many /// elements. /// The value for the key. This will be either the existing value for the key if the /// key is already in the dictionary, or the new value for the key as returned by valueFactory /// if the key was not in the dictionary. public TValue GetOrAdd(TKey key, Func valueFactory, out bool added) { if (key is null) { ThrowHelper.ThrowKeyNullException(); } if (valueFactory is null) { ThrowHelper.ThrowArgumentNullException(nameof(valueFactory)); } IEqualityComparer? comparer = _comparer; int hashcode = comparer is null ? key.GetHashCode() : comparer.GetHashCode(key); added = false; if (!TryGetValueInternal(key, hashcode, out TValue? resultingValue)) { added = TryAddInternal(key, hashcode, valueFactory(key), updateIfExists: false, acquireLock: true, out resultingValue); } return resultingValue; } ```` ### API Usage ```` while (true) { releaser = GetOrAdd(key, releaserToAdd, out bool added); if (releaser != default) { if (!added && !releaser.TryIncrement()) { continue; } return releaser; } } ```` ### Alternative Designs _No response_ ### Risks _No response_
Author: MarkCiliaVincenti
Assignees: -
Labels: `api-suggestion`, `area-System.Collections`, `untriaged`
Milestone: -
MarkCiliaVincenti commented 1 year ago

Can't you determine whether the value was added by comparing it with the returned value?

First of all no, because the value you get will often be equal to the one you're trying to add.

Secondly, even if you could, it's suboptimal having to compare them. Then you might as well use the very first example and only do the GetHashCode() once, when you fail to get, rather than twice in every scenario (to compare the one you tried to add with the one you got/added):

while (!TryGetValue(key, out releaser) || !releaser.TryIncrement())
{
    if (TryAdd(key, releaserToAdd))
    {
        return releaserToAdd;
    }
}

I'm trying to optimize the above. It's wasteful having to do a GetHashCode() again for the inner TryAdd given that the TryAddInternal method is private.

MarkCiliaVincenti commented 1 year ago

Also, related to this, because the idea at the end of the day is to avoid unnecessary multiple calls to GetHashCode(), what is the reasoning behind TryGetValueInternal and TryAddInternal being private? In very high concurrency you may need to alternate between gets and adds multiple times until you finally manage to get or add, if you're locking and updating after the get, so I don't understand why the methods which accept a hashcode are made inaccessible.

svick commented 1 year ago

First of all no, because the value you get will often be equal to the one you're trying to add.

That sounds weird to me. Can you explain more about your use case? If you have some other way of getting the value associated with the key, why do you need the dictionary in the first place?

Secondly, even if you could, it's suboptimal having to compare them.

It shouldn't be. If the values are reference types, then ReferenceEquals is very cheap. If the values are small value types, then comparing for equality should also be cheap. That leaves large value types, which should be rare.

MarkCiliaVincenti commented 1 year ago

First of all no, because the value you get will often be equal to the one you're trying to add.

That sounds weird to me. Can you explain more about your use case? If you have some other way of getting the value associated with the key, why do you need the dictionary in the first place?

Secondly, even if you could, it's suboptimal having to compare them.

It shouldn't be. If the values are reference types, then ReferenceEquals is very cheap. If the values are small value types, then comparing for equality should also be cheap. That leaves large value types, which should be rare.

My use case is for https://github.com/MarkCiliaVincenti/AsyncKeyedLock/blob/6.0.2/AsyncKeyedLock/AsyncKeyedLockDictionary.cs

I'm adding an instance of a class which has an int property. When added, that int needs to be 1. Every time I read it, I need to increase it by 1. In a way it is a counter. Now if it's just been added (so it's in the dictionary with value of 1) then there's no knowing whether it was added or whether it was read from the dictionary. Both would have the value of 1 but when it's read from the dictionary you'd want to change it to a 2.

As for the micro-optimization part... TryGetValueInternal, TryAddInternal and TryUpdateInternal exist simply so methods like GetOrAdd don't go through GetHashCode() twice.

Why that optimization is not provided to whoever cannot use GetOrAdd I don't exactly understand why.

Ideally above line 83 in https://github.com/MarkCiliaVincenti/AsyncKeyedLock/blob/6.0.2/AsyncKeyedLock/AsyncKeyedLockDictionary.cs I would GetHashCode() and pass them on to all subsequent TryGetValue and TryAdd calls. But because the methods accepting a hashcode are private, I can't, so I will repeatedly end up compiling hashcodes with no way of telling how long that would take especially in the case of complex objects.

svick commented 1 year ago

As I mentioned, since the value to add is a reference type that is not the same as the value that's already in the dictionary, you can use ReferenceEquals to see whether that value was added. No new API needed.

MarkCiliaVincenti commented 1 year ago

As I mentioned, since the value to add is a reference type that is not the same as the value that's already in the dictionary, you can use ReferenceEquals to see whether that value was added. No new API needed.

With what I am trying to achieve, I don't think it's possible for GetHashCode() to get computed just once, except for in the happy path.

eiriktsarpalis commented 1 year ago

As I mentioned, since the value to add is a reference type that is not the same as the value that's already in the dictionary, you can use ReferenceEquals to see whether that value was added. No new API needed.

With what I am trying to achieve, I don't think it's possible for GetHashCode() to get computed just once, except for in the happy path.

Why would you need a hash code to determine reference equality?

MarkCiliaVincenti commented 1 year ago

As I mentioned, since the value to add is a reference type that is not the same as the value that's already in the dictionary, you can use ReferenceEquals to see whether that value was added. No new API needed.

With what I am trying to achieve, I don't think it's possible for GetHashCode() to get computed just once, except for in the happy path.

Why would you need a hash code to determine reference equality?

The hashcode gets computed every time you call TryGetValue, TryUpdate, TryAdd etc.

Sometimes, including in my code, you need to call TryGetValue, TryAdd etc multiple times, sometimes in a while(true). These methods call GetHashCode() every time. The methods that accept an already-computed hashcode are marked as not exposed publicly.

Here's similar code that's also calling TryGetValue, TryUpdate and TryAdd within a while loop, perhaps you are familiar with the SixLabors ImageSharp libraries: https://github.com/SixLabors/ImageSharp.Web/blob/main/src/ImageSharp.Web/Synchronization/RefCountedConcurrentDictionary.cs

Code like this would benefit from exposing the methods that accept the hashcode because then you are able to call GetHashCode once. Under heavy load, with my library or with ImageSharp.Web or with other libraries like https://github.com/amoerie/keyed-semaphores/blob/main/KeyedSemaphores/KeyedSemaphoresCollection.cs, GetHashCode could be called several times, needlessly.

eiriktsarpalis commented 1 year ago

I think what @svick suggested is, assuming your value is a reference type, you could try doing the following:

TValue result = dict.GetOrAdd(key, value);
bool added = ReferenceEquals(value, result);

With a little bit of work, it should be possible to extend that to the overload that accepts a value factory.

MarkCiliaVincenti commented 1 year ago

I think what @svick suggested is, assuming your value is a reference type, you could try doing the following:

TValue result = dict.GetOrAdd(key, value);
bool added = ReferenceEquals(value, result);

With a little bit of work, it should be possible to extend that to the overload that accepts a value factory.

I am already using GetOrAdd, but I still need it in a loop, and I still need to do the TryGetValue and TryAdd before.

What I'm trying to do here, using a ConcurrentDictionary and updating the reference counter in a thread safe manner and removing it when the counter reaches 0, is simply impossible to do with the current API without GetHashCode() getting called multiple times.

So yes, there clearly is a need for this API, because it would allow me, and others in my situation, to write more optimal code.

theodorzoulias commented 1 year ago

@MarkCiliaVincenti the motivation for the proposed GetOrAdd overload with the out bool added parameter is to use it like this:

releaser = GetOrAdd(key, releaserToAdd, out bool added);

The counter-argument is that the added can ne infered using the existing APIs, like this:

releaser = GetOrAdd(key, releaserToAdd);
bool added = ReferenceEquals(releaser, releaserToAdd);

So it seems that the proposal is not opening new possibilities for using the ConcurrentDictionary<TKey,TValue> collection. Specifically it's not reducing the number of times that the GetHashCode has to be called in any scenario. Maybe you have in mind some other proposal, that would involve exposing publicly methods that are currently private or internal?

MarkCiliaVincenti commented 1 year ago

@theodorzoulias yes I am now proposing making TryGetValueInternal, TryAddInternal, TryUpdateInternal public.

Although the original proposal also still holds. Using ReferenceEquals won't work if the key is a struct, for example.

theodorzoulias commented 1 year ago

@MarkCiliaVincenti you could try submitting a proposal about exposing the TryGetValueInternal, TryAddInternal and TryUpdateInternal methods, but I wouldn't be very optimistic about seeing it approved. You would have to present some really good arguments to support it, including usage scenarios and benchmarks, in order to overcome the instinctively negative feelings that such a proposal generates.

You might want to take a look at an informal proposal by @sab39 about a flexible Modify method that might be sufficient for your scenarios. That proposal (if submitted properly) might have better chances to be approved if it gets enough traction.

MarkCiliaVincenti commented 1 year ago

in order to overcome the instinctively negative feelings that such a proposal generates.

Why is this so? I'm trying to understand.

theodorzoulias commented 1 year ago

@MarkCiliaVincenti because the suffix Internal in their names signifies that they are intended to be hidden from the public view. :-) Btw there is a TryRemoveInternal method too.

MarkCiliaVincenti commented 1 year ago

@MarkCiliaVincenti because the suffix Internal in their names signifies that they are intended to be hidden from the public view. :-)

Yes of course, but what I don't understand is the reasoning why they were not made public in the first place. Was it to not add clutter? The people who made these methods did so because they realized that it is more optimal to pass on the already-computed hash code rather than computing it again when making methods such as GetOrAdd. If it's useful for GetOrAdd then it should also be useful for whoever needs to do multiple dictionary API calls in their code.

theodorzoulias commented 1 year ago

@MarkCiliaVincenti I think that adding public APIs that accept key+hashcode would be an admission that the IEqualityComparer<T> abstraction has failed. That's a hard pill to swallow. I do understand that these methods could be useful though.

MarkCiliaVincenti commented 1 year ago

@MarkCiliaVincenti I think that adding public APIs that accept key+hashcode would be an admission that the IEqualityComparer<T> abstraction has failed. That's a hard pill to swallow. I do understand that these methods could be useful though.

Why would it have failed though? Because we have such interfaces, it's possible to override the GetHashCode methods, and then there's no limit to how long these methods can be expected to take to return a value. Theoretically, it could take minutes, or hours, although extremely unlikely. More realistically, though, it could take enough time that makes my proposal of exposing these methods feasible.

theodorzoulias commented 1 year ago

@MarkCiliaVincenti you could try submitting a formal proposal, present your best arguments, explain why all the alternatives are worse, and see how it goes. Personally I have conflicting feelings about this, so I am not promising that I'll upvote it.