dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

ConcurrentDictionary: ApproximateCount #23278

Closed tzachshabtay closed 3 years ago

tzachshabtay commented 7 years ago

Proposed API

    public class ConcurrentDictionary<TKey, TValue> : ...
    {
         ...
         public int ApproximateCount { get; }
         ...
    }

Rationale and Usage

Following the discussion from dotnet/runtime#15249, there seemed to be a consensus that an ApproximateCount property can fit plenty of use cases and can be implemented to be faster than Count (i.e avoid locking).

For example, we might have a cache implementation that does some cleanup or flush-to-disk after reaching some arbitrarily defined X items in the dictionary:

    if (map.Count > 10000) cleanup();

In this use-case it doesn't really matter what the exact value of X is, and is perfectly tolerable if some race condition causes it to be inaccurate once in a while, but the action itself might be called a lot and in quick succession (maybe even in a tight loop), so the cost of the lock in Count is not negligible, and the developer might prefer to write if (map.ApproximateCount> 10000) cleanup(); instead to avoid that cost.

Another use-case with similar rationale behind it might be when we want to construct a list from a dictionary (or multiple dictionaries/collections), and need to come up with a reasonable estimate of the list's capacity before-hand to avoid extra copies. We could then write something like: List<T> list = new List<T>(map1.ApproximateCount + map2.ApproximateCount);, if we believe that the cost of using Count would be greater than the occasional non-exact list capacity.

danmoseley commented 7 years ago

@tzachshabtay thanks for the suggestion. Can you update your comment to be more like the format linked to from https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md ?

tzachshabtay commented 7 years ago

@danmosemsft, done.

danmoseley commented 7 years ago

@ianhays, @safern are area owners and can give feedback, or set to ready for review, as they think best.

ianhays commented 7 years ago

This is an interesting proposal. It seems to me that the "best guess" approach would be fine for a pretty limited number of scenarios (e.g. construction in your OP) though.

This is one where I'd like to see if we get more interest before moving forward.

danmoseley commented 7 years ago

OK, community folks if you would use this proposal please thumbs up the top post. We can let this sit for a while...

bcronce commented 4 years ago

Having played with multithreading since a wee child, perfect is the enemy of good when it comes to concurrency. Knowing anything "exactly" is a concurrency anti-pattern. I've wrapped the concurrent dictionary just to get approximate counts many times.

Imagine if a video game wanted to know "exactly" when an event occurred and in what order. I can image it. I lived through the horrible netcode of early FPS video games. Modern netcode, both client and server side event handling has to deal with the relativity of events. What is the definition of "when" or "order" when talking about different frames of reference. This is relativity for programming. There is no universal time. All that matters is the emergent property of correctness and handling conflicts in a reasonable way.

One just needs to be very clear about defining the exact semantics guarantees of ApproximateCount. An example is if ConcurrentDictionary uses buckets and tracks the size of each bucket, and maybe the ApproximateCount returns the sum of the "current" size of each bucket, it is possible to return a value larger than actually happen. So if you read bucket 0 and is size 2, and bucket 1 is empty at this moment, then you read bucket 1 and it's now size 2. The sum would be 4, yet maybe the max number of items was only 2. That shouldn't matter. All that really matters is the value is greater never negative.

And who really cares. Even if you could get the exact value of "2", the very instant you release a lock and try to do something with that "2", it's already possibly changed. Dealing with concurrency is like dealing with floating point values. Never do exactness.

As powerful as the idea of ApproximateCount is, I feel the crux is making sure the documentation is very well defined about what it actually means.

scalablecory commented 4 years ago

Even if you could get the exact value of "2", the very instant you release a lock and try to do something with that "2", it's already possibly changed.

Count must already be treated as ApproximateCount for this very reason.

It would be useful before thinking of a new API to look at the implementation of Count and see if it can be done with less locking without affecting the performance of the other operations.

ghost commented 3 years ago

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.

ghost commented 3 years ago

This issue will now be closed since it had been marked no recent activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.