Closed ebickle closed 3 years ago
Operation Complexity Construct Using IEnumerable Θ(log n)
I think this should be Θ(n). You need to at least iterate the input.
+1
Rx has a highly production-tested priority queue class:
Should a nested public enumerator structure be used to avoid an additional heap allocation during calls to GetEnumerator and when using foreach? My assumption is no, since enumerating over a queue is an uncommon operation.
I'd lean towards using the struct enumerator to be consistent with Queue<T>
which uses a struct enumerator. Also, if a struct enumerator isn't used now, we won't be able to change PriorityQueue<T>
to use one in the future.
Also perhaps a method for batch inserts? Can always then sort and continue from previous insertion point rather than starting at the beginning if that would help?:
public void Enqueue(List<T> items) {
items.Sort(_comparer);
... insertions ...
}
I've tossed a copy of the initial implementation here. Test coverage is far from complete, but if anyone is curious please take a look and let me know what you think. I've tried to follow the existing coding conventions from the System.Collections classes as much as possible.
Cool. Some initial feedback:
Queue<T>.Enumerator
implements IEnumerator.Reset
explicitly. Should PriorityQueue<T>.Enumerator
do the same?Queue<T>.Enumerator
uses _index == -2
to indicate the enumerator has been disposed. PriorityQueue<T>.Enumerator
has the same comment but has an extra _disposed
field. Consider getting rid of the extra _disposed
field and use _index == -2
to indicate it's been disposed to make the struct smaller and to be consistent with Queue<T>
_emptyArray
field can be removed and usage replaced with Array.Empty<T>()
instead.Also...
Dictionary<TKey, TValue>
, HashSet<T>
, SortedList<TKey, TValue>
, SortedDictionary<TKey, TValue>
, SortedSet<T>
, etc.) allow null to be passed in for the comparer, in which case Comparer<T>.Default
is used.Also...
ToArray
can be optimized by checking for _size == 0
before allocating the new array, in which case just return Array.Empty<T>()
.@justinvp Great feedback, thanks!
@ebickle
Array.Empty is a F# feature, so can't use it here sadly!
Not anymore: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Array.cs#L1060-L1069
I've migrated the code over to ebickle/corefx in the issue-574 branch.
Implemented the Array.Empty
Will be fixed once issue dotnet/corefx#966 is complete.
One key area I'm looking for feedback is how ToArray, CopyTo, and the enumerator should be handled. Currently they're optimized for performance, which means the target array is a heap and not sorted by the comparer.
There are three options: 1) Leave it as-is, and document that the returned array is not sorted. (it's a priority queue, not a sorted queue) 2) Modify the methods to sort the returned items/arrays. They will no longer be O(n) operations. 3) Remove the methods and enumerable support altogether. This is the "purist" option but removes the ability to quickly grab the remaining queued items when the priority queue is no longer needed.
Another thing I'd like feedback on is whether the queue should be stable for two items with the same priority (compare result of 0). Generally priority queues make no guarantees that two items with the same priority will be dequeued in the order they were enqueued, but I noticed that internal priority queue implementation used in System.Reactive.Core took some additional overhead to ensure this property. My preference would be to not do this, but I'm not completely sure which option is better in terms developers' expectations.
I happened upon this PR because I too was interested in adding a priority queue to .NET. Glad to see someone went to the effort of making this proposal :). After reviewing the code, I noticed the following:
IComparer
ordering is not consistent with Equals
, the behavior of this Contains
implementation (which uses the IComparer
) may be surprising to some users, as it is essentially contains an item with equal priority.Dequeue
. Shrinking the heap array by half when quarter full is typical.Enqueue
method accept null
arguments?Clear
should be Θ(n), since that is the complexity of System.Array.Clear
, which it uses. https://msdn.microsoft.com/en-us/library/system.array.clear%28v=vs.110%29.aspxI didn't see code to shrink the array in
Dequeue
. Shrinking the heap array by half when quarter full is typical.
Queue<T>
and Stack<T>
also don't shrink their arrays (based on Reference Source, they're not in CoreFX yet).
@mbeidler I'd considered adding some form of automatic array shrinking in dequeue, but as @svick pointed out it doesn't exist in the reference implementations of similar data structures. I'd be curious to hear from anyone on the .NET Core/BCL team whether there's any particular reason they chose that style of implementation.
Update: I checked List
Enqueue should support nulls, and is documented as allowing them. Did you run into a bug? I can't remember how robust the unit tests area in the area yet.
Interesting, I noticed in the reference source linked by @svick that Queue<T>
has an unused private constant named _ShrinkThreshold
. Perhaps that behavior existed in a previous version.
Concerning the use of IComparer
instead of Equals
in the implementation of Contains
, I wrote up the following unit test, which would fail currently: https://gist.github.com/mbeidler/9e9f566ba7356302c57e
@mbeidler Good point. According to MSDN, IComparer
However, it looks as though the same issue exists in the other collection classes. If I modify the code to operate on SortedList<Patient, Patient>, the test case still fails on ContainsKey. The implementation of SortedList<TKey, TValue>.ContainsKey calls Array.BinarySearch, which relies on IComparer to check for equality. The same holds true for SortedSet
Arguably it's a bug in the existing collection classes as well. I'll dug through the rest of the collections classes and see if there are any other data structures that accept an IComparer but test for equality separately. You're right though, for a priority queue you would expect custom ordering behavior that's completely independent from equality.
Committed a fix and the test case into my fork branch. The new implementation of Contains is based directly on the behavior of List
When I get some time later today, I'll likely submit bug reports for the other built-in collections. The probably can't be fixed due to regression behavior, but at very least the documentation needs to be updated.
I think it makes sense for ContainsKey
to use the IComparer<TKey>
implementation, since that is what specifies the key. However, I think it would be more logical for ContainsValue
to use Equals
instead of IComparable<TValue>
in its linear search; though in this case the scope is significantly reduced since a type's natural ordering is much less likely to be inconsistent with equals.
It appears that in the MSDN documentation for SortedList<TKey, TValue>
, the remarks section for ContainsValue
does indicate that the TValue's sort ordering is used in lieu of equality.
@terrajobst How do you feel about the API proposal so far? Do you feel this a good fit for CoreFX?
:+1:
Thanks for filing this. I believe we've enough data to make a formal review of this proposal, hence I labelled it as 'ready for api review'
As Dequeue
and Peek
are throwing methods the caller needs to check count before each call. Would it make sense to instead (or in addition) provide TryDequeue
and TryPeek
following the pattern of the concurrent collections? There are issues open to add non-throwing methods to existing generic collections so adding a new collection which doesn't have these methods seems counterproductive.
@andrewgmorris related https://github.com/dotnet/corefx/issues/4316 "Add TryDequeue to Queue"
We had a basic review of this and we agree that we want a ProrityQueue in the framework. However, we need to get someone to help drive the design and implementation for it. Whoever grabs the issue can work @terrajobst on finalizing the API.
So what work is left on the API?
This is missing from the API proposal above: PriorityQueue<T>
should implement IReadOnlyCollection<T>
to match Queue<T>
(Queue<T>
now implements IReadOnlyCollection<T>
as of .NET 4.6).
I don't know that array-based priority queues are best. Memory allocation in .NET is really fast. We don't have the same search-small-blocks issue that the old malloc dealt with. You are welcome to use my priority queue code from here: https://github.com/BrannonKing/Kts.Astar/tree/master/Kts.AStar
@ebickle One tiny nit on the speclet. It says:
/// Adds an object to the end of the <see cref="PriorityQueue{T}"/>. ... public void Enqueue(T item);
Shouldn't it say instead, /// Inserts object into the <see cref="PriorityQueue{T}"> by its priority.
@SunnyWar Fixed the documentation of the Enqueue method, thank you!
Some time ago, I created a data structure with complexities similar to a priority queue based on a Skip List data structure which I've decided at this point to share: https://gist.github.com/bbarry/5e0f3cc1ac7f7521fe6ea25947f48ace
https://en.wikipedia.org/wiki/Skip_list
The skip list matches the complexities of a priority queue above in average cases, except that Contains is an average case O(log(n))
. In addition, access to either first or last elements are constant time operations and iteration in both forward and reverse order matches the forward order complexities of a PQ.
There obviously are downsides to such a structure in the form of higher costs on memory, and it does devolve to O(n)
worst case inserts and removals, so it has its trade offs...
Is this already implemented somewhere? When is the expected release? Also whats about updating the priority of an existing item?
@Priya91 @ianhays is this ready to get marked ready for review?
This is missing from the API proposal above: PriorityQueue
should implement IReadOnlyCollection to match Queue (Queue now implements IReadOnlyCollection as of .NET 4.6).
I agree with @justinvp here.
@Priya91 @ianhays is this ready to get marked ready for review?
I'd say so. This has been sitting for a while; lets make moves on it.
@justinvp @ianhays I've updated the spec to implement IReadOnlyCollection
I have a full implementation of the class and it's associated PriorityQueueDebugView that uses an array-based implementation. Unit tests aren't at 100% coverage yet, but if there's some interest I can do a bit of work and dust off my fork.
@NKnusperer made a good point about updating the priority of an existing item. I'll leave it out for now but it's something to consider during spec review.
There are 2 implementations in full framework, that are possible alternatives. https://referencesource.microsoft.com/#q=priorityqueue
For reference here is a question about the Java PriorityQueue on stackoverflow http://stackoverflow.com/questions/683041/java-how-do-i-use-a-priorityqueue. It is interesting that the priority is handled by a comparer instead of just a simple int priority wrapper object. For example it doesn't make it easy to change the priority of an item in the queue already.
API review: We agree that it is useful to have this type in CoreFX, because we expect CoreFX to use it.
For final review of the API shape, we would like to see sample code: PriorityQueue<Thread>
and PriorityQueue<MyClass>
.
T
.Notes:
Remove
and Add
on the queue.This would be a really useful type to have in CoreFX. Is anyone interested in grabbing this one?
I don't like the idea of fixing the priority queue to a binary heap. Have a look at my AlgoKit wiki page for details. Quick thoughts:
Last year I implemented a few heap types. In particular, there is IHeap
interface that could be used as the priority queue.
I'm all for introducing an IHeap
interface and some most performant implementations (at least those array based). The API and implementations are in the repository I linked above.
So no priority queues. Heaps.
What do you think?
@karelz @safern @danmosemsft
@pgolebiowski Don't forget we are designing easy-to-use-while-pretty-performant API. If you want PriorityQueue
(which is established computer-science term), you should be able to find one in docs / via internet search.
If the underlying implementation is heap (which I personally wonder why), or something else, it doesn't matter too much. Assuming you can demonstrate the implementation is measurably better than simpler alternatives (complexity of code is also a metric that somewhat matters).
So if you still believe heap-based implementation (without IHeap
API) is better than some simple list-based or list-of-array-chunks-based implementation, please explain why (ideally in a few sentences/paragraphs), so that we can get agreement on the implementation approach (and avoid wasting time on your side on complex implementation which may be turned down at PR review time).
Drop ICollection
, IEnumerable
? Just have the generic versions (though generic IEnumerable<T>
will bring in IEnumerable
)
@pgolebiowski how its implemented doesn't change the external api. PriorityQueue
defines the behaviour/contract; whereas a Heap
is a specific implementation.
If the underlying implementation is heap (which I personally wonder why), or something else, it doesn't matter too much. Assuming you can demonstrate the implementation is measurably better than simpler alternatives (complexity of code is also a metric that somewhat matters).
So if you still believe heap-based implementation is better than some simple list-based or list-of-array-chunks-based implementation, please explain why (ideally in a few sentences/paragraphs), so that we can get agreement on the implementation approach [...]
OK. A priority queue is an abstract data structure that can be then implemented in some way. Of course you can implement it with a data structure different than a heap. But no way is more efficient. As a result:
To back my words let's start with the theoretical support. Introduction to Algorithms, Cormen:
[…] priority queues come in two forms: max-priority queues and min-priority queues. We will focus here on how to implement max-priority queues, which are in turn based on max- heaps.
Stated clearly that priority queues are heaps. This is a shortcut of course, but you get the idea. Now, more importantly, let's have a look at how other languages and their standard libraries add support for the operations we are discussing:
PriorityQueue<T>
— priority queue implemented with a heap.BinaryHeap
— heap explicitly in the API. It says in the docs that it's a priority queue implemented with a binary heap. But the API is very clear on the structure — a heap.CFBinaryHeap
— again, explicitly says what the data structure is, avoiding the usage of the abstract term "priority queue". The docs describing the class: Binary heaps can be useful as priority queues. I like the approach.priority_queue
— once again, implemented canonically with a binary heap built on the top of an array.heapq
— heap is explicitly exposed in the API. The priority queue is only mentioned in the docs: This module provides an implementation of the heap queue algorithm, also known as the priority queue algorithm. Heaps are binary trees. Also notice the huge shortcut to say that a heap equals a priority queue and that a heap equals a binary tree.heap package
— there is even a heap interface. No explicit priority queue, but again only in docs: A heap is a common way to implement a priority queue. To build a priority queue, implement the Heap interface.I strongly believe we should go the Rust / Swift / Python / Go way and expose a heap explicitly while clearly stating in the docs that it can be used as a priority queue. I strongly believe this approach is very clean and simple.
I hope you all like the approach of just clearly exposing a heap data structure — and making a priority queue reference in the docs only.
We need to agree upon the matter above, but let me just start another topic — how to implement this. I can see two solutions here:
ArrayHeap
class. Seeing the name you can immediately tell what kind of heap you're dealing with (again, there are dozens of heap types). You see it is array-based. You immediately know the beast you're dealing with.IHeap
interface and providing one or more implementations. The customer could write code that depends on interfaces — this would allow providing really clear and readable implementations of complex algorithms. Imagine writing a DijkstraAlgorithm
class. It could simply depend on the IHeap
interface (one parametrized constructor) or just use the ArrayHeap
(default constructor). Clean, simple, explicit, no ambiguity involved due to using a "priority queue" term. And a wonderful interface that makes very much sense theoretically.In the approach above, the ArrayHeap
represents an implicit heap-ordered complete d-ary tree, stored as an array. This can be used to create for example a BinaryHeap
or a QuaternaryHeap
.
For further discussion I strongly encourage you to have a look at this paper. You will know that:
4-ary heaps (quaternary) are simply faster than 2-ary heaps (binary). There are many tests done in the paper — you would be interested in comparing the performance of implicit_4
and implicit_2
(sometimes implicit_simple_4
and implicit_simple_2
).
The optimal choice of implementation is strongly input-dependent. Out of implicit d-ary heaps, pairing heaps, Fibonacci heaps, binomial heaps, explicit d-ary heaps, rank-pairing heaps, quake heaps, violation heaps, rank-relaxed weak heaps, and strict Fibonacci heaps, the following types seem to cover almost all needs for various scenarios:
IHeap
approach, it would be worth adding this, as in many cases it is really fast and faster than the array-based solution.For both of them the programming effort is surprisingly low. Have a look at my implementations.
@karelz @benaadams
This would be a really useful type to have in CoreFX. Is anyone interested in grabbing this one?
@safern I'd be very happy to grab this one from now on.
OK, it was a problem between my keyboard and my chair - PriorityQueue
is of course based on Heap
-- I was thinking about just Queue
where it doesn't make sense and forgot that heap is 'sorted' -- very embarrassing thinking-process-outage for someone like me, who loves logic, algorithms, Turing machines, etc., my apologies. (BTW: As soon as I read couple of sentences in your Java docs link, the discrepancy immediately clicked)
From that perspective it makes sense to build the API on top of Heap
. We should not make that class public yet though - it will require its own API review and its own discussion if it is something we need in CoreFX. We do not want API surface creep due to implementation, but it may be the right thing to do -- hence the discussion needed.
From that perspective I don't think we need to create IHeap
just yet. It may be good decision later.
If there is research that specific heap (e.g. 4-ary as you mention above) is best for general random input, then we should choose that. Let's wait for @safern @ianhays @stephentoub to confirm/voice opinion.
The parametrization of underlying heap with multiple implemented options is something which IMO does not belong into CoreFX (I may be wrong here, again - let's see what others think). My reason is that IMO we would just soon ship gazillions of specialized collections which would be very hard for people (average developer without strong background in nuances of algorithms) to choose from. Such library, however, would make a great NuGet package, for experts in the field - owned by you/community. In future, we may consider adding it into PowerCollections (we are actively discussing for last 4 months where on GitHub to put this library and if we should own it, or if we should encourage community to own it - there are different opinions on the matter, I expect we will finalize its fate post 2.0)
Assigning to you as you want to work on it ...
@pgolebiowski collaborator invite sent - when you accept ping me, I will be able to assign it to you then (GitHub limitations).
@benaadams I would keep ICollection
(mild preference). For consistency with other d.s. in CoreFX. IMO it is not worth to have one odd beast here ... if we were adding a handful new ones (e.g. PowerCollections even to another repo), we should not include the non-generic ones ... thoughts?
OK, it was a problem between my keyboard and my chair.
Haha 😄 No worries.
it makes sense to build the API on top of Heap. We should not make that class public yet though [...] We do not want API surface creep due to implementation, but it may be the right thing to do -- hence the discussion needed. [...] I don't think we need to create IHeap just yet. It may be good decision later.
If the decision of the group is to go with PriorityQueue
, I'll just help with the design and implement this. However, please take into consideration the fact that if we add a PriorityQueue
now, it will be messy in the API later to add Heap
-- as both behave basically the same. It would be sort of redundancy IMO. This would be a design smell for me. I wouldn't add the priority queue. It doesn't help.
Also, one more thought. Actually, the pairing heap data structure could come in handy fairly often. Array-based heaps are horrible at merging them. This operation is basically linear. When you're having lots of merging heaps, you're killing the performance. However, if you use a pairing heap instead of an array heap -- the merging operation is constant (amortized). This is another argument why I would like to provide a nice interface and two implementations. One for general input, second for some specific scenarios, especially when merging of heaps is involved.
Voting for IHeap
+ ArrayHeap
+ PairingHeap
! 😄 (like in Rust / Swift / Python / Go)
If the pairing heap is too much -- OK. But let's at least go with IHeap
+ ArrayHeap
. Don't you guys feel that going with a PriorityQueue
class is locking the possibilities in the future and making the API less clear?
But as I said -- if you all vote for a PriorityQueue
class over the proposed solution -- OK.
collaborator invite sent - when you accept ping me, I will be able to assign it to you then (GitHub limitations).
@karelz ping :)
please take into consideration the fact that if we add a PriorityQueue now, it will be messy in the API later to add Heap -- as both behave basically the same. It would be sort of redundancy IMO. This would be a design smell for me. I wouldn't add the priority queue. It doesn't help.
Can you explain more details about why it will be messy later? What are your concerns?
PriorityQueue
is concept people use. Having a type named that way is useful, right?
I think the logical operations (at least their names) on Heap
might be different. If they are the same, we can have 2 different implementations of the same code in the worst case (not ideal, but not the end of the world). Or we can insert Heap
class as parent of PriorityQueue
, right? (assuming it is allowed from API review perspective - right now I don't see reason for not, but I don't have that many years of experience with API reviews, so will wait on others to confirm)
Let's see how the voting & further design discussion go ... I am slowly warming up to the idea of IHeap
+ ArrayHeap
, but not fully convinced yet ...
if we were adding a handful new ones ... we should not include the non-generic ones
Red rag to a bull. Anyone got some other collections to add so we can drop ICollection
?
Circular/Ring buffer; Generic and Concurrent?
@karelz A solution for the naming issue could be something like IPriorityQueue
like DataFlow does for produce/consumer patterns. Many ways to implement a priority queue and if you don't care about that use the interface. Care about the implementation or are creating an instance use implementing class.
See LATEST Proposal in corefxlab repo.
Second Proposal options
Proposal from https://github.com/dotnet/corefx/issues/574#issuecomment-307971397
Assumptions
Elements in priority queue are unique. If they are not, we would have to introduce 'handles' of items to enable their update/remove. Or the update/remove semantics would have to apply to first/all, which is weird.
Modeled after
Queue<T>
(MSDN link)API
Open questions:
PriorityQueue
vs.Heap
IHeap
and constructor overload? (Should we wait for later?)IPriorityQueue
? (Should we wait for later -IDictionary
example)(TElement element, TPriority priority)
vs.KeyValuePair<TPriority, TElement>
Peek
andDequeue
rather haveout
argument instead of tuple?Peek
andDequeue
throwing useful at all?Original Proposal
Issue https://github.com/dotnet/corefx/issues/163 requested the addition of a priority queue to the core .NET collection data structures.
This post, while a duplicate, is intended to act the formal submission to the corefx API Review Process. The issue contents are the speclet for a new System.Collections.Generic.PriorityQueue type.
I will be contributing the PR, if approved.
Rationale and Usage
The .NET Base Class Libraries (BCL) currently lacks support for ordered producer-consumer collections. A common requirement of many software applications is the ability generate a list of items over time and process them in an order different from the order they were received in.
There are three generic data structures within the System.Collections hierarchy of namespaces that supported a sorted collection of items; System.Collections.Generic.SortedList, System.Collections.Generic.SortedSet, and System.Collections.Generic.SortedDictionary.
Of these, SortedSet and SortedDictionary are not appropriate for producer-consumer patterns that generate duplicate values. The complexity of SortedList is Θ(n) worst case for both Add and Remove.
A much more memory and time efficient data structure for ordered collections with producer-consumer usage patterns is a priority queue. Other than when capacity resizing is necessary, worse case insertion (enqueue) and remove top (dequeue) performance is Θ(log n) - far better than the existing options that exist in the BCL.
Priority queues have a wide degree of applicability across different classes of applications. The Wikipedia page on Priority Queues offers a list of many different well understand use cases. While highly specialized implementations may still require custom priority queue implementations, a standard implementation would cover a broad range of usage scenarios. Priority queues are particularly useful in scheduling the output of multiple producers, which is an important pattern in highly parallelized software.
It's worth noting that both the C++ standard library and Java offer priority queue functionality as part of their basic APIs.
Proposed API
Details
Open Questions
Updates