dotnet / runtime

Investigate optimizing some OrderBy().Take calls to use a PriorityQueue #96277

stephentoub commented 10 months ago

Order/OrderBy.Take is currently an O(N log N) operation that allocates an array of length N and keeps all the data around until the relevant portion of the data is consumed. With a PriorityQueue, however, for OrderBy(...).Take(T) this would instead be O(N log T), and it would only need to keep space proportional to T, not N, so if T is significantly smaller than N (which in typical use is the case, often with a small T value passed to Take), this could lead to significant throughput and memory consumption benefits.

Variations of this with OrderByDescending are also possible. It'd also be possible to accommodate OrderBy().Skip(S).Take(T); the priority queue would then need to maintain S+T values, so the benefits would diminish as S grew. However, from an algorithmic complexity perspective, S+T <= N, which means O(N log (S+T)) <= O(N log N). Of course, we'd still need to be careful, since the constants here do matter from a throughput perspective.
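
For illustration, this is the query shape in question (a minimal example; the data array reuses the sample that comes up later in this thread):

using System;
using System.Linq;

int[] data = { 6, 3, 0, 6, 7, 1, 4, 9, 3, 2 };

// Today this allocates a buffer for all N elements and keeps it alive until
// the results are consumed; with a bounded PriorityQueue, only the three
// smallest elements would ever need to be retained.
int[] top3 = data.OrderBy(x => x).Take(3).ToArray();

Console.WriteLine(string.Join(", ", top3)); // 0, 1, 2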

eiriktsarpalis commented 10 months ago

At a quick glance, it seems we already have the infrastructure to accommodate such an optimization; essentially it would mean updating this method to use a PriorityQueue:

https://github.com/dotnet/runtime/blob/14127ea770cf748509b9057b7a8ed5a3471ebfed/src/libraries/System.Linq/src/System/Linq/OrderedEnumerable.cs#L34-L59

akade commented 10 months ago

Do you still want contributions on this? I'd be happy to explore that.

After a quick look at it, I have the following thoughts: The priority queue does not support removing leaves, correct? Hence, for the implementation, I think it would be something along the following lines:

Does that make sense? Or is there any way to avoid the need for reversing, as it will likely require a second allocation of a maxIdx-sized chunk?

One question remains, though. Afaik OrderBy guarantees a stable sort, but PriorityQueue doesn't give any ordering guarantees for values with the same priority. We would need to account for that separately, right?

stephentoub commented 10 months ago

OrderBy is already ordered min to max, and PriorityQueue is a min heap; there shouldn't be any need for reversal. There's also no need to Peek; if Count == take limit, just EnqueueDequeue. And at the end of the enumeration, just TryDequeue until it returns false. For stability, yes, it needs to be accounted for; the easiest way is to store the element's index as part of the priority and use it as a tie breaker if the elements otherwise compare equal.

I have a prototype locally, but if you'd like to try / take over, have at it :) The hardest part is the heuristic for when to use this approach.

akade commented 10 months ago

As I will probably have some time over the next few days, I'd like to have a go :)

I played around with a bit of code and now understand that no peeking is necessary and that keeping the sort stable is trivial using a combined priority.

However, I'm still not sure what I missed with regard to inverting the queue:

If you take the following random input [6, 3, 0, 6, 7, 1, 4, 9, 3, 2] and use a priority queue with just EnqueueDequeue when reaching the limit of 3, you will get [6,7,9] instead of [0,1,2].

EnqueueDequeue() removes the smallest element (i.e., in this case 0 when enqueuing the second 6), and any value smaller than the current minimum is discarded immediately. Accordingly, just looping on TryDequeue returns ascending values, starting with the min in the heap.

The (isolated) code that I used:

static IEnumerable<T> TakeOrdered<T, TKey>(IEnumerable<T> source, Func<T, TKey> keySelector, int count)
{
    // The priority combines the sort key with the source position so that
    // equal keys still compare deterministically (keeping the sort stable).
    PriorityQueue<T, (TKey key, int sourcePos)> queue = new(count);

    int i = 0;
    foreach (T item in source)
    {
        if (queue.Count == count)
        {
            // At capacity: enqueue the new item and immediately remove the
            // minimum, so the `count` LARGEST elements survive in the heap.
            queue.EnqueueDequeue(item, (keySelector(item), i));
        }
        else
        {
            queue.Enqueue(item, (keySelector(item), i));
        }

        i++;
    }

    // Dequeues in ascending order, starting from the heap's minimum.
    while (queue.TryDequeue(out T? elem, out _))
    {
        yield return elem;
    }
}

Reversing the comparison will result in [2, 1, 0] instead of [0, 1, 2]...

stephentoub commented 10 months ago

However, I'm still not sure what I missed with regard to inverting the queue:

You want to use a comparer that inverts the comparison (ideally by implementing a custom struct TPriority with a Compare that inverts the comparison); that ensures EnqueueDequeue will remove the largest elements so that the smallest remain in the queue. But then, you're right, you'll need to reverse the final order of the elements retrieved with TryDequeue at the end, as they'll emerge largest to smallest rather than smallest to largest. The theory is that this optimization will be used when the Take(T) count is relatively small, so the reversal can either be done with stack memory or won't require additional memory at all because, e.g., it's part of a ToArray or ToList where the elements can be stored into their destination slots directly or reversed in place in the destination storage.
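
Putting that together, a rough sketch of the inverted-comparer variant (illustrative names, not the actual implementation; Comparer<T>.Create is used for brevity where the custom struct TPriority mentioned above would avoid a delegate call per comparison):

using System;
using System.Collections.Generic;

static IEnumerable<T> TakeOrderedViaMaxHeap<T, TKey>(IEnumerable<T> source, Func<T, TKey> keySelector, int count)
{
    // Inverting the comparison turns the min-heap into a max-heap: EnqueueDequeue
    // now evicts the current LARGEST element, so the `count` smallest survive.
    // For equal keys, the later source index compares as "smaller" and is evicted
    // first, which preserves the stability of the sort.
    Comparer<(TKey Key, int Index)> inverted = Comparer<(TKey Key, int Index)>.Create((x, y) =>
    {
        int c = Comparer<TKey>.Default.Compare(y.Key, x.Key);
        return c != 0 ? c : y.Index.CompareTo(x.Index);
    });

    PriorityQueue<T, (TKey Key, int Index)> queue = new(count, inverted);

    int i = 0;
    foreach (T item in source)
    {
        if (queue.Count == count)
        {
            queue.EnqueueDequeue(item, (keySelector(item), i));
        }
        else
        {
            queue.Enqueue(item, (keySelector(item), i));
        }

        i++;
    }

    // Elements dequeue largest to smallest, so fill a count-sized buffer
    // back to front to produce ascending order (the reversal discussed above).
    T[] buffer = new T[queue.Count];
    for (int j = buffer.Length - 1; j >= 0; j--)
    {
        buffer[j] = queue.Dequeue();
    }

    return buffer;
}

With the sample input [6, 3, 0, 6, 7, 1, 4, 9, 3, 2] and a count of 3, this yields [0, 1, 2].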

akade commented 10 months ago

Update from my side:

I wrote a first version in isolation (against .NET 8):

| Method | NumberOfPeople | AmountTaken | AmountSkipped | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|--------------------- |---------------:|------------:|--------------:|---------:|----------:|----------:|-------:|-------:|----------:|
| OrderByValueType | 512 | 20 | 0 | 2.697 us | 0.0162 us | 0.0136 us | 0.6294 | 0.0114 | 10528 B |
| OrderByValueType_New | 512 | 20 | 0 | 2.031 us | 0.0342 us | 0.0320 us | 0.0420 | - | 728 B |
| OrderByValueType | 512 | 20 | 128 | 4.866 us | 0.0167 us | 0.0130 us | 0.6256 | 0.0076 | 10560 B |
| OrderByValueType_New | 512 | 20 | 128 | 2.193 us | 0.0191 us | 0.0170 us | 0.2251 | - | 3800 B |

More results:

| Method | NumberOfPeople | AmountTaken | AmountSkipped | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|--------------------- |---------------:|------------:|--------------:|----------:|----------:|----------:|-------:|-------:|----------:|
| OrderByValueType | 512 | 10 | 0 | 2.669 us | 0.0408 us | 0.0361 us | 0.6294 | 0.0114 | 10528 B |
| OrderByValueType_New | 512 | 10 | 0 | 2.114 us | 0.0071 us | 0.0067 us | 0.0267 | - | 488 B |
| OrderByValueType | 512 | 20 | 0 | 2.697 us | 0.0162 us | 0.0136 us | 0.6294 | 0.0114 | 10528 B |
| OrderByValueType_New | 512 | 20 | 0 | 2.031 us | 0.0342 us | 0.0320 us | 0.0420 | - | 728 B |
| OrderByValueType | 512 | 128 | 0 | 6.731 us | 0.0332 us | 0.0310 us | 0.6256 | 0.0076 | 10528 B |
| OrderByValueType_New | 512 | 128 | 0 | 2.180 us | 0.0144 us | 0.0127 us | 0.1984 | - | 3320 B |
| OrderByValueType | 512 | 256 | 0 | 9.985 us | 0.0381 us | 0.0318 us | 0.6256 | - | 10528 B |
| OrderByValueType_New | 512 | 256 | 0 | 2.247 us | 0.0252 us | 0.0236 us | 0.3815 | 0.0076 | 6392 B |
| OrderByValueType | 512 | 512 | 0 | 15.035 us | 0.0864 us | 0.0765 us | 0.6256 | - | 10528 B |
| OrderByValueType_New | 512 | 512 | 0 | 2.345 us | 0.0355 us | 0.0332 us | 0.7477 | 0.0305 | 12536 B |
| OrderByValueType | 512 | 10 | 128 | 4.825 us | 0.0375 us | 0.0351 us | 0.6256 | 0.0076 | 10560 B |
| OrderByValueType_New | 512 | 10 | 128 | 2.086 us | 0.0092 us | 0.0081 us | 0.2098 | - | 3560 B |
| OrderByValueType | 512 | 20 | 128 | 4.866 us | 0.0167 us | 0.0130 us | 0.6256 | 0.0076 | 10560 B |
| OrderByValueType_New | 512 | 20 | 128 | 2.193 us | 0.0191 us | 0.0170 us | 0.2251 | - | 3800 B |
| OrderByValueType | 512 | 128 | 128 | 7.138 us | 0.0450 us | 0.0399 us | 0.6256 | 0.0076 | 10560 B |
| OrderByValueType_New | 512 | 128 | 128 | 2.177 us | 0.0416 us | 0.0478 us | 0.3815 | 0.0076 | 6392 B |
| OrderByValueType | 512 | 256 | 128 | 10.607 us | 0.0890 us | 0.0832 us | 0.6256 | - | 10560 B |
| OrderByValueType_New | 512 | 256 | 128 | 2.304 us | 0.0352 us | 0.0329 us | 0.5646 | 0.0191 | 9464 B |
| OrderByValueType | 512 | 512 | 128 | 13.314 us | 0.0530 us | 0.0496 us | 0.6256 | - | 10560 B |
| OrderByValueType_New | 512 | 512 | 128 | 2.434 us | 0.0313 us | 0.0261 us | 0.9308 | 0.0534 | 15608 B |

While a bit different from what I expected and with room for improvement, it looks very promising. So I went along and implemented it within the real system. However, the results there are completely different:

Before:

| Method | NumberOfPeople | AmountTaken | AmountSkipped | Mean | Error | StdDev | Median | Min | Max | Gen0 | Gen1 | Allocated |
|----------------- |---------------:|------------:|--------------:|------------:|----------:|----------:|------------:|------------:|------------:|-------:|-------:|----------:|
| OrderByValueType | 512 | 20 | 0 | 3,169.08 ns | 18.857 ns | 17.639 ns | 3,168.06 ns | 3,145.39 ns | 3,207.83 ns | 0.6181 | 0.0126 | 10.28 KB |
| OrderByValueType | 512 | 20 | 128 | 3,965.31 ns | 76.728 ns | 78.794 ns | 3,959.77 ns | 3,872.22 ns | 4,131.64 ns | 0.6277 | - | 10.31 KB |

More results:

| Method | NumberOfPeople | AmountTaken | AmountSkipped | Mean | Error | StdDev | Median | Min | Max | Gen0 | Gen1 | Allocated |
|----------------- |---------------:|------------:|--------------:|------------:|----------:|----------:|------------:|------------:|------------:|-------:|-------:|----------:|
| OrderByValueType | 128 | 10 | 0 | 1,069.33 ns | 13.612 ns | 12.733 ns | 1,063.60 ns | 1,056.67 ns | 1,097.17 ns | 0.1665 | - | 2.78 KB |
| OrderByValueType | 128 | 10 | 128 | 78.08 ns | 0.838 ns | 0.784 ns | 77.90 ns | 76.75 ns | 79.66 ns | 0.0739 | - | 1.21 KB |
| OrderByValueType | 128 | 20 | 0 | 1,410.42 ns | 14.162 ns | 13.247 ns | 1,408.20 ns | 1,389.74 ns | 1,435.97 ns | 0.1702 | - | 2.78 KB |
| OrderByValueType | 128 | 20 | 128 | 80.11 ns | 1.244 ns | 1.164 ns | 80.23 ns | 78.31 ns | 82.10 ns | 0.0740 | - | 1.21 KB |
| OrderByValueType | 256 | 10 | 0 | 1,591.40 ns | 17.533 ns | 16.400 ns | 1,589.09 ns | 1,567.72 ns | 1,615.40 ns | 0.3225 | - | 5.28 KB |
| OrderByValueType | 256 | 10 | 128 | 2,178.09 ns | 15.293 ns | 13.557 ns | 2,175.82 ns | 2,161.71 ns | 2,210.39 ns | 0.3210 | - | 5.31 KB |
| OrderByValueType | 256 | 20 | 0 | 1,701.12 ns | 18.830 ns | 16.692 ns | 1,694.16 ns | 1,682.97 ns | 1,737.79 ns | 0.3205 | - | 5.28 KB |
| OrderByValueType | 256 | 20 | 128 | 2,335.03 ns | 23.073 ns | 21.582 ns | 2,325.46 ns | 2,308.20 ns | 2,372.47 ns | 0.3231 | - | 5.31 KB |
| OrderByValueType | 512 | 10 | 0 | 2,995.15 ns | 22.336 ns | 19.801 ns | 2,995.85 ns | 2,951.57 ns | 3,026.48 ns | 0.6284 | 0.0119 | 10.28 KB |
| OrderByValueType | 512 | 10 | 128 | 3,744.72 ns | 28.800 ns | 25.530 ns | 3,743.69 ns | 3,712.89 ns | 3,808.58 ns | 0.6260 | - | 10.31 KB |
| OrderByValueType | 512 | 20 | 0 | 3,169.08 ns | 18.857 ns | 17.639 ns | 3,168.06 ns | 3,145.39 ns | 3,207.83 ns | 0.6181 | 0.0126 | 10.28 KB |
| OrderByValueType | 512 | 20 | 128 | 3,965.31 ns | 76.728 ns | 78.794 ns | 3,959.77 ns | 3,872.22 ns | 4,131.64 ns | 0.6277 | - | 10.31 KB |

~2-3x slower with the priority queue:

| Method | NumberOfPeople | AmountTaken | AmountSkipped | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
|----------------- |---------------:|------------:|--------------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|----------:|
| OrderByValueType | 512 | 20 | 0 | 6.230 us | 0.0480 us | 0.0425 us | 6.225 us | 6.187 us | 6.329 us | 0.0498 | 912 B |
| OrderByValueType | 512 | 20 | 128 | 12.693 us | 0.0361 us | 0.0320 us | 12.687 us | 12.650 us | 12.770 us | 0.2532 | 5040 B |

More results:

| Method | NumberOfPeople | AmountTaken | AmountSkipped | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
|----------------- |---------------:|------------:|--------------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|----------:|
| OrderByValueType | 128 | 10 | 0 | 1.929 us | 0.1280 us | 0.1474 us | 1.877 us | 1.779 us | 2.195 us | 0.0319 | 592 B |
| OrderByValueType | 128 | 10 | 128 | 4.551 us | 0.0481 us | 0.0427 us | 4.561 us | 4.490 us | 4.618 us | 0.2719 | 4640 B |
| OrderByValueType | 128 | 20 | 0 | 2.178 us | 0.0160 us | 0.0125 us | 2.177 us | 2.160 us | 2.194 us | 0.0519 | 912 B |
| OrderByValueType | 128 | 20 | 128 | 4.598 us | 0.0671 us | 0.0628 us | 4.562 us | 4.537 us | 4.701 us | 0.2854 | 4880 B |
| OrderByValueType | 256 | 10 | 0 | 3.096 us | 0.0116 us | 0.0102 us | 3.091 us | 3.085 us | 3.118 us | 0.0247 | 592 B |
| OrderByValueType | 256 | 10 | 128 | 7.838 us | 0.1283 us | 0.1200 us | 7.809 us | 7.712 us | 8.114 us | 0.2542 | 4720 B |
| OrderByValueType | 256 | 20 | 0 | 3.514 us | 0.0124 us | 0.0110 us | 3.514 us | 3.502 us | 3.536 us | 0.0420 | 912 B |
| OrderByValueType | 256 | 20 | 128 | 8.106 us | 0.1318 us | 0.1168 us | 8.065 us | 7.988 us | 8.361 us | 0.2905 | 5040 B |
| OrderByValueType | 512 | 10 | 0 | 5.772 us | 0.1035 us | 0.1062 us | 5.730 us | 5.703 us | 6.045 us | 0.0228 | 592 B |
| OrderByValueType | 512 | 10 | 128 | 12.315 us | 0.0979 us | 0.0868 us | 12.322 us | 12.154 us | 12.454 us | 0.2478 | 4720 B |
| OrderByValueType | 512 | 20 | 0 | 6.230 us | 0.0480 us | 0.0425 us | 6.225 us | 6.187 us | 6.329 us | 0.0498 | 912 B |
| OrderByValueType | 512 | 20 | 128 | 12.693 us | 0.0361 us | 0.0320 us | 12.687 us | 12.650 us | 12.770 us | 0.2532 | 5040 B |

Note that the heuristic is very simple at the moment, and the capacity passed to the priority collection is suboptimal when skipping large amounts, even when the count is known in advance.

I shelved it and came back to it multiple times but have not yet understood what is going on. How do you usually profile such things? Are you using ETWProfiler? Any pointers before I head off in the wrong direction would be appreciated :)

eiriktsarpalis commented 10 months ago

Would it be possible to share your changes?

akade commented 9 months ago

Would it be possible to share your changes?

Of course, I created a Draft PR for easy viewing: #97277

sfiruch commented 8 months ago

While heap sort has the (in a sense) optimal asymptotic complexity of O(n log n), the constant factor is fairly high. This is why many people tend to use sorting algorithms other than heap sort.

When n is small, insertion sort might even beat the PriorityQueue. People usually pick a threshold between 5 and 20 elements, below which they use insertion sort rather than QuickSort. Note that QuickSort usually has a lower constant factor than HeapSort, so I'd expect the threshold to be higher in our case.

In short: I'd use insertion sort when we know n < 50 (or so).
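
Something along these lines, keeping a bounded sorted buffer (a rough sketch; the names are illustrative, not proposed API):

using System;
using System.Collections.Generic;

static T[] TakeOrderedViaInsertion<T, TKey>(IEnumerable<T> source, Func<T, TKey> keySelector, int count)
{
    var comparer = Comparer<TKey>.Default;
    var buffer = new List<(T Item, TKey Key)>(count);

    foreach (T item in source)
    {
        TKey key = keySelector(item);

        // Once full, anything not strictly smaller than the current maximum
        // can be discarded immediately (>= rather than > keeps the sort stable).
        if (buffer.Count == count && comparer.Compare(key, buffer[count - 1].Key) >= 0)
        {
            continue;
        }

        // Linear scan from the back; for a small `count` this has a much
        // smaller constant factor than heap sift-downs.
        int pos = buffer.Count;
        while (pos > 0 && comparer.Compare(key, buffer[pos - 1].Key) < 0)
        {
            pos--;
        }
        buffer.Insert(pos, (item, key));

        if (buffer.Count > count)
        {
            buffer.RemoveAt(count); // drop the element that fell out of the top `count`
        }
    }

    T[] result = new T[buffer.Count];
    for (int i = 0; i < buffer.Count; i++)
    {
        result[i] = buffer[i].Item;
    }
    return result;
}

For very small counts the linear scan only touches a handful of adjacent elements, which is exactly where the heap's constant factors hurt most.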

LeaFrock commented 6 months ago

This does make sense when N is a large number and T is much smaller (T << N).

For time complexity, the win requires log(T) << log(N), which is a stronger condition than T << N alone (e.g., T = N/1000 still gives log(T) ≈ log(N) for large N). As PriorityQueue is backed by a quaternary (4-ary) heap, the different base $k$ of $\log_k(N)$ between heap sort and quick sort is also a factor for consideration. That should be a useful reference for the performance test cases and for finalizing the threshold.

For space complexity, it's really a brilliant idea; reducing the footprint from N to T is the most valuable change in my view.

The final result could look something like the following rough sketch (the threshold constant is a made-up placeholder, and the two helpers are the hypothetical sketches from earlier in this thread):
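
using System;
using System.Collections.Generic;

static IEnumerable<T> SortSkipTake<T, TKey>(IEnumerable<T> source, Func<T, TKey> keySelector, int skip, int take)
{
    // Made-up placeholder; the real value would have to be established by benchmarks.
    const int InsertionSortThreshold = 32;

    int window = checked(skip + take); // how many elements the selection must retain

    IEnumerable<T> smallest = window <= InsertionSortThreshold
        ? TakeOrderedViaInsertion(source, keySelector, window) // O(N * window), tiny constants
        : TakeOrderedViaMaxHeap(source, keySelector, window);  // O(N log window), O(window) space

    // The window already accounted for the skip, so just drop the first `skip` elements.
    int index = 0;
    foreach (T item in smallest)
    {
        if (index++ >= skip)
        {
            yield return item;
        }
    }
}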

The side effect is that it makes the source code much more complex 🤣.

PranavSenthilnathan commented 1 day ago

Just a small clarification: the current implementation is actually O(N + T log T), not O(N log N), because quickselect is O(N) and means only the T selected elements need to be sorted.

Since O(N log (S+T)) >= O(N log T), and both O(N log T) >= O(N) (for T >= 2) and O(N log T) >= O(T log T) (for T <= N), we get O(N log (S+T)) >= O(N + T log T), which means the PQ is never asymptotically better for any T or S.
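
Spelling that chain out (for $2 \le T \le N$, so that $\log T \ge 1$):

$$N \log(S+T) \;\ge\; N \log T \;=\; \tfrac{1}{2} N \log T + \tfrac{1}{2} N \log T \;\ge\; \tfrac{1}{2}\bigl(N + T \log T\bigr),$$

hence $N + T \log T = O(N \log(S+T))$.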

But as mentioned, constants do matter, so for very small T and S there could still be an optimization; even in that case, though, we might want to scope this to insertion sort + lookup as @LeaFrock suggested (Max/Min is just the degenerate case of this) instead of a priority queue.