Closed nhirschey closed 2 years ago
I've tried your snippet on my machine (Win11Pro, x64, 64GB, i7-11800H @ 2.3GHz) and it returned the correct result (0.0) in 3.5 seconds. However, when the array length was increased to 100,000+1 the computation time increased to 7 minutes.
At least it seems not to be a bug, but in this special case the performance is at least 'improvable' and should be investigated in future (having a quick check on the functions the problem is not obvious to me either).
Ok, yeah a performance issue is more accurate.
The real life example is students doing this for an assignment on datasets with a few million observations (every student has a different dataset of real life data, for example many firms do not pay dividends so if you want percentiles of dividends, you will have data with a lot of zeros), and a few were telling me that they cannot calculate percentiles for their data.
Turns out my example is 8 seconds on my laptop (very far from infinite!).
I was doing most of my testing on million observation data where I didn’t have the patience to see it complete and I knew it should be basically instant, but “infinite loop” hyperbole was not helpful. Sorry!
I've found, that a type definition here:
to ... (arr:float array) : float =
resolves that issue (at least for floats ;-) ). I'll have a look at it later
I improved the speed by moving the recursive loop within the function body to inline the function.
let inline quickSelectInPlaceWith left right k (arr:array<'T>) : 'T =
let rec loop left right k (arr:array<'T>) : 'T =
...
Array length | Time |
10,000 | 00:00.077 minutes (77 milliseconds) |
100,000 | 00:06.000 minutes |
1,000,000 | 12:54.000 minutes |
However the main issue remains. It is a common problem for quickpartition/quicksort to handle duplicates. The fix is to refine the partitionSortInPlace
and/or quickSelectInPlaceWith
and to add a third partition called the fat partition to hold items that are equal to the current pivot.
References: https://cs.stackexchange.com/questions/110201/refine-quickselect-with-median-of-medians-to-deal-with-duplicates https://cs.stackexchange.com/questions/104816/implementation-of-quicksort-to-handle-duplicates https://cs.fit.edu/~pkc/classes/writing/papers/bentley93engineering.pdf
As this requires detailed research of the algorithm I'm not sure when I am able to refine the function.
I understand. Thank you for taking the time to add this speed improvement and for doing the research (and providing links) to references for future performance improvements.
For my own future reference, the relevant code in mathnet.numerics is likely here.
Statistics.quantileFunc
is equally fast to Quantile.OfSorted.compute
including the required array sorting.
When the array consists of random values the modified Quantile.compute
is faster than Statistics.quantileFunc
.
#time
#r "nuget: MathNet.Numerics"
#r "nuget: MathNet.Numerics.FSharp"
#r "nuget: FSharp.Stats, 0.4.4"
// for modified loop according to the comment above
// #r @"C:\Users\bvenn\source\repos\FSharp.Stats\bin\FSharp.Stats\netstandard2.0\FSharp.Stats.dll"
open FSharp.Stats
open MathNet.Numerics.Statistics
let testArrayZero = Array.init 10_000_000 (fun _ -> 0.)
let rnd = System.Random()
let testArrayRand= Array.init 10_000_000 (fun _ -> rnd.NextDouble())
//Test zero array
//Quantile.compute 0.5 testArrayZero
Statistics.quantileFunc testArrayZero 0.5
Quantile.OfSorted.compute 0.5 (Array.sort testArrayZero)
//Test random array
Quantile.compute 0.5 testArrayRand
Statistics.quantileFunc testArrayRand 0.5
Quantile.OfSorted.compute 0.5 (Array.sort testArrayRand)
Function | Zeros | Random values |
Quantile.compute | - | 140 milliseconds (modified loop) |
Statistics.quantileFunc | 140 milliseconds | 780 milliseconds |
Quantile.OfSorted.compute | 140 milliseconds | 770 milliseconds |
So for your cases it seems beneficial to use the Quantile.OfSorted.compute
function and when dealing with sequences without duplicates the Quantile.compute
is superior. I will add an information to the function description until the performance issue is fixed.
Describe the bug I found that if an array has lots of repeated values, Quantile.compute seems to go into an infinite loop when called. I intend to fix #186 this weekend, so I can look at this too, but it's not immediately obvious to me what the issue is. I assume it's something to do with
Array.quickSelectInPlace
.To Reproduce
Compare with MathNet