fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/
Other
210 stars 56 forks source link

[Performance] Quantile functions are very slow if there are lots of repeated values. #190

Closed nhirschey closed 2 years ago

nhirschey commented 2 years ago

Describe the bug I found that if an array has lots of repeated values, Quantile.compute seems to go into an infinite loop when called. I intend to fix #186 this weekend, so I can look at this too, but it's not immediately obvious to me what the issue is. I assume it's something to do with Array.quickSelectInPlace.

To Reproduce

#r "nuget: FSharp.Stats"
open FSharp.Stats
let testArray = Array.append (Array.init 10_000 (fun _ -> 0.0)) [|1.0|]
Quantile.compute 0.5 testArray // seems to be in infinite loop

Compare with MathNet

#r "nuget: MathNet.Numerics"
#r "nuget: MathNet.Numerics.FSharp"

open MathNet.Numerics.Statistics

let testArray = Array.append (Array.init 10_000 (fun _ -> 0.0)) [|1.0|]
Statistics.quantileFunc testArray 0.5  // evaluates to 0.0 immediately
bvenn commented 2 years ago

I've tried your snippet on my machine (Win11Pro, x64, 64GB, i7-11800H @ 2.3GHz) and it returned the correct result (0.0) in 3.5 seconds. However, when the array length was increased to 100,000+1 the computation time increased to 7 minutes.

At least it seems not to be a bug, but in this special case the performance is at least 'improvable' and should be investigated in future (having a quick check on the functions the problem is not obvious to me either).

nhirschey commented 2 years ago

Ok, yeah a performance issue is more accurate.

The real life example is students doing this for an assignment on datasets with a few million observations (every student has a different dataset of real life data, for example many firms do not pay dividends so if you want percentiles of dividends, you will have data with a lot of zeros), and a few were telling me that they cannot calculate percentiles for their data.

nhirschey commented 2 years ago

Turns out my example is 8 seconds on my laptop (very far from infinite!).

I was doing most of my testing on million observation data where I didn’t have the patience to see it complete and I knew it should be basically instant, but “infinite loop” hyperbole was not helpful. Sorry!

bvenn commented 2 years ago

I've found, that a type definition here:

https://github.com/fslaborg/FSharp.Stats/blob/71a5a91fc097a38e02bc42bd10b69dfeff65461d/src/FSharp.Stats/Array.fs#L71

to ... (arr:float array) : float = resolves that issue (at least for floats ;-) ). I'll have a look at it later

bvenn commented 2 years ago

I improved the speed by moving the recursive loop within the function body to inline the function.

let inline quickSelectInPlaceWith left right k (arr:array<'T>) : 'T =  
    let rec loop left right k (arr:array<'T>) : 'T =
    ...
Array length Time
10,000 00:00.077 minutes (77 milliseconds)
100,000 00:06.000 minutes
1,000,000 12:54.000 minutes

However the main issue remains. It is a common problem for quickpartition/quicksort to handle duplicates. The fix is to refine the partitionSortInPlace and/or quickSelectInPlaceWith and to add a third partition called the fat partition to hold items that are equal to the current pivot.

References: https://cs.stackexchange.com/questions/110201/refine-quickselect-with-median-of-medians-to-deal-with-duplicates https://cs.stackexchange.com/questions/104816/implementation-of-quicksort-to-handle-duplicates https://cs.fit.edu/~pkc/classes/writing/papers/bentley93engineering.pdf

As this requires detailed research of the algorithm I'm not sure when I am able to refine the function.

nhirschey commented 2 years ago

I understand. Thank you for taking the time to add this speed improvement and for doing the research (and providing links) to references for future performance improvements.

For my own future reference, the relevant code in mathnet.numerics is likely here.

bvenn commented 2 years ago

One additional remark

Statistics.quantileFunc is equally fast to Quantile.OfSorted.compute including the required array sorting. When the array consists of random values the modified Quantile.compute is faster than Statistics.quantileFunc.

#time
#r "nuget: MathNet.Numerics"
#r "nuget: MathNet.Numerics.FSharp"
#r "nuget: FSharp.Stats, 0.4.4"
// for modified loop according to the comment above
// #r @"C:\Users\bvenn\source\repos\FSharp.Stats\bin\FSharp.Stats\netstandard2.0\FSharp.Stats.dll"

open FSharp.Stats
open MathNet.Numerics.Statistics

let testArrayZero = Array.init 10_000_000 (fun _ -> 0.)

let rnd = System.Random()
let testArrayRand= Array.init 10_000_000 (fun _ -> rnd.NextDouble())

//Test zero array
//Quantile.compute 0.5 testArrayZero 
Statistics.quantileFunc testArrayZero 0.5
Quantile.OfSorted.compute 0.5 (Array.sort testArrayZero)

//Test random array
Quantile.compute 0.5 testArrayRand 
Statistics.quantileFunc testArrayRand 0.5
Quantile.OfSorted.compute 0.5 (Array.sort testArrayRand)
Function Zeros Random values
Quantile.compute - 140 milliseconds (modified loop)
Statistics.quantileFunc 140 milliseconds 780 milliseconds
Quantile.OfSorted.compute 140 milliseconds 770 milliseconds

So for your cases it seems beneficial to use the Quantile.OfSorted.compute function and when dealing with sequences without duplicates the Quantile.compute is superior. I will add an information to the function description until the performance issue is fixed.