fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/
Other
205 stars 54 forks source link

1) Make Quantile.compute (and others) use OfSorted.* internally and allow sequence input #192

Closed nhirschey closed 2 years ago

nhirschey commented 2 years ago

Two changes:

  1. Based on @bvenn's great insight that the OfSorted quantile functions are fast this, PR improves performance of Quantile.compute (and others) by internally first using Array.sortInPlace on the input and then passing that to the OfSorted versions. Is there any reason not to do this? (fixes #190)
  2. Allow the Quantile.compute and similar functions to take a sequence input (fixes #186)

Description

With this build:

open FSharp.Stats

let testArray = Array.append (Array.init 10_000_000 (fun _ -> 0.0)) [|1.0; nan|]
#time "on"
Quantile.compute 0.5 testArray 
// Real: 00:00:00.264, CPU: 00:00:00.312, GC gen0: 0, gen1: 0, gen2: 0

let rnd = System.Random()
let testArrayRand= Array.init 10_000_000 (fun _ -> rnd.NextDouble())

Quantile.compute 0.5 testArrayRand 
// Real: 00:00:01.094, CPU: 00:00:01.140, GC gen0: 0, gen1: 0, gen2: 0

[Required] please make sure you checked that

[Optional]

bvenn commented 2 years ago

Thanks for reporting the issue and updating the XML comments regarding the sorting.

Regarding 1: The performance of Quantile.compute now seems decreased due to the sorting step, that makes the function always slower than it's alternatives. The original Quantile.compute function is quick (as long as there are no duplicates) because there is no need for prior sorting steps.

Regarding 2: The generalization of the function is good but if called many times, the toArray conversion could slow down the evaluation if the input type already is of type array. A possible solution would be to check the input sequence type with .GetType().isArray and if true, no Seq.toArray is applied. But I have to check the performance difference if my hypothesis is true.

Thanks again, I'll come back at you soon :rocket:

nhirschey commented 2 years ago

Ok, fair enough. I understand you wanting to have the fast version available. I just pushed a changed that did the following:

  1. Reverted the changes that were always sorting the data and using OfSorted versions in order to recover the lost performance.
  2. I added a Seq.UtilityFunctions.toArrayQuick so that Array inputs won't be converted, taken from Don here so it's presumably the fastest way to do this.

So in summary, the changes in this PR now are 1) fixes comments and 2) allow seq input. In my unscientific tests it looks like it's back to the same performance as the developer branch on array inputs.

Thank you for your thoughtful consideration of my issues this week; I know I've already burnt a lot of your time on this!

Edit: I thought I marked Seq.UtilityFunctions.toArrayQuick as internal but I didn't. I'll leave as is so that I don't keep bothering you with notifications, but feel free to use this PR, reject it, or modify it in any way at your leisure.

bvenn commented 2 years ago

I'm always happy if FSharp.Stats is improved and I learned much on this journey, so it never is wasted time :+1: You didn't forget the internal, it is there :wink: Thanks for your time and work.

bvenn commented 2 years ago

To boost the performance further I added a toArrayCopyQuick function. If the input sequence is not an array, a new array is created and there is no need for copying. If the input is an array, it should be copied to not interfere with the in-place operations.

Additionally, I added unit tests for most of the Quantile functions so finally these module is covered :+1:

nhirschey commented 2 years ago

toArrayCopyQuick is a great idea.