fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/

Unify and fix rank module result types #183

Closed bvenn closed 2 years ago

bvenn commented 2 years ago

There are 4 ranking methods in FSharp.Stats.Rank: rankFirst, rankMin, rankMax, and rankAvg.

let example = [|5.;3.;3.;4.;2.|]
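
To make the difference between the four tie rules concrete, here is a minimal sketch applied to the example above. It is not the FSharp.Stats implementation: `TieRule` and `sketchRank` are hypothetical names, the approach is a simple O(n²) count, and it assumes the input contains no nan values.

```fsharp
// Illustrative sketch only; TieRule and sketchRank are hypothetical names,
// not part of the FSharp.Stats API. Assumes the input contains no nan.
type TieRule = First | Min | Max | Avg

let sketchRank (rule: TieRule) (data: float[]) : float[] =
    data
    |> Array.mapi (fun i x ->
        // Number of elements strictly smaller than x.
        let less = data |> Array.filter (fun y -> y < x) |> Array.length
        // Number of elements equal to x (including x itself).
        let equal = data |> Array.filter (fun y -> y = x) |> Array.length
        // Number of equal elements that occur before position i.
        let equalBefore =
            Array.take i data |> Array.filter (fun y -> y = x) |> Array.length
        match rule with
        | First -> float (less + 1 + equalBefore) // ties ranked by order of occurrence
        | Min   -> float (less + 1)               // ties share the lowest rank
        | Max   -> float (less + equal)           // ties share the highest rank
        | Avg   -> float less + float (equal + 1) / 2.) // ties share the mean rank

let example = [| 5.; 3.; 3.; 4.; 2. |]

printfn "%A" (sketchRank First example) // [|5.0; 2.0; 3.0; 4.0; 1.0|]
printfn "%A" (sketchRank Min example)   // [|5.0; 2.0; 2.0; 4.0; 1.0|]
printfn "%A" (sketchRank Max example)   // [|5.0; 3.0; 3.0; 4.0; 1.0|]
printfn "%A" (sketchRank Avg example)   // [|5.0; 2.5; 2.5; 4.0; 1.0|]
```

Only the Avg rule can produce a non-integer rank (2.5 here).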

  1. Currently, all functions except rankFirst return float arrays. The only function in which non-integer ranks can occur is rankAvg. For consistency, I suggest that rankFirst should also return a float array, although this would be a breaking change.

  2. Ties do not seem to be ranked correctly here: https://github.com/fslaborg/FSharp.Stats/blob/46ab30aa63dd353ce1700de5d6b14f2115f29ca3/src/FSharp.Stats/Rank.fs#L65 It seems the +1 increment belongs to rankMax rather than rankMin. EDIT 01/02/22: THIS WAS A TEMPORARY LOCAL ERROR!

A fix is on the way!

bvenn commented 2 years ago

  1. In line 23 there is abs (a-b) <= 0. Since an absolute value cannot be less than 0, it can be changed to abs (a-b) = 0 https://github.com/fslaborg/FSharp.Stats/blob/46ab30aa63dd353ce1700de5d6b14f2115f29ca3/src/FSharp.Stats/Rank.fs#L23

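A quick sanity check of that equivalence, as a small sketch: `isTieOld` and `isTieNew` are hypothetical names that stand in for the two forms of the comparison, and the sample values are chosen to include infinities and nan.

```fsharp
// Hypothetical stand-ins for the comparison before and after the change.
let isTieOld (a: float) (b: float) = abs (a - b) <= 0.
let isTieNew (a: float) (b: float) = abs (a - b) = 0.

let samples = [| -infinity; -1.5; 0.; 3.; infinity; nan |]

// Check every pair of sample values, including pairs that involve nan.
let allAgree =
    Array.allPairs samples samples
    |> Array.forall (fun (a, b) -> isTieOld a b = isTieNew a b)

printfn "%b" allAgree // true
```

Pairs that involve nan (or infinity - infinity, which is nan) yield false under both forms, so the change does not alter behavior on these samples.
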
bvenn commented 2 years ago

Apparently, restarting the computer fixed issue 2. Unit tests are added and changes are on their way.

bvenn commented 2 years ago

How to handle nan and infinity values within the sequence?

At this moment the ranking order is as follows:

  1. nan
  2. -infinity
  3. all real numbers
  4. infinity

nans and infinities are treated as individual elements:

let example = [|-infinity; 1.; nan; infinity; infinity|]
rankFirst   = [|        2;  3;   1;        4;        5|]
rankMin     = [|        2;  3;   1;        4;        5|]
rankMax     = [|        2;  3;   1;        4;        5|]
rankAvg     = [|        2;  3;   1;        4;        5|]
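
The ordering listed above matches what `Double.CompareTo` defines in .NET, where nan compares as less than every other value, including -infinity. The following sketch reproduces the rankFirst row under that assumption. `rankFirstCompareTo` is a hypothetical helper and not the code in Rank.fs.

```fsharp
// Hypothetical helper, not the FSharp.Stats implementation.
// Orders values with Double.CompareTo, which places nan before -infinity.
let rankFirstCompareTo (data: float[]) : float[] =
    let ranks = Array.create data.Length 0.
    data
    |> Array.indexed
    // Break ties by original position so equal values keep their order of occurrence.
    |> Array.sortWith (fun (i, (a: float)) (j, (b: float)) ->
        match a.CompareTo(b) with
        | 0 -> compare i j
        | c -> c)
    // The position in the sorted order (1-based) is the rank of the original element.
    |> Array.iteri (fun r (i, _) -> ranks.[i] <- float (r + 1))
    ranks

let example = [| -infinity; 1.; nan; infinity; infinity |]

printfn "%A" (rankFirstCompareTo example) // [|2.0; 3.0; 1.0; 4.0; 5.0|]
```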

How do others handle nans?

Suggestion

I would recommend assigning nan ranks to nan values by default.

let example = [|-infinity; 1.; nan; infinity; infinity|]
rankFirst   = [|        1;  2; nan;        3;        4|]
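
One possible way to get this behavior, as a sketch rather than the eventual fix: rank only the non-nan values and leave nan at the positions of the nan inputs. `rankFirstNanAware` is a hypothetical name, and the result is a float array because nan has no integer representation.

```fsharp
// Hypothetical helper, not the FSharp.Stats implementation.
// nan inputs receive a nan rank; all other values are ranked among themselves.
let rankFirstNanAware (data: float[]) : float[] =
    // Start with nan everywhere; positions of non-nan values are overwritten below.
    let ranks = Array.create data.Length nan
    data
    |> Array.indexed
    |> Array.filter (fun (_, x) -> not (System.Double.IsNaN x))
    // Sort by value, then by original position, so ties follow order of occurrence.
    |> Array.sortBy (fun (i, x) -> (x, i))
    |> Array.iteri (fun r (i, _) -> ranks.[i] <- float (r + 1))
    ranks

let example = [| -infinity; 1.; nan; infinity; infinity |]

printfn "%A" (rankFirstNanAware example) // [|1.0; 2.0; nan; 3.0; 4.0|]
```

Filtering before sorting means the result does not depend on where the default sort places nan.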

What do you think? @muehlhaus @kMutagene @ZimmerD

muehlhaus commented 2 years ago

Yes, I think your suggestion is very good.

bvenn commented 2 years ago

I've just created an update-rank branch to solve all issues. By default, nan is sorted to the start of a sequence. This breaks the loop in the current implementation. There are two possibilities to solve it: