Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

☂ Statistics streamlining #961

Open Jolanrensen opened 1 day ago

Jolanrensen commented 1 day ago

Continuation of https://github.com/Kotlin/dataframe/issues/558, which fixed the most annoying bugs related to describe.

See https://github.com/Kotlin/dataframe/issues/558 for more information.

Our statistics functions need some more love. We used to have many missing types (mostly fixed by https://github.com/Kotlin/dataframe/pull/937), but there are still some inconsistencies to resolve:

As mentioned in https://github.com/Kotlin/dataframe/issues/543, some functions like median(ints) might return an unexpectedly rounded Int. It might be better to let all functions return Double and handle BigInteger / BigDecimal separately for now, as they're Java-specific.
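The rounding concern can be illustrated with plain Kotlin. This is a minimal sketch and not the library's implementation: `medianAsInt` and `medianAsDouble` are hypothetical helpers written only to contrast an Int-returning median with a Double-returning one.

```kotlin
// Hypothetical helper: a median that stays in Int. For an even-sized input
// the two middle values are averaged with integer division, which truncates.
fun medianAsInt(values: List<Int>): Int? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2
}

// Hypothetical helper: the same median, but always returned as Double,
// so the fractional part of an even-sized input is kept.
fun medianAsDouble(values: List<Int>): Double? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) {
        sorted[mid].toDouble()
    } else {
        (sorted[mid - 1].toDouble() + sorted[mid]) / 2.0
    }
}

fun main() {
    val ints = listOf(1, 2, 3, 4)
    println(medianAsInt(ints))    // 2
    println(medianAsDouble(ints)) // 2.5
}
```

Both helpers return null for an empty input, in line with the "null if no elements" rule proposed for median in the table below.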

There are plenty of public overloads on Iterable and Sequence. It's fine to have them internally, but I feel like we're clogging the public scope here. mean, for instance, is already covered in the stdlib.
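For reference, the stdlib counterpart is named `average()` rather than `mean`, and it is available on both Iterable and Sequence of the primitive numeric types. The short sketch below only uses stdlib calls; it shows that the result is always a Double and that an empty input yields NaN.

```kotlin
fun main() {
    // Int elements, Double result
    println(listOf(1, 2, 3).average())         // 2.0

    // An empty collection yields NaN instead of throwing
    println(emptyList<Int>().average())        // NaN

    // The same function exists on Sequence
    println(sequenceOf(1.5f, 2.5f).average())  // 2.0
}
```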

We'll need to hide public functions that are not on DataColumn as @AndreiKingsley will probably make a statistics library for that anyway.

We need to honor a conversion table (see below):

| Function | Conversion | Extra information | Nulls in input |
|---|---|---|---|
| mean | Int -> Double | | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | BigInteger -> BigDecimal? | null instead of NaN in output | |
| | BigDecimal -> BigDecimal? | null instead of NaN in output | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (NaN) | | |
| sum | Int -> Int | All default to zero if no values | All nulls are filtered out |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Float | skipNaN option, false by default | |
| | BigInteger -> BigInteger | | |
| | BigDecimal -> BigDecimal | | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (0.0) | | |
| cumSum | Int -> Int | All default to zero if no values | All can optionally skip nulls in input with skipNull option, true by default |
| | Short -> Int | | important because order matters with cumSum |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, true by default | |
| | Float -> Float | skipNaN option, true by default | |
| | BigInteger -> BigInteger | | |
| | BigDecimal -> BigDecimal | | |
| | Number -> Double | skipNaN option, true by default | |
| | Nothing / no values -> Double (0.0) | | |
| min/max | T -> T? where T : Comparable\<T> | For all: null if no elements | All nulls are filtered out |
| | Int -> Int? | | |
| | Short -> Short? | | |
| | Byte -> Byte? | | |
| | Long -> Long? | | |
| | Double -> Double? | If has NaN, result will be NaN, needs skipNaN option? | |
| | Float -> Float? | If has NaN, result will be NaN, needs skipNaN option? | |
| | BigInteger -> BigInteger? | | |
| | BigDecimal -> BigDecimal? | | |
| | Number -> Double? | If has NaN, result will be NaN, needs skipNaN option? | |
| | Nothing / no values -> Double? (null) | | |
| | (Don't convert Short/Byte to Int!) | | |
| median | T -> T? where T : Comparable\<T> | For all: median of even list will cause conversion to Double | All nulls are filtered out |
| | Int -> Double? | and null if no elements | |
| | Short -> Double? | | |
| | Byte -> Double? | | |
| | Long -> Double? | | |
| | Double -> Double? | | |
| | Float -> Double? | | |
| | BigInteger -> BigDecimal? | | |
| | BigDecimal -> BigDecimal? | | |
| | Number -> Double? | | |
| | Nothing / no values -> Double? (null) | | |
| std | Int -> Double | All have DDoF (Delta Degrees of Freedom) argument | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | BigInteger -> BigDecimal? | null instead of NaN in output | |
| | BigDecimal -> BigDecimal? | null instead of NaN in output | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (NaN) | | |
| var (want to add?) | same as std | | |
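A few of the rules above can be sketched in plain Kotlin to make the intended behavior concrete. This is only an illustration under stated assumptions, not the DataFrame API: `meanOf`, `sumOfShorts`, and `stdOf` are hypothetical names, and the default `ddof = 1` is an assumption that the issue does not specify.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper for the mean rows with Double input:
// nulls are always dropped, NaN values are dropped only when skipNaN = true.
fun meanOf(values: Iterable<Double?>, skipNaN: Boolean = false): Double {
    val kept = values.filterNotNull().filter { !(skipNaN && it.isNaN()) }
    // average() returns NaN for an empty list: "no values -> Double (NaN)"
    return kept.average()
}

// Hypothetical helper for the sum row "Short -> Int":
// each element is widened to Int, and an empty input yields 0.
fun sumOfShorts(values: Iterable<Short?>): Int =
    values.filterNotNull().sumOf { it.toInt() }

// Hypothetical helper for the std rows. ddof is the Delta Degrees of Freedom;
// the default of 1 is an assumption made for this sketch.
fun stdOf(values: Iterable<Double?>, skipNaN: Boolean = false, ddof: Int = 1): Double {
    val kept = values.filterNotNull().filter { !(skipNaN && it.isNaN()) }
    if (kept.size <= ddof) return Double.NaN
    val mean = kept.average()
    val squaredDeviations = kept.sumOf { (it - mean) * (it - mean) }
    return sqrt(squaredDeviations / (kept.size - ddof))
}

fun main() {
    val data = listOf(1.0, null, Double.NaN, 3.0)
    println(meanOf(data))                  // NaN, because skipNaN is false by default
    println(meanOf(data, skipNaN = true))  // 2.0

    println(sumOfShorts(listOf(1.toShort(), null, 2.toShort()))) // 3

    val sample = listOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    println(stdOf(sample, ddof = 0))       // 2.0
}
```

Keeping `skipNaN` false by default means a NaN in the input propagates to the result, which matches the table's proposal for mean, sum, and std; cumSum is the exception, where the table proposes true by default.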