Our statistics functions need some more love. We used to have many missing types (mostly fixed by https://github.com/Kotlin/dataframe/pull/937), but there are still some inconsistencies to resolve:
- As mentioned in https://github.com/Kotlin/dataframe/issues/543, some functions, like median(ints), might return an unexpectedly rounded Int. It might be better to let all functions return Double and handle BigInteger / BigDecimal separately, as they're Java-specific for now.
- There are plenty of public overloads on Iterable and Sequence. It's fine to have them internally, but I feel like we're clogging the public scope here. mean, for instance, is already covered in the stdlib (as average()).
- We'll need to hide public functions that are not on DataColumn, as @AndreiKingsley will probably make a statistics library for that anyway.
- We need to honor a conversion table (see below).

| Function | Conversion | Extra information | Nulls in input |
|---|---|---|---|
| mean | Int -> Double | | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | BigInteger -> BigDecimal? | null instead of NaN in output | |
| | BigDecimal -> BigDecimal? | null instead of NaN in output | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (NaN) | | |
| sum | Int -> Int | All default to zero if no values | All nulls are filtered out |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Float | skipNaN option, false by default | |
| | BigInteger -> BigInteger | | |
| | BigDecimal -> BigDecimal | | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (0.0) | | |
| cumSum | Int -> Int | All default to zero if no values | All can optionally skip nulls in input with skipNull option, true by default |
| | Short -> Int | | important because order matters with cumSum |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, true by default | |
| | Float -> Float | skipNaN option, true by default | |
| | BigInteger -> BigInteger | | |
| | BigDecimal -> BigDecimal | | |
| | Number -> Double | skipNaN option, true by default | |
| | Nothing / no values -> Double (0.0) | | |
| min/max | T -> T? where T : Comparable\<T> | For all: null if no elements | All nulls are filtered out |
| | Int -> Int? | | |
| | Short -> Short? | | |
| | Byte -> Byte? | | |
| | Long -> Long? | | |
| | Double -> Double? | If has NaN, result will be NaN, needs skipNaN option? | |
| | Float -> Float? | If has NaN, result will be NaN, needs skipNaN option? | |
| | BigInteger -> BigInteger? | | |
| | BigDecimal -> BigDecimal? | | |
| | Number -> Double? | If has NaN, result will be NaN, needs skipNaN option? | |
| | Nothing / no values -> Double? (null) | | |
| | (Don't convert Short/Byte to Int!) | | |
| median | T -> T? where T : Comparable\<T> | For all: median of even list will cause conversion to Double | |

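
To make a few of these conversions concrete, here is a minimal sketch in plain Kotlin. The helpers `meanOrNaN`, `sumOfShorts`, and `medianOfInts` are hypothetical names invented for illustration; they are not part of the DataFrame API and only mirror what the table proposes. `medianOfInts` assumes the "always return Double" idea from the list above rather than the `T -> T?` row.

```kotlin
// Hypothetical helpers for illustration only; not the DataFrame implementation.

/** mean: any Number is widened to Double, nulls are filtered out, no values gives NaN. */
fun meanOrNaN(values: Iterable<Number?>, skipNaN: Boolean = false): Double {
    var sum = 0.0
    var count = 0
    for (value in values) {
        val d = value?.toDouble() ?: continue   // nulls are filtered out
        if (skipNaN && d.isNaN()) continue      // NaN is dropped only when requested
        sum += d
        count++
    }
    return if (count == 0) Double.NaN else sum / count
}

/** sum: Short is widened to Int, nulls are filtered out, no values gives zero. */
fun sumOfShorts(values: Iterable<Short?>): Int {
    var total = 0
    for (value in values) {
        if (value != null) total += value       // Int + Short yields Int
    }
    return total
}

/** median: null if no elements; an even count averages the two middle values. */
fun medianOfInts(values: Iterable<Int?>): Double? {
    val sorted = values.filterNotNull().sorted()
    if (sorted.isEmpty()) return null
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) {
        sorted[mid].toDouble()
    } else {
        (sorted[mid - 1].toDouble() + sorted[mid].toDouble()) / 2
    }
}

fun main() {
    println(meanOrNaN(listOf(1, 2, null)))                        // 1.5
    println(meanOrNaN(listOf(1.0, Double.NaN)))                   // NaN
    println(meanOrNaN(listOf(1.0, Double.NaN), skipNaN = true))   // 1.0
    println(sumOfShorts(listOf(1.toShort(), 2.toShort(), null)))  // 3
    println(medianOfInts(listOf(1, 2)))                           // 1.5
}
```

The last call shows why returning Double matters: an Int-typed median of `[1, 2]` would have to round 1.5 to a whole number.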
Continuation of https://github.com/Kotlin/dataframe/issues/558, which fixed the most annoying bugs related to `describe`. See https://github.com/Kotlin/dataframe/issues/558 for more information.