Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
822 stars 58 forks source link

Add support for percentiles #543

Open tklinchik opened 9 months ago

tklinchik commented 9 months ago

It appears that API already supports median() and medianFor() but not arbitrary percentiles. To make it on par with other DataFrame APIs it would desirable to have support for percentile(percentile = 0.95), etc.

Jolanrensen commented 9 months ago

Would that be comparable to quantile() in Pandas? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html

tklinchik commented 9 months ago

Yes, that would be ideal

cmelchior commented 8 months ago

I was looking a bit into this because it sounded like something that should be easy to add, but I ran into a few peculiarities.

This is the implementation for median:

@PublishedApi
internal inline fun <reified T : Comparable<T>> Iterable<T?>.median(type: KType): T? {
    val list = if (type.isMarkedNullable) filterNotNull() else (this as Iterable<T>).asList()
    val size = list.size
    if (size == 0) return null
    val index = size / 2
    if (index == 0 || size % 2 == 1) return list.quickSelect(index)
    return when (type.classifier) {
        Double::class -> ((list.quickSelect(index - 1) as Double + list.quickSelect(index) as Double) / 2.0) as T
        Int::class -> ((list.quickSelect(index - 1) as Int + list.quickSelect(index) as Int) / 2) as T
        Long::class -> ((list.quickSelect(index - 1) as Long + list.quickSelect(index) as Long) / 2L) as T
        Byte::class -> ((list.quickSelect(index - 1) as Byte + list.quickSelect(index) as Byte) / 2).toByte() as T
        BigDecimal::class -> ((list.quickSelect(index - 1) as BigDecimal + list.quickSelect(index) as BigDecimal) / BigDecimal(2)) as T
        else -> list.quickSelect(index - 1)
    }
}

For some reason, it doesn't handle Short, Float, and BigInteger, which leads to funny examples like this one, where the same data will return different values depending on the type:

    @Test
    fun `median of two columns`() {
        val df = dataFrameOf("a", "b")(
            1.0, 4.0,
            2.0, 6.0,
            7.0, 7.0,
            7.0, 7.0,
        )
        df.median("a", "b") shouldBe 5
    }

    @Test
    fun `median of Short values`() {
        val df = dataFrameOf("a", "b")(
            1.toShort(), 4.toShort(),
            2.toShort(), 6.toShort(),
            7.toShort(), 7.toShort()
        )
        df.median("a", "b") shouldBe 4
    }

Also, the code for Int will round data in perhaps slightly surprising ways, at least it doesn't match any of the Interpolation modes available in Panda. The reason for this seems to be that the result of the median calculation is cast back to Int even though Double would probably be a more accurate answer, i.e. this example:

    @Test
    fun `median of two columns`() {
        val df = dataFrameOf("a", "b")(
            1, 4,
            2, 6,
            9, 10,
            11, 12,
        )
        df.median("a", "b") shouldBe 7 // For pandas, lower = 6, higher = 9, midpoint = 7.5  
    }

This begs the question of what to do with a quantile method (which is just a generalization of median).

Would it make sense to be able to specify the return value as well or should people convert their List<T> before calling median?

I.e. something like this would be the most flexible:

public enum class Interpolation {
    LINEAR, LOWER, HIGHER, NEAREST, MIDPOINT
}
internal inline fun <reified T : Comparable<T>, reified R: Any> Iterable<T?>.quantile(type: KType, q: Float, interpolation: Interpolation = Interpolation.LINEAR): R? { TODO() }

// Usage
val median: Double = dataframe.quantile<Double>(q = 0.5) 
tklinchik commented 8 months ago

I believe the right way to approach this is to make quantile result to be always Double type since hard to imagine a limited use case for quantiles being of the same type as input. In a perfect world you'd even specify an interpolation method to use for determining a quantile as Pandas does it but midpoint or linear would be a good starting point. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

Also, worth mentioning that for large data frames it would be beneficial to request multiple quantiles in one pass such as:

fun quantile(quantiles: List<Double>): Map<Double, Double> since you'd limit the number of times you sort the values to being done only once. However, not sure how that impacts the DataFrame DSL though as it would likely make it more complicated.

cmelchior commented 8 months ago

For what it is worth, mean() also always returns Double, but for some reason, it doesn't support BigDecimal/BigInteger either (calculating the mean on BigDecimal would also need to return a BigDecimal)

So it looks like the API is going to be inconsistent no matter the solution. At least if we want to avoid a breaking change.

tklinchik commented 8 months ago

I agree, seems that if BigDecimal support is needed it may require a special treatment but everything else should be covered by Double. Another option is to make quantiles always return BigDecimal to make it consistent but that might be overkill.

Jolanrensen commented 8 months ago

@cmelchior I indeed also found we're missing some types in (some of) the statistics functions https://github.com/Kotlin/dataframe/issues/558.

I think returning Double's everywhere is fine as that's used by many libraries. BigDecimal/BigInteger are java-specific, so I don't mind if they get their own special treatment. That might also make the switch to multiplatform easier if we ever attempt that.

cmelchior commented 8 months ago

@Jolanrensen What is your stance on breaking changes? Would it be okay to change median() to return Double for Byte/Short/Int/Long/Float/Double?

Since median() also works on all Comparables (including String). These non-number types would probably just keep the existing behaviour?

Jolanrensen commented 8 months ago

@cmelchior I think the expected behaviour for users would be:

Double, Int, Long, Byte, Short, Float -> median as Double BigDecimal, BigInteger -> median as BigDecimal Number -> attempt .toDouble() and run median as Double Other comparable -> median as original type, middle of the list preferring the first of 2.

Might be best to implement this in a separate PR first though, as it solves a specific part of https://github.com/Kotlin/dataframe/issues/558 and it's indeed a breaking change, but for the better :) (hopefully, haha)

tklinchik commented 8 months ago

Here is my workaround at the moment that could be helpful:

import org.apache.commons.math3.stat.StatUtils

fun DataColumn<*>.quantile(quantile: Double): Double? {
    check(quantile >= 0.0 && quantile <= 1.0) { quantile }
    val values = when (typeClass) {
        Int::class    -> (values as Iterable<Int?>).filterNotNull().toList().map { it.toDouble() }
        Long::class   -> (values as Iterable<Long?>).filterNotNull().toList().map { it.toDouble() }
        Double::class -> (values as Iterable<Double?>).filterNotNull().toList()
        else          -> throw IllegalArgumentException("Not supported: $typeClass")
    }.toTypedArray().run { toDoubleArray() } 
    return StatUtils.percentile(values, quantile * 100.0)
}

fun DataFrame<*>.quantile(name: String, quantile: Double): Double? = this.get(name).quantile(quantile)

That allows me to use it as standalone and within aggregate {}