Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Research: `ColumnDataHolder`/primitive arrays #712

Open Jolanrensen opened 4 weeks ago

Jolanrensen commented 4 weeks ago

Fixes https://github.com/Kotlin/dataframe/issues/30, one of our oldest issues.

I introduced `ColumnDataHolder` to replace the `List` in `DataColumnImpl`. This interface defines how a column's data is stored. `ColumnDataHolderImpl` is the default implementation; it stores data in primitive arrays whenever possible. Other implementations may become possible in the future as well (for instance, to make DataFrame act on top of an existing database).
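To illustrate the idea, here is a minimal sketch of a storage abstraction that picks a primitive array when it can and falls back to a list otherwise. All names (`ColumnStorage`, `IntArrayStorage`, `DoubleArrayStorage`, `ListStorage`, `storageOf`) are hypothetical and chosen for illustration; the actual `ColumnDataHolder` interface in this PR may look quite different.

```kotlin
// Hypothetical names for illustration only; this is not the DataFrame API.
interface ColumnStorage<T> {
    val size: Int
    operator fun get(index: Int): T
    fun toList(): List<T> = List(size) { get(it) }
}

/** Stores non-null Int values unboxed in an IntArray. */
class IntArrayStorage(private val data: IntArray) : ColumnStorage<Int> {
    override val size: Int get() = data.size
    override fun get(index: Int): Int = data[index]
}

/** Stores non-null Double values unboxed in a DoubleArray. */
class DoubleArrayStorage(private val data: DoubleArray) : ColumnStorage<Double> {
    override val size: Int get() = data.size
    override fun get(index: Int): Double = data[index]
}

/** Fallback for nullable or non-primitive element types. */
class ListStorage<T>(private val data: List<T>) : ColumnStorage<T> {
    override val size: Int get() = data.size
    override fun get(index: Int): T = data[index]
}

/** Chooses primitive storage when every element is a non-null Int or Double. */
@Suppress("UNCHECKED_CAST")
fun <T> storageOf(values: List<T>): ColumnStorage<T> = when {
    values.isEmpty() -> ListStorage(values)
    values.all { it is Int } ->
        IntArrayStorage(IntArray(values.size) { values[it] as Int }) as ColumnStorage<T>
    values.all { it is Double } ->
        DoubleArrayStorage(DoubleArray(values.size) { values[it] as Double }) as ColumnStorage<T>
    else -> ListStorage(values)
}
```

With this sketch, `storageOf(listOf(1, 2, 3))` would be backed by an `IntArray`, while `storageOf(listOf(1, null, 3))` would keep the original list because `null` cannot be stored in a primitive array.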

Things to be done:

zaleslaw commented 3 weeks ago

Very interesting idea and a step forward for performance. Before going deep into the implementation, I suggest starting with a synthetically generated DataFrame with 1–10 columns of Ints, Longs, or similar (preferably all of the same type) and measuring the average time and memory footprint of some performance-critical operations.

I see how we could save memory, but I'm not sure about the speed of operations.
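As a rough back-of-the-envelope check on the memory side, the sketch below estimates the heap cost of `n` Ints stored unboxed versus boxed. The constants are assumptions about a typical 64-bit HotSpot JVM with compressed oops (16-byte array header, 4-byte references, 16-byte `Integer` objects); other JVMs or settings will differ, and the estimate ignores the small-value `Integer` cache and the list wrapper object. The function names are made up for this example.

```kotlin
// Assumed object layout: 64-bit HotSpot with compressed oops enabled.
const val ARRAY_HEADER_BYTES = 16L
const val REFERENCE_BYTES = 4L
const val BOXED_INT_BYTES = 16L // 12-byte object header + 4-byte int value

/** Approximate heap bytes for an IntArray holding [n] elements. */
fun intArrayBytes(n: Long): Long = ARRAY_HEADER_BYTES + 4L * n

/** Approximate heap bytes for a reference array of [n] distinct boxed Ints. */
fun boxedIntBytes(n: Long): Long =
    ARRAY_HEADER_BYTES + (REFERENCE_BYTES + BOXED_INT_BYTES) * n
```

Under these assumptions, one million Ints take about 4 MB unboxed and about 20 MB boxed, which is only an estimate and should be confirmed by measurement.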

It would also be interesting to compare some non-default implementations, such as Multik or DirectByteBuffers.
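For the direct-buffer variant, a comparison candidate could look roughly like the following sketch, which keeps a column of doubles off-heap using `java.nio`. `DirectDoubleColumn` is a hypothetical class written for this comment, not something that exists in DataFrame or in this PR.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.DoubleBuffer

/** Hypothetical off-heap column of doubles backed by a direct ByteBuffer. */
class DirectDoubleColumn(val size: Int, initializer: (Int) -> Double) {
    // Allocate size * 8 bytes outside the Java heap and view them as doubles.
    private val buffer: DoubleBuffer = ByteBuffer
        .allocateDirect(size * Double.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
        .asDoubleBuffer()

    init {
        // Absolute put: writes the value at index i without moving the buffer position.
        for (i in 0 until size) buffer.put(i, initializer(i))
    }

    operator fun get(index: Int): Double = buffer.get(index)

    /** Sequential scan, the kind of operation a benchmark would time. */
    fun sum(): Double {
        var total = 0.0
        for (i in 0 until size) total += buffer.get(i)
        return total
    }
}
```

A benchmark could then time `sum()` on this class against the same loop over a `DoubleArray` and a `List<Double>` to see whether off-heap storage pays off for scans.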

Something like this:

import org.jetbrains.kotlinx.dataframe.DataFrame
// The wildcard import already covers filter, groupBy, sortBy, mean, and dataFrameOf.
import org.jetbrains.kotlinx.dataframe.api.*
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
open class DataFrameBenchmark {

    @Param("1", "2", "5", "10")
    var columnCount: Int = 0

    private lateinit var df: DataFrame<*>

    @Setup(Level.Trial)
    fun setup() {
        df = createDataFrame(columnCount, 1000000)
    }

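    // Builds `columnCount` columns named col1..colN, each holding `rowCount` random doubles.
    private fun createDataFrame(columnCount: Int, rowCount: Int): DataFrame<*> {
        // dataFrameOf takes name-to-List pairs, so the values are built as a List rather than a DoubleArray.
        val columns = (1..columnCount).map { "col$it" to List(rowCount) { Math.random() } }
        return dataFrameOf(*columns.toTypedArray())
    }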

    @Benchmark
    fun filter(): DataFrame<*> {
        // Inside a row expression, "col1"<Double>() reads the typed cell value of the current row.
        return df.filter { "col1"<Double>() > 0.5 }
    }

    @Benchmark
    fun groupBy(): DataFrame<*> {
        return df.groupBy("col1").mean()
    }

    @Benchmark
    fun sortBy(): DataFrame<*> {
        return df.sortBy("col1")
    }
}
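
// build.gradle.kts
plugins {
    kotlin("jvm") version "---"
    // Assumption: the JMH Gradle plugin supplies the `jmh` task and the src/jmh/kotlin source set.
    id("me.champeau.jmh") version "---"
}

repositories {
    mavenCentral()
}

dependencies {
    // The published artifact ID is `dataframe`, not `kotlinx-dataframe`.
    implementation("org.jetbrains.kotlinx:dataframe:---")
    // The plugin adds jmh-core and generates the harness from bytecode, so the
    // annotationProcessor line (which Kotlin sources would need kapt for) is dropped.
    testImplementation(kotlin("test"))
}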

# Run the benchmarks
./gradlew jmh