Open Jolanrensen opened 4 weeks ago
Very interesting idea and a step forward for performance. I suggest starting with a synthetically generated DataFrame with 1–10 columns of Ints, Longs, or similar (preferably all of the same type), and measuring the average time and memory footprint of some performance-sensitive operations before going deep into the implementation.
I can see how we could save memory, but I'm not sure about the speed of operations.
It would also be interesting to compare some non-default implementations, such as Multik or direct ByteBuffers.
Something like this:
    import org.jetbrains.kotlinx.dataframe.DataFrame
    import org.jetbrains.kotlinx.dataframe.api.* // dataFrameOf, toColumn, filter, groupBy, mean, sortBy
    import org.openjdk.jmh.annotations.*
    import java.util.concurrent.TimeUnit

    @State(Scope.Benchmark)
    @Fork(1)
    @Warmup(iterations = 5)
    @Measurement(iterations = 10)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    open class DataFrameBenchmark {

        // JMH runs every benchmark once per listed column count.
        @Param("1", "2", "5", "10")
        var columnCount: Int = 0

        private lateinit var df: DataFrame<*>

        // Build the data once per trial so generation time is not measured.
        @Setup(Level.Trial)
        fun setup() {
            df = createDataFrame(columnCount, 1_000_000)
        }

        // Creates columns "col1".."colN", each filled with random doubles in [0, 1).
        private fun createDataFrame(columnCount: Int, rowCount: Int): DataFrame<*> {
            val columns = (1..columnCount).map { i ->
                List(rowCount) { Math.random() }.toColumn("col$i")
            }
            return dataFrameOf(columns)
        }

        // Returning the result keeps the JIT from eliminating the work as dead code.
        @Benchmark
        fun filter(): DataFrame<*> = df.filter { "col1"<Double>() > 0.5 }

        // Note: grouping by random doubles yields close to one group per row.
        @Benchmark
        fun groupBy(): DataFrame<*> = df.groupBy("col1").mean()

        @Benchmark
        fun sortBy(): DataFrame<*> = df.sortBy("col1")
    }
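    // build.gradle.kts
    plugins {
        kotlin("jvm") version "---"
        // Assumption: the `jmh` task used below is provided by the me.champeau.jmh plugin,
        // which expects benchmark sources under src/jmh.
        id("me.champeau.jmh") version "---"
    }

    repositories {
        mavenCentral()
    }

    dependencies {
        implementation("org.jetbrains.kotlinx:kotlinx-dataframe:---")
        implementation("org.openjdk.jmh:jmh-core:---")
        // annotationProcessor only processes Java sources; Kotlin benchmarks rely on
        // the plugin's own benchmark generation (or on kapt) instead.
        annotationProcessor("org.openjdk.jmh:jmh-generator-annprocess:---")
        testImplementation(kotlin("test"))
    }

    ./gradlew jmh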
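For the storage side of that comparison, here is a minimal sketch of the three layouts mentioned above: a boxed `List<Double>`, a primitive `DoubleArray`, and a direct `ByteBuffer`. All function names are hypothetical and exist only for illustration; none of them come from the DataFrame library, and this is not how the PR stores data.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helpers for illustration only; not part of the DataFrame API.

/** Boxed storage: every element is a separate Double object reached through a reference. */
fun boxedColumn(n: Int): List<Double> = List(n) { it * 0.5 }

/** Primitive storage: one contiguous on-heap array of 8-byte doubles. */
fun primitiveColumn(n: Int): DoubleArray = DoubleArray(n) { it * 0.5 }

/** Off-heap storage: the same 8 bytes per value, allocated outside the garbage-collected heap. */
fun directColumn(n: Int): ByteBuffer {
    val buffer = ByteBuffer.allocateDirect(n * Double.SIZE_BYTES).order(ByteOrder.nativeOrder())
    for (i in 0 until n) buffer.putDouble(i * Double.SIZE_BYTES, i * 0.5)
    return buffer
}

/** Sums a buffer written by [directColumn] using absolute (position-independent) reads. */
fun sumDirect(buffer: ByteBuffer): Double {
    var total = 0.0
    val count = buffer.capacity() / Double.SIZE_BYTES
    for (i in 0 until count) total += buffer.getDouble(i * Double.SIZE_BYTES)
    return total
}

fun main() {
    val n = 1_000
    // All three layouts hold the same values, so the sums agree.
    println(boxedColumn(n).sum())
    println(primitiveColumn(n).sum())
    println(sumDirect(directColumn(n)))
}
```

Each of these could back its own `@Benchmark` method in the class above, so that time and allocation can be compared per layout rather than guessed.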
Fixes https://github.com/Kotlin/dataframe/issues/30, one of our oldest issues.
I introduced `ColumnDataHolder` to replace the `List` in `DataColumnImpl`. This interface can define how the data of columns is stored. `ColumnDataHolderImpl` was created as the default implementation, and it defaults to storing data in primitive arrays whenever possible. Other implementations might be possible in the future as well (to make DF act on top of an existing DB, for instance).
Things to be done:
- `ColumnDataHolder`s directly wherever possible instead of `List`s.
- `DataColumnImpl.type` mismatches `DataColumnImpl.values`: https://github.com/Kotlin/dataframe/issues/713
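To illustrate the idea of a holder that falls back to primitive arrays whenever possible, here is a rough sketch. Only the name `ColumnDataHolder` comes from the description above. The members, the `usesPrimitiveArray` flag, the holder classes, and the `columnDataHolderOf` factory are assumptions made for this example and are not the PR's actual code.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical sketch: member names and the factory are assumptions, not the PR's API.
interface ColumnDataHolder<T> {
    val size: Int
    val usesPrimitiveArray: Boolean
    operator fun get(index: Int): T
    fun toList(): List<T>
}

private class IntArrayHolder(private val data: IntArray) : ColumnDataHolder<Int> {
    override val size: Int get() = data.size
    override val usesPrimitiveArray: Boolean get() = true
    override fun get(index: Int): Int = data[index]
    override fun toList(): List<Int> = data.asList()
}

private class DoubleArrayHolder(private val data: DoubleArray) : ColumnDataHolder<Double> {
    override val size: Int get() = data.size
    override val usesPrimitiveArray: Boolean get() = true
    override fun get(index: Int): Double = data[index]
    override fun toList(): List<Double> = data.asList()
}

// Fallback for nullable or non-primitive element types.
private class ListHolder<T>(private val data: List<T>) : ColumnDataHolder<T> {
    override val size: Int get() = data.size
    override val usesPrimitiveArray: Boolean get() = false
    override fun get(index: Int): T = data[index]
    override fun toList(): List<T> = data
}

/** Picks primitive storage when the declared type is a non-null Int or Double. */
@Suppress("UNCHECKED_CAST")
fun <T> columnDataHolderOf(values: List<T>, type: KType): ColumnDataHolder<T> = when (type) {
    typeOf<Int>() -> IntArrayHolder((values as List<Int>).toIntArray()) as ColumnDataHolder<T>
    typeOf<Double>() -> DoubleArrayHolder((values as List<Double>).toDoubleArray()) as ColumnDataHolder<T>
    else -> ListHolder(values)
}

/** Convenience overload that derives the type from the reified type argument. */
inline fun <reified T> columnDataHolderOf(values: List<T>): ColumnDataHolder<T> =
    columnDataHolderOf(values, typeOf<T>())

fun main() {
    val ints = columnDataHolderOf(listOf(1, 2, 3))            // element type Int
    val nullableInts = columnDataHolderOf(listOf(1, null, 3)) // element type Int?
    println(ints.usesPrimitiveArray)
    println(nullableInts.usesPrimitiveArray)
}
```

In this sketch the declared `KType` drives the storage choice, so a nullable column stays in a `List`. It also shows why a declared type that disagrees with the stored values would matter: the unchecked casts in the factory trust the type entirely.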