Research: `ColumnDataHolder`/primitive arrays

Fixes https://github.com/Kotlin/dataframe/issues/30, one of our oldest issues.

I introduced `ColumnDataHolder` to replace the `List` in `DataColumnImpl`. This interface defines how a column's data is stored. `ColumnDataHolderImpl` is the default implementation; it stores data in primitive arrays whenever possible. Other implementations may follow in the future (for instance, to let DF act on top of an existing DB).
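To make the idea concrete, here is a minimal sketch of what such a storage abstraction could look like. Only the name `ColumnDataHolder` comes from the description above; the members, the two implementing classes, and the `holderOf` factory are assumptions made for illustration and are not the API in this PR.

```kotlin
// Hypothetical sketch: member names and signatures are assumptions,
// not the actual interface introduced by the PR.
interface ColumnDataHolder<T> {
    val size: Int
    operator fun get(index: Int): T
    fun toList(): List<T>
}

// Stores Double values unboxed in a DoubleArray.
class DoubleArrayHolder(private val data: DoubleArray) : ColumnDataHolder<Double> {
    override val size: Int get() = data.size
    override operator fun get(index: Int): Double = data[index] // boxes only on access
    override fun toList(): List<Double> = data.asList()         // list view, no copy
}

// Fallback for element types without a primitive array representation.
class ListHolder<T>(private val data: List<T>) : ColumnDataHolder<T> {
    override val size: Int get() = data.size
    override operator fun get(index: Int): T = data[index]
    override fun toList(): List<T> = data
}

// Picks primitive storage only when every element is a non-null Double.
@Suppress("UNCHECKED_CAST")
fun <T> holderOf(values: List<T>): ColumnDataHolder<T> =
    if (values.isNotEmpty() && values.all { it is Double }) {
        DoubleArrayHolder(DoubleArray(values.size) { values[it] as Double }) as ColumnDataHolder<T>
    } else {
        ListHolder(values)
    }
```

In this sketch, a column with nulls or mixed types falls back to the `List`-backed holder, which is one simple way to read "whenever possible".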

Things to be done:

- [ ] Let data sources create `ColumnDataHolder`s directly wherever possible instead of `List`s.
- [ ] Fix cases where `DataColumnImpl.type` mismatches `DataColumnImpl.values`: https://github.com/Kotlin/dataframe/issues/713
- [ ] Test performance/memory differences
- [ ] Improve API
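For the type/values mismatch item, a check along the following lines could help locate offending columns. This is only a sketch: `mismatchedIndices` is a hypothetical helper, it works on a `KClass` plus a nullability flag rather than a full `KType`, and it says nothing about how issue 713 is actually resolved.

```kotlin
import kotlin.reflect.KClass

// Hypothetical helper, not part of the DataFrame API: returns the indices of
// values that do not fit the declared element class and nullability.
fun mismatchedIndices(declared: KClass<*>, nullable: Boolean, values: List<Any?>): List<Int> =
    values.indices.filter { i ->
        val v = values[i]
        // javaObjectType maps Int::class to java.lang.Integer, so boxed values match.
        if (v == null) !nullable else !declared.javaObjectType.isInstance(v)
    }
```

An empty result means every value is consistent with the declared class; generic type arguments are not checked.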

Very interesting idea, and a step forward for performance. I suggest starting with a synthetically generated DataFrame with 1–10 columns of Ints, Longs, or similar (preferably all of the same type) and measuring the average time and memory footprint of some performance-critical operations before investing in a deep implementation.

I see how we could save memory, but I'm not sure about the speed of operations.
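The memory side can be estimated with simple arithmetic. The sketch below compares a `DoubleArray` with an array of boxed `Double`s; the default sizes (16-byte array header, 4-byte references, 24-byte boxed `Double`) are assumptions for a 64-bit HotSpot JVM with compressed references and will differ on other JVMs or with other flags.

```kotlin
// Rough per-column estimates. All default sizes are assumptions
// (64-bit HotSpot, compressed references), not measured values.
fun primitiveDoubleBytes(rows: Long, arrayHeader: Long = 16): Long =
    arrayHeader + rows * 8

fun boxedDoubleBytes(
    rows: Long,
    arrayHeader: Long = 16,
    referenceBytes: Long = 4,
    boxedObjectBytes: Long = 24
): Long = arrayHeader + rows * (referenceBytes + boxedObjectBytes)
```

Under these assumptions, one million rows take about 8 MB unboxed versus about 28 MB boxed. An actual measurement would still be needed to confirm the figures for a real column.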

It would also be interesting to compare some non-default implementations, such as Multik or direct ByteBuffers.
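As a starting point for the direct-buffer comparison, off-heap storage for a column of doubles can be sketched with `java.nio` alone. The class `DirectDoubleStorage` and its members are hypothetical and not part of DataFrame; only the `ByteBuffer`/`DoubleBuffer` calls are standard JDK API.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.DoubleBuffer

// Hypothetical off-heap storage for a fixed-size column of doubles.
class DirectDoubleStorage(val size: Int) {
    // Allocated outside the Java heap; native byte order avoids byte swapping.
    // Note: size * 8 must fit in an Int, so this caps out near 268 million rows.
    private val buffer: DoubleBuffer = ByteBuffer
        .allocateDirect(size * Double.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
        .asDoubleBuffer()

    operator fun get(index: Int): Double = buffer.get(index)

    operator fun set(index: Int, value: Double) {
        buffer.put(index, value)
    }

    // Simple full scan, usable as a baseline operation in a benchmark.
    fun sum(): Double {
        var total = 0.0
        for (i in 0 until size) total += buffer.get(i)
        return total
    }
}
```

A benchmark could then run the same scan over this class and over a plain `DoubleArray` to see whether off-heap access costs or gains anything.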

Something like this:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
open class DataFrameBenchmark {

    // Number of Double columns in the generated DataFrame.
    @Param("1", "2", "5", "10")
    var columnCount: Int = 0

    private lateinit var df: DataFrame<*>

    // Built once per trial so that generation time is not measured.
    @Setup(Level.Trial)
    fun setup() {
        df = createDataFrame(columnCount, 1_000_000)
    }

    private fun createDataFrame(columnCount: Int, rowCount: Int): DataFrame<*> {
        // toColumn is defined on Iterable, so the values go through a List
        // rather than a DoubleArray.
        val columns = (1..columnCount).map { i ->
            List(rowCount) { Math.random() }.toColumn("col$i")
        }
        return dataFrameOf(columns)
    }

    @Benchmark
    fun filter(): DataFrame<*> {
        // Row values accessed by name are untyped, hence the cast.
        return df.filter { (it["col1"] as Double) > 0.5 }
    }

    @Benchmark
    fun groupBy(): DataFrame<*> {
        // Random doubles give roughly one group per row; a low-cardinality
        // key column would be a more representative grouping workload.
        return df.groupBy("col1").mean()
    }

    @Benchmark
    fun sortBy(): DataFrame<*> {
        return df.sortBy("col1")
    }
}

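// build.gradle.kts
plugins {
    kotlin("jvm") version "---"
    // Provides the `jmh` task and the src/jmh source set; without a JMH
    // plugin, `./gradlew jmh` has nothing to run.
    id("me.champeau.jmh") version "---"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:---")
    // Benchmark-only dependency. The plugin generates the JMH harness from
    // compiled bytecode, so no annotation processor is needed for Kotlin.
    jmh("org.openjdk.jmh:jmh-core:---")
    testImplementation(kotlin("test"))
}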

./gradlew jmh

Research: `ColumnDataHolder`/primitive arrays #712