Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

`type: KType` in `DataColumnImpl` mismatches actual values sometimes #713

Open Jolanrensen opened 4 weeks ago

Jolanrensen commented 4 weeks ago

`type: KType` in `DataColumnImpl` mismatches the actual values in some cases. This can cause runtime exceptions, and it makes it difficult to fix https://github.com/Kotlin/dataframe/issues/30 or https://github.com/Kotlin/dataframe/issues/704, where we assume the type always correctly represents the data. This issue also relates to https://github.com/Kotlin/dataframe/issues/701.

To discover these bugs, we can introduce a (debug-only!!) check in DataColumnImpl, like:

private infix fun <T> T?.matches(type: KType) =
    when {
        this == null -> type.isMarkedNullable
        this.isPrimitiveArray -> type.isPrimitiveArray &&
            this!!::class.qualifiedName == type.classifier?.let { (it as KClass<*>).qualifiedName }

        this.isArray -> type.isArray // cannot check the precise type of array
        else -> this!!::class.isSubclassOf(type.classifier as KClass<*>)
    }

init {
    if (DEBUG) {
        require(values.all { it matches type }) {
            val types = values.map { if (it == null) "Nothing?" else it!!::class.simpleName }.distinct()
            "Values of column '$name' have types '$types' which are not compatible with the given column type '$type'"
        }
    }
}
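
To make the idea behind the check easier to try outside the library, here is a minimal standalone sketch. It assumes Kotlin 1.6+ on the JVM (for `typeOf` and `KClass.isInstance` from the standard library). `SimpleColumn`, `valueMatches`, and `mismatches` are hypothetical names for illustration only, not part of DataFrame, and the sketch deliberately skips the array handling from the snippet above.

```kotlin
import kotlin.reflect.KClass
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical stand-in for a column: a declared type plus raw values.
// This is NOT the library's DataColumnImpl.
class SimpleColumn(val name: String, val type: KType, val values: List<Any?>)

// Returns true when a single value is compatible with the declared type.
fun valueMatches(value: Any?, type: KType): Boolean {
    // A null is only allowed when the declared type is nullable.
    if (value == null) return type.isMarkedNullable
    // If the classifier is a type parameter rather than a class, we cannot check it here.
    val klass = type.classifier as? KClass<*> ?: return true
    return klass.isInstance(value)
}

// Collects the values whose runtime class contradicts the declared column type.
fun SimpleColumn.mismatches(): List<Any?> =
    values.filterNot { valueMatches(it, type) }

fun main() {
    val ok = SimpleColumn("age", typeOf<Int?>(), listOf(1, null, 3))
    val bad = SimpleColumn("flag", typeOf<Boolean>(), listOf(true, 0, null))
    println("${ok.name}: ${ok.mismatches()}")
    println("${bad.name}: ${bad.mismatches()}")
}
```

Reporting the offending values, rather than only failing, may make it easier to trace which operation produced the wrong type.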

At the moment of testing, I can find 8+ breaking tests in :core:

Edit: running it afresh (clean pull of master with check) I get 15 failing tests.

There is also an exception in :dataframe-jdbc: https://github.com/Kotlin/dataframe/issues/701

Jolanrensen commented 3 weeks ago

7/15 tests are fixed by https://github.com/Kotlin/dataframe/issues/727

Next are the pivot tests that end up with both 0 and true in Boolean columns. I suspect this happens because default() takes Any? and the pivot implementation does not re-infer the column types after replacing a null with the given default, as in: df.pivot { city }.groupBy { name }.default(0).min().
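
The suspected mechanism can be illustrated without the library. The sketch below is an assumption about how such a mismatch could arise, not the actual pivot implementation: `Col`, `fillNullsKeepingType`, and `inferType` are hypothetical names, and the inference is deliberately crude (it only knows Boolean and Int).

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical column model: a declared type plus raw values.
data class Col(val type: KType, val values: List<Any?>)

// Naive fill: replaces nulls with the default but keeps the old declared type.
fun Col.fillNullsKeepingType(default: Any?): Col =
    copy(values = values.map { it ?: default })

// Crude type re-inference for this sketch: Boolean, Int, or a fallback to Any.
fun inferType(values: List<Any?>): KType {
    val nullable = values.any { it == null }
    val nonNull = values.filterNotNull()
    return when {
        nonNull.isNotEmpty() && nonNull.all { it is Boolean } ->
            if (nullable) typeOf<Boolean?>() else typeOf<Boolean>()
        nonNull.isNotEmpty() && nonNull.all { it is Int } ->
            if (nullable) typeOf<Int?>() else typeOf<Int>()
        else -> if (nullable) typeOf<Any?>() else typeOf<Any>()
    }
}

fun main() {
    val pivoted = Col(typeOf<Boolean?>(), listOf(true, null, false))

    // After the fill, the values contain an Int but the type still claims Boolean?.
    val filled = pivoted.fillNullsKeepingType(0)
    println(filled)

    // Re-inferring the type from the values restores consistency.
    val fixed = filled.copy(type = inferType(filled.values))
    println(fixed)
}
```

In this sketch, re-inference widens the column to Any once a Boolean and an Int are mixed; whether the library should widen or reject such a default is a separate design question.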

Jolanrensen commented 2 weeks ago

After merging https://github.com/Kotlin/dataframe/issues/713 there are just 5 failing tests left.