Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
761 stars 48 forks source link

`df.describe().describe()` crashes #724

Closed cmelchior closed 3 weeks ago

cmelchior commented 3 weeks ago

Ran into this by accident while testing AI Assistant.

It looks like the output of df.describe() creates a DataFrame with some invalid types. I managed to reproduce crashes in two cases:

%use dataframe
val df = DataFrame.read("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")
df.describe().describe()

// Output in Notebook
null
java.lang.ClassCastException

I also reproduced this in a unit test:

    @Test
    fun describeTwice() {
        val df = typed.group { age and weight }.into("info").groupBy { city }.toDataFrame()
        val desc = df.describe().describe()
        desc::class shouldBe DataFrame::class
    }

Which crashed with:

class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
    at java.base/java.lang.String.compareTo(String.java:140)
    at kotlin.sequences.SequencesKt___SequencesKt.minOrNull(_Sequences.kt:2097)
    at org.jetbrains.kotlinx.dataframe.api.MinKt.minOrNull(min.kt:27)
    at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1$20.invoke(describe.kt:79)
    at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1$20.invoke(describe.kt:79)
    at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1.invoke(describe.kt:183)
    at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1.invoke(describe.kt:62)
    at org.jetbrains.kotlinx.dataframe.impl.api.ToDataFrameKt.createDataFrameImpl(toDataFrame.kt:155)
    at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt.describeImpl(describe.kt:113)
    at org.jetbrains.kotlinx.dataframe.api.DescribeKt.describe(describe.kt:45)
    at org.jetbrains.kotlinx.dataframe.api.DescribeKt.describe(describe.kt:42)
    at org.jetbrains.kotlinx.dataframe.testSets.person.DataFrameTests.describeTwice(DataFrameTests.kt:2364)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:112)
    at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
    at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:40)
    at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:60)
    at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:52)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
    at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
    at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
    at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
    at jdk.proxy1/jdk.proxy1.$Proxy2.processTestClass(Unknown Source)
    at org.gradle.api.internal.tasks.testing.worker.TestWorker$2.run(TestWorker.java:176)
    at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:129)
    at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:100)
    at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:60)
    at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
    at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:113)
    at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
    at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
    at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Jolanrensen commented 3 weeks ago

This is a very interesting bug!

@Test
fun `describe twice 1`() {
    val df = dataFrameOf("a", "b")(1, 2, 3, 4)
    val desc1 = df.describe()
    val desc2 = desc1.describe()
    desc2::class shouldBe DataFrameImpl::class
}

works fine, but

@Test
fun `describe twice 2`() {
    val df = dataFrameOf("a", "b")(1, "foo", 3, "bar")
    val desc1 = df.describe()
    val desc2 = desc1.describe()
    desc2::class shouldBe DataFrameImpl::class
}

breaks.

I suspect this is due to columns being created with type Comparable<*>/Comparable<Nothing> after running describe():

   name:String type:Any count:Int unique:Int nulls:Int top:Comparable<*> freq:Int mean:Double? std:Double? min:Comparable<*> median:Comparable<*> max:Comparable<*>
 0           a      Int         2          2         0                 1        1          2.0    1.414214                 1                    2                 3
 1           b   String         2          2         0               foo        1         null        null               bar                  bar               foo

If you now run another describe() on this table, it will try to find the min of columns like top and compare Int and String. However, these two are incomparable, as we can see by the exception.

Our current implementation only checks if a column AnyCol.isComparable() = isSubtypeOf<Comparable<*>?>(), not whether the type T != Nothing. I'm not sure we can actually.

koperagen commented 3 weeks ago

Our current implementation only checks if a column AnyCol.isComparable() = isSubtypeOf<Comparable<*>?>(), not whether the type T != Nothing. I'm not sure we can actually.

Maybe typeOf<Comparable<Any?>?>() will work as expected here. I suspect this code was written with an assumption that means Any?, but Comparable has in variance and `Comparable<>==Comparableand from type system perspective you can't compare twoComparable` image image

Jolanrensen commented 3 weeks ago

@koperagen thanks for the tip! But unfortunately:

typeOf<Int>().isSubtypeOf(typeOf<Comparable<Any?>>()) == false
typeOf<Int>().isSubtypeOf(typeOf<Comparable<Any>>()) == false

variance is fun :)

Jolanrensen commented 3 weeks ago

It can be fixed like:

/**
 * Returns `true` if [this] column is comparable, i.e. its type is a subtype of [Comparable] and its
 * type argument is not [Nothing].
 */
public fun AnyCol.isComparable(): Boolean = isSubtypeOf<Comparable<*>?>()
    && type().projectTo(Comparable::class).arguments[0].let {
        it != KTypeProjection.STAR &&
            it.type?.isNothing != true
    }

I'll probably make a PR later :)