Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Reading Arrow NullVector #550

Closed Kopilov closed 5 months ago

Kopilov commented 6 months ago

Apache Arrow files may contain NullVector columns, for example when a column filled entirely with nulls is saved by a library or language that has no static types or target schema. With this PR, Kotlin DataFrame reads them correctly instead of crashing. We could also support writing NullVectors; should we?

In addition, Arrow itself is upgraded to the latest stable version (14.0.2), and the #428 problem is fixed for Arrow writing by replacing the original hasNulls function with a custom explicit check.
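
The thread does not show the explicit check itself. As a rough sketch of the idea only, assuming the column values are available as a plain Kotlin collection, an explicit check could scan the values directly instead of trusting a precomputed flag. The helper name `containsNullExplicit` is hypothetical and is not taken from the PR:

```kotlin
// Hypothetical helper, not the PR's actual code: scan the values directly
// instead of relying on a cached or schema-derived "has nulls" flag.
fun containsNullExplicit(values: Iterable<Any?>): Boolean =
    values.any { it == null }

fun main() {
    // A column with a gap and a fully populated column.
    println(containsNullExplicit(listOf(1, null, 3)))
    println(containsNullExplicit(listOf("a", "b")))
}
```

A full scan costs O(n) per column, which is the usual price for not depending on metadata that may be out of date.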

Kopilov commented 5 months ago

Merge branch 'master' into NullVector

Should we apply git rebase instead (to avoid spaghetti-like history)?

Kopilov commented 5 months ago

Rebased

Jolanrensen commented 5 months ago

Thanks for the help! I'll run the CI and merge :)

Jolanrensen commented 5 months ago

@Kopilov Looks like the test org.jetbrains.kotlinx.dataframe.io.ArrowKtTest.testReadingAllTypesAsEstimatedNotNullableWithNulls now fails:

org.junit.ComparisonFailure: expected:<kotlin.Nothing?> but was:<kotlin.Nothing> expected:<kotlin.Nothing[?]> but was:<kotlin.Nothing[]>
  at org.jetbrains.kotlinx.dataframe.io.ExampleEstimatesAssertionsKt.assertEstimations(exampleEstimatesAssertions.kt:163)
  at org.jetbrains.kotlinx.dataframe.io.ArrowKtTest.testReadingAllTypesAsEstimatedNotNullableWithNulls(ArrowKtTest.kt:221)

This probably means we need my entire solution with NullabilityOptions after all... The tests are fine if I use this:

...

@JvmName("withTypeNullableNothingList")
private fun List<Nothing?>.withTypeNullable(
    expectedNulls: Boolean,
    nullabilityOptions: NullabilityOptions,
): Pair<List<Nothing?>, KType> {
    val nullable = nullabilityOptions.applyNullability(this, expectedNulls)
    return this to nothingType(nullable)
}

and then

is NullVector -> vector.values(range).withTypeNullable(field.isNullable, nullability)
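
To illustrate why the nullability decision matters for a NullVector, here is a simplified, self-contained model of how such an option might decide whether the resulting column type is nullable. `SketchNullability` is a hypothetical stand-in written for this example; its modes and semantics are assumptions and may differ from the library's real NullabilityOptions:

```kotlin
// Simplified stand-in for the library's NullabilityOptions; the names and
// exact semantics here are assumptions made for illustration only.
enum class SketchNullability {
    Infer, Checking, Widening;

    // Returns true when the resulting column type should be nullable.
    fun applyNullability(data: List<Any?>, expectedNulls: Boolean): Boolean {
        val hasNulls = data.any { it == null }
        return when (this) {
            // Ignore the schema and look only at the actual values.
            Infer -> hasNulls
            // Trust the schema, but fail if the data contradicts it.
            Checking -> {
                require(expectedNulls || !hasNulls) {
                    "Nulls found in a column declared as not nullable"
                }
                expectedNulls
            }
            // Trust the schema, but widen to nullable if nulls are present.
            Widening -> expectedNulls || hasNulls
        }
    }
}

fun main() {
    // A NullVector holds only nulls, even if the field is declared not nullable.
    val allNulls: List<Nothing?> = listOf(null, null, null)
    for (mode in SketchNullability.values()) {
        val result = runCatching { mode.applyNullability(allNulls, expectedNulls = false) }
        println("$mode -> $result")
    }
}
```

Under this model, a field declared not nullable that actually holds only nulls ends up nullable in the inferring and widening modes. That matches the `kotlin.Nothing?` the failing test expected where the code produced `kotlin.Nothing`.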

Kopilov commented 5 months ago

@Jolanrensen applied, thanks