Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

DataColumn nullability in JDBC #541

Closed koperagen closed 2 months ago

koperagen commented 7 months ago

I'd argue that KType nullability should always be checked against the actual column values, which is what infer = Infer.Nulls does: https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/readJdbc.kt#L597

My reasoning is mostly related to notebooks.

Pros: you don't have to handle nullable values if the given snapshot doesn't contain any. This is very convenient if you just want to work with a specific fragment of the data.

Cons: imagine you want to rerun the same notebook, but this time the data has nulls. Now you have to modify your code to handle them, or you get a compilation error.

So the desirable behavior varies with the use case: exploring data once vs. reusing a notebook. My suggestion: to support reusable notebooks, the JDBC integration should have a method to import a data schema from the DB schema, the same way the OpenAPI support does.

Things to consider here: it's already possible to write (or generate and edit) a data schema to rerun notebooks without problems. Other operations already work this way: add, convert and other functions create a nullable KType only if there are nulls, and so do other data sources (this was discussed in the context of Arrow, with an additional argument about KType nullability: https://github.com/Kotlin/dataframe/issues/428).

public inline fun <reified R, T> DataFrame<T>.add(
    name: String,
    noinline expression: AddExpression<T, R>
): DataFrame<T> = add(name, Infer.Nulls, expression)

koperagen commented 7 months ago

Simple idea: add an Infer parameter to the readJdbc functions so people can decide whether nullability comes from the schema (Infer.None) or from the values (Infer.Nulls).
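
To make the trade-off concrete, here is a minimal sketch in plain Kotlin of the two ways a column's nullability could be decided. It does not use the DataFrame library: NullabilitySource and isColumnNullable are hypothetical names made up for illustration, standing in for Infer.None (trust the DB schema) and Infer.Nulls (check the values).

```kotlin
// Hypothetical stand-in for the two modes discussed above; this is NOT the library's Infer enum.
enum class NullabilitySource { SCHEMA, VALUES }

// Decide whether a column's type should be marked nullable.
// declaredNullable: what the DB schema says about the column.
// values: the rows actually read in this snapshot.
fun isColumnNullable(
    declaredNullable: Boolean,
    values: List<Any?>,
    source: NullabilitySource,
): Boolean = when (source) {
    // Trust the schema: stable across reruns, but forces null handling even for clean data.
    NullabilitySource.SCHEMA -> declaredNullable
    // Check the data: convenient for one-off exploration, but can change between reruns.
    NullabilitySource.VALUES -> values.any { it == null }
}

fun main() {
    // A column declared nullable in the DB whose current snapshot has no nulls.
    val snapshot = listOf<Any?>("Alice", "Bob")
    println(isColumnNullable(true, snapshot, NullabilitySource.SCHEMA))
    println(isColumnNullable(true, snapshot, NullabilitySource.VALUES))
}
```

With the schema as the source the column stays nullable regardless of the snapshot, so notebook code keeps compiling on reruns; with the values as the source the same column is non-nullable until a null shows up in the data.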