Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Column `hasNulls` property is actually isNullable #428

Open Kopilov opened 1 year ago

Kopilov commented 1 year ago

Hello,

If I have a dataframe with some column and want to check whether it contains null values, I use the hasNulls property.

    val nullable = DataColumn.create("nullable", listOf("a" as String?, "b" as String?, "c" as String?))
    println(nullable.hasNulls)

expected: false actual: true

On the other hand, a column can be marked as not nullable but still contain null values (this behavior might be a critical issue in itself):

    import kotlin.reflect.full.withNullability
    import kotlin.reflect.typeOf

    val notNullable = DataColumn.create(
        "notNullable",
        listOf("a" as String?, "b" as String?, "c" as String?, null as String?),
        typeOf<String>().withNullability(false)
    )
    println(notNullable.hasNulls)

expected: true (or an IllegalArgumentException) actual: false

koperagen commented 1 year ago

Hi! It does indeed seem confusing, but at the same time it can be justified. DataColumn.create tries to be an efficient constructor; that's why it relies on the supplied reified type parameter by default and doesn't check anything. It does have an infer parameter to match nullability / type to the data, though. Maybe it would make sense to change its default behaviour to Infer.Nulls, and perhaps to introduce another function with explicit naming, like createWithoutTypeInference.
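The Infer.Nulls behaviour mentioned above can be illustrated with a stdlib-only sketch. The Infer enum here is a hypothetical stand-in for the library's, and effectiveHasNulls is an illustrative helper, not library API:

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical stand-in for the library's Infer options discussed above.
enum class Infer { None, Nulls }

// With Infer.None the declared type is trusted as-is (today's default);
// with Infer.Nulls the nullability flag is recomputed from the actual values.
fun <T> effectiveHasNulls(values: List<T?>, declared: KType, infer: Infer): Boolean =
    when (infer) {
        Infer.None -> declared.isMarkedNullable
        Infer.Nulls -> values.any { it == null }
    }

fun main() {
    val values = listOf("a", "b", "c")   // no actual nulls
    val declared = typeOf<String?>()     // but the declared type is nullable
    println(effectiveHasNulls(values, declared, Infer.None))   // true
    println(effectiveHasNulls(values, declared, Infer.Nulls))  // false
}
```

This is the gap the issue describes: with no inference, hasNulls is really "is the declared type nullable".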

Kopilov commented 1 year ago

The DataColumn.create behavior can be justified, but what about hasNulls? This issue is primarily about hasNulls, regardless of the data's origin.

In real life, I want to save a nullable (but null-free) column as not nullable in Arrow, and I get an unexpected NullValueError.

Jolanrensen commented 1 year ago

I agree with your change for hasNulls(): it should only return true if there are actually nulls in the column. But should we also have a canHaveNulls()? Because that sounds like useful behavior.
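The distinction suggested above can be sketched as two separate properties. Names and shape are illustrative only, not the library API:

```kotlin
// Illustrative sketch: hasNulls answers "are there nulls in the data?",
// while canHaveNulls answers "does the declared schema permit nulls?".
class SimpleColumn<T>(private val values: List<T?>, private val nullableType: Boolean) {
    val hasNulls: Boolean get() = values.any { it == null }  // actual data
    val canHaveNulls: Boolean get() = nullableType           // declared schema
}

fun main() {
    val col = SimpleColumn(listOf("a", "b", "c"), nullableType = true)
    println(col.hasNulls)      // false: no nulls in the data
    println(col.canHaveNulls)  // true: the schema permits them
}
```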

koperagen commented 1 year ago

I still see it as a DataColumn.create problem; maybe I'm missing something. It's DataColumn.create's responsibility to match the actual values with the type and nullability, and hasNulls only looks at the type supplied to the constructor. So wouldn't DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw) solve this problem? That is, if the inferred type is String? but the values are of type String, then throw an exception.

I.e., it's expected that whenever someone creates a data column, they supply the exact type (as if Infer.Type were passed); then hasNulls == isNullable.
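The Infer.Throw idea above amounts to a validating constructor. A stdlib-only sketch, where createStrict is a hypothetical name rather than the library API:

```kotlin
// Hypothetical validating factory: rejects a declared nullability flag that
// does not match the actual data, as the proposed Infer.Throw would.
fun <T> createStrict(name: String, values: List<T?>, declaredNullable: Boolean): List<T?> {
    val actuallyHasNulls = values.any { it == null }
    require(declaredNullable == actuallyHasNulls) {
        "Column '$name': declared nullable=$declaredNullable, but data hasNulls=$actuallyHasNulls"
    }
    return values
}
```

With such a check, a column declared String? over null-free data would fail fast at creation instead of reporting hasNulls == true later.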

Jolanrensen commented 1 year ago

> I still see it as a DataColumn.create problem; maybe I'm missing something. It's DataColumn.create's responsibility to match the actual values with the type and nullability, and hasNulls only looks at the type supplied to the constructor. So wouldn't DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw) solve this problem? That is, if the inferred type is String? but the values are of type String, then throw an exception.
>
> I.e., it's expected that whenever someone creates a data column, they supply the exact type (as if Infer.Type were passed); then hasNulls == isNullable.

Yes, but when you're generating type schemas from, say, OpenAPI, this is not always the case. The column is nullable, since it can contain null values, but it does not necessarily contain them. Same as a List<Int?>. Your logic works if the data is static, but if you want to reuse your logic for data that might sometimes be nullable, then hasNulls != isNullable.

koperagen commented 1 year ago

A DataColumn is created from known data and is immutable. There is no reason for it (the DataColumn instance itself!) to have a nullable type when created if there are no actual null values. You can still cast a dataframe that contains it to a schema with a nullable property. From a type-system perspective that's fine.
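The point about casting can be shown with plain Kotlin generics: because List is covariant, null-free column data typed List<String> is always usable where a nullable schema List<String?> is expected. A stdlib-only sketch:

```kotlin
// A consumer that accepts a column under a nullable schema.
fun describe(values: List<String?>): String =
    if (values.any { it == null }) "nullable data" else "no nulls present"

fun main() {
    val strict: List<String> = listOf("a", "b", "c")  // non-null column data
    // Widening List<String> to List<String?> is safe and needs no cast.
    println(describe(strict))  // no nulls present
}
```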

Kopilov commented 1 year ago

@koperagen thank you for your opinion.

Currently I have a workaround in my project: I rewrap all columns with Infer.Type before saving. (At the same time, I replace blank strings with nulls if the target type is not String but a number, and so on.)
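That workaround can be sketched with stdlib code alone. cleanNumericColumn is an illustrative helper (the real code would rewrap the DataColumn with Infer.Type), and it assumes the non-blank entries are valid integers:

```kotlin
// Illustrative cleanup pass: blank strings become nulls for a numeric target
// type, and the column's nullability is recomputed from the cleaned values.
fun cleanNumericColumn(raw: List<String?>): Pair<List<Int?>, Boolean> {
    val cleaned = raw.map { s -> s?.takeIf { it.isNotBlank() }?.toInt() }
    val hasNulls = cleaned.any { it == null }  // recomputed, as Infer.Type would
    return cleaned to hasNulls
}

fun main() {
    println(cleanNumericColumn(listOf("1", "", "3")))  // ([1, null, 3], true)
    println(cleanNumericColumn(listOf("1", "2")))      // ([1, 2], false)
}
```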

koperagen commented 1 year ago

Initially I thought you were calling DataColumn.create yourself from your code; it seems I misunderstood the problem. Do you know where this column is created?

Kopilov commented 1 year ago

Currently it is created by the DataFrame.Companion.readArrowFeather method, which keeps the original Arrow schema (including nullability). The Feather file is created in separate software and might be dirty, though cleaner than Excel (which was the main input in a previous step, when I initially implemented the Arrow IO). It could also be any other data import. To make it absolutely clean, I also have to convert some columns to not nullable.

FYI, the current project is BPLEX (Russian only: https://bia-tech.ru/solutions/platforma-dlya-optimizacii-2/), which is middleware for transferring data between a business GUI or DWH and mathematical models, as well as from one model's output to another's input.

Kopilov commented 1 year ago

And I have a common entry point for internal data validation and saving (actually, 90% of its code is the Dataframe ArrowWriter itself).

koperagen commented 1 year ago

So, do you think it can (or should) be fixed on the DataFrame.Companion.readArrowFeather side? I vaguely remember discussing the nullability of columns when reading Arrow files, but not in much detail.

Kopilov commented 1 year ago

It should not, in my opinion.

Kotlin Dataframe is a pretty good library, and its default behavior looks good for data science. But it can also be used (and actually is used, at least in BPLEX :) ) for data engineering, and in that case static schemas look better. Arrow is also a data engineering tool, and it should keep its approach aligned with Kotlin Dataframe, IMHO.

Once again, this issue is not about BPLEX (that is my problem) and not about Arrow (that is just one case), but about the mismatch between the method's name and its actual behavior, regardless of data origin and sink. Please don't kill Dataframe's data engineering potential.

koperagen commented 1 year ago

It's very cool and inspiring that dataframe is used for such tasks successfully :) Regarding the issue, let me come back to it in a week or two. There's a decision to be made about how this should be done, and I'm not ready to make it now.