Kopilov opened this issue 1 year ago
Hi! It indeed seems confusing, but at the same time it can be justified.
`DataColumn` tries to be an efficient constructor, which is why it relies on the supplied reified type parameter by default and doesn't check anything. It has an `infer` parameter to match nullability / type to the data, though.
Maybe it would make sense to change its default behaviour to `Infer.Nulls`, and maybe introduce another function with explicit naming, like `createWithoutTypeInference`.
`DataColumn.create` behavior can be justified, but what about `hasNulls`? The issue is primarily about `hasNulls`, without regard for data origin.
In real life, I want to save a nullable column (that contains no null values) as not nullable in Arrow, but I get an unexpected `NullValueError`.
I agree with your change for `hasNulls()`: it should only return `true` if there are actually nulls in the column. But do we also have a `canHaveNulls()`? Because that sounds like useful behavior.
I still see it as a `DataColumn.create` problem, maybe I'm missing something. It's `DataColumn.create`'s responsibility to match actual values with type and nullability, and `hasNulls` only looks at the type supplied to the constructor. So, wouldn't `DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw)` solve this problem? Like, if the inferred type is `String?`, but the values are of type `String`, then throw an exception.
I.e. it's expected that whenever someone creates a data column, they should supply the exact type (as if `Infer.Type` is passed); then `hasNulls == isNullable`.
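To make the proposed `Infer.Throw` semantics concrete, here is a minimal sketch in plain Kotlin. `checkExactType` is a hypothetical stand-in, not the real `DataColumn.create` API: it rejects a column whose declared nullability does not match the actual values, in either direction.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical sketch of the proposed Infer.Throw behavior: the declared
// type must match the data exactly, so hasNulls == isNullable always holds.
fun <T> checkExactType(values: List<T>, declared: KType) {
    val actuallyHasNulls = values.any { it == null }
    require(declared.isMarkedNullable || !actuallyHasNulls) {
        "declared type $declared is not nullable, but the values contain null"
    }
    require(!declared.isMarkedNullable || actuallyHasNulls) {
        "declared type $declared is nullable, but the values contain no nulls"
    }
}

fun main() {
    // Exact match: nullable type declared, nulls actually present
    checkExactType(listOf("a", null), typeOf<String?>())

    // Mismatch: nullable type declared, but no nulls in the data
    runCatching { checkExactType(listOf<String?>("a", "b"), typeOf<String?>()) }
        .onFailure { println("rejected: ${it.message}") }
}
```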
> I still see it as a `DataColumn.create` problem, maybe I'm missing something. It's `DataColumn.create`'s responsibility to match actual values with type and nullability, and `hasNulls` only looks at the type supplied to the constructor. So, wouldn't `DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw)` solve this problem? Like, if the inferred type is `String?`, but the values are of type `String`, then throw an exception. I.e. it's expected that whenever someone creates a data column, they should supply the exact type (as if `Infer.Type` is passed); then `hasNulls == isNullable`.
Yes, but when you're generating type schemas from, say, OpenAPI, this is not always the case. The column is nullable, since it can contain null values, but it does not necessarily need to contain them. Same as a `List<Int?>`. Your logic works if the data is static, but if you want to reuse your logic for data that might be nullable sometimes, then `hasNulls != isNullable`.
DataColumn is created from known data and is immutable. There is no reason for it (the DataColumn instance itself!) to have a nullable type when created if there are no actual null values. You can still cast a dataframe that contains it to a schema with a nullable property. From a type system perspective it's fine.
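The "cast to a nullable schema is always fine" point mirrors ordinary Kotlin covariance; this small sketch (plain Kotlin, no DataFrame API) shows why widening a non-nullable column view to a nullable one is always safe:

```kotlin
// List<String> is a subtype of List<String?> thanks to declaration-site
// covariance, so a column of non-null values can always be viewed through
// a nullable schema without copying or checking anything.
fun main() {
    val strict: List<String> = listOf("a", "b")
    val relaxed: List<String?> = strict   // widening, always safe
    println(relaxed.any { it == null })   // false: still no actual nulls
}
```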
@koperagen thank you for your opinion.
Currently I have made a workaround in my project by rewrapping all columns with `Infer.Type` before saving. (At the same time, I replace blank strings with nulls if the target type is not String but a number, and so on.)
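As a rough illustration of that workaround, here is a toy sketch in plain Kotlin. `ToyColumn` and `reinferNullability` are illustrative stand-ins, not the real Kotlin DataFrame API (the actual project rewraps `DataColumn` instances with `Infer.Type`):

```kotlin
// Toy column with a declared nullability flag, as a stand-in for DataColumn
data class ToyColumn<T>(val name: String, val values: List<T>, val nullable: Boolean)

// Recompute nullability from the actual values instead of trusting the
// declared schema, mimicking what rewrapping with Infer.Type achieves
fun <T> reinferNullability(col: ToyColumn<T>): ToyColumn<T> =
    col.copy(nullable = col.values.any { it == null })

fun main() {
    // Declared nullable by the Arrow schema, but contains no actual nulls
    val col = ToyColumn<String?>("amount", listOf("1", "2", "3"), nullable = true)
    val fixed = reinferNullability(col)
    println(fixed.nullable) // false: now safe to save as non-nullable in Arrow
}
```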
Initially I thought that you call `DataColumn.create` yourself from your code; it seems I misunderstood the problem. Do you know where this column is created?
Currently it is created by the `DataFrame.Companion.readArrowFeather` method, which keeps the original Arrow schema (including nullability). The Feather file is created in separate software and might be dirty, though cleaner than Excel (which was used as the main input in the previous step, when I initially implemented the Arrow IO). It could also be any other data import. To make it absolutely clean, I also have to convert some columns to not nullable.
FYI, the current project is BPLEX (Russian only: https://bia-tech.ru/solutions/platforma-dlya-optimizacii-2/), which is actually middleware for transferring data between a business GUI or DWH and mathematical models, as well as from one model's output to another's input.
And I have a common entry point for internal data validation and saving (actually, 90% of its code is the Dataframe ArrowWriter itself).
So, do you think it can (or should) be fixed on the `DataFrame.Companion.readArrowFeather` side? I vaguely remember discussing nullability of columns when reading Arrow files, but not in much detail.
It should not, in my mind.
Kotlin Dataframe is a pretty good library, and its default behavior looks good for data science. But it can also be used (and is actually used, at least in BPLEX :) ) for data engineering, and in that case static schemas look better. Arrow is also a data engineering tool, and it should keep its approach consistent with Kotlin Dataframe, IMHO.
Once again, this issue is not about BPLEX (that is my problem) and not about Arrow (that is just one case), but about the mismatch between the method name and its actual behavior, without regard for data origin and sink. Please don't kill Dataframe's data engineering opportunity.
It's very cool and inspiring that dataframe is used for such tasks successfully :) Regarding the issue, let me come back to it in a week or two. There's a decision to be made about how it should be done, and I'm not ready to make it now.
Hello!
If I get some dataframe with some column and I want to check whether it contains null values, I use the `hasNulls` property.
Expected: `false`. Actual: `true`.
On the other side, a column can be marked as not nullable but contain null values (this behavior might be a critical issue in itself).
Expected: `true` (or `IllegalArgumentException`). Actual: `false`.
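The two expected/actual pairs above can be sketched with a toy type in plain Kotlin. `SketchColumn` is a hypothetical stand-in, not the real `DataColumn`: `hasNulls` reflects the actual data, while `isNullable` reflects the declared type.

```kotlin
// Toy column: hasNulls is computed from the data, isNullable is declared
data class SketchColumn<T>(val values: List<T>, val isNullable: Boolean) {
    val hasNulls: Boolean get() = values.any { it == null }
}

fun main() {
    // Declared nullable, no actual nulls: hasNulls should be false
    val a = SketchColumn<String?>(listOf("a", "b"), isNullable = true)
    println(a.hasNulls) // false

    // Marked not nullable but containing nulls: hasNulls is true here, and
    // arguably construction should have failed with IllegalArgumentException
    val b = SketchColumn<String?>(listOf("a", null), isNullable = false)
    println(b.hasNulls) // true
}
```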