lr4d opened 4 years ago
This is a type system issue:
We currently allow users to write all-None (NULL for arrow) columns for every payload type, no matter whether the column on the pandas side is/was an object column or not. So the above bug also triggers if you replace np.datetime64("NaT") with 1.0 or even 1. The bug only occurs with the DDF backend since it is the only one that actually checks the types during read. Let's discuss the possible fixes and their implications:
Instead of reading NULL columns back as [None], we could convert them to the target dtype instead. However, this does not work if the target dtype is not even nullable (for example ints), so I would rule this out.
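To illustrate with plain pandas (a sketch of the underlying behavior, not kartothek code): casting an all-None object column to a nullable target dtype works fine, but casting it to a non-nullable dtype such as int64 fails, which is why this option is ruled out.

```python
import pandas as pd

# An all-None column read back from parquet starts life as object dtype.
col = pd.Series([None, None], dtype=object)

# Casting to a nullable target dtype works: each None becomes NaT.
as_dates = col.astype("datetime64[ns]")
assert as_dates.isna().all()

# Casting to a non-nullable target dtype fails, because numpy
# integers have no missing-value representation.
try:
    col.astype("int64")
except (TypeError, ValueError) as exc:
    print(f"cannot cast: {exc}")
```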
The reasoning behind the acceptance of all-NULL columns was that you cannot detect the type properly if on the pandas side the column is an object column (e.g. you cannot know if it was meant to be string, date32, list[int], ...). I would propose we limit the check to actual object columns.
Now we run into one issue here: dates_as_object. Our API currently allows dates to be both object and pandas datetime64[ns] columns. This design choice resulted in many issues, since you have to handle it (even depending on the concrete flag the user provided) everywhere in the code: during writes, during index evaluation, during predicate pushdown, etc. I would therefore propose we drop this flag first (new hard-wired default is dates_as_object=True) and then we implement stricter NULL checks.
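A hypothetical sketch of what the stricter NULL check could look like (the function name and signature are illustrative, not kartothek API): all-None columns are only accepted when the stored schema declares the column as object, since that is the one case where the type is genuinely undecidable.

```python
import pandas as pd

def validate_all_null_column(series: pd.Series, expected_dtype) -> None:
    """Hypothetical stricter check: accept an all-None column only if the
    schema declares the column as object (where the type is undecidable)."""
    if not series.isna().all():
        return  # non-null values present; regular dtype validation applies
    if pd.api.types.is_object_dtype(expected_dtype):
        return  # genuinely ambiguous: allow
    raise TypeError(
        f"all-null column {series.name!r} not allowed for "
        f"non-object dtype {expected_dtype}"
    )
```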
CC @florian-jetter-jdas
> The reasoning behind the acceptance of all-NULL columns was that you cannot detect the type properly if on the pandas side the column is an object column (e.g. you cannot know if it was meant to be string, date32, list[int], ...). I would propose we limit the check to actual object columns.
Not sure if I get what you're trying to say in the last sentence. I've confirmed that the bug indeed also occurs for other non-object dtypes.
What I'd see then as a solution is having a type check on write which disallows updating a column with an object dtype if the column is not defined as an object data type (e.g. int, float).
However, this would not solve the case for datetime columns: as you mention, because of dates_as_object we currently accept both object and datetime64[ns] types for these columns. Hence, if we want to keep this flag, I would suggest we convert these columns to datetime64[ns] before write.
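A minimal sketch of that normalization with plain pandas (not kartothek code): converting an object date column to datetime64[ns] before write turns Python None into NaT and yields a single canonical dtype for the type check.

```python
import datetime
import pandas as pd

# A dates_as_object-style column: Python date objects plus a None.
col = pd.Series([datetime.date(2020, 1, 1), None], dtype=object)

# Normalizing before write: None becomes NaT, dtype becomes datetime64[ns].
normalized = pd.to_datetime(col)
print(normalized.dtype)  # datetime64[ns]
```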
> What I'd see then as a solution is having a type check on write which disallows updating a column with an object dtype if the column is not defined as an object data type (e.g. int, float).
This is already forbidden, but if the entire column of a partition is None we do not know which type it is and simply allow it.
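For illustration (plain pandas, not kartothek code): an all-None partition carries no type information on the pandas side, so a write-time check has nothing to compare against.

```python
import pandas as pd

# Two partitions of what is meant to be the same integer column:
part1 = pd.Series([1, 2, 3], name="x")
part2 = pd.Series([None, None, None], name="x")

# The all-None partition is indistinguishable from any other all-None
# column on the pandas side -- its dtype is simply "object":
print(part1.dtype, part2.dtype)  # int64 object
```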
> I would suggest we convert these columns to datetime64[ns] before write.
Honestly, I would like to stay away from type casting as far as possible, especially since we usually deal with four slightly different type systems here (pandas, numpy, arrow and parquet). I would prefer drawing a hard line here.
> I would therefore propose we drop this flag first (new hard-wired default is dates_as_object=True) and then we implement stricter NULL checks.
The cost of maintaining this outweighs the value of the feature, so this is fine by me.
Problem description
When working with datetime columns and using dates_as_object=False, I expect null values to be np.datetime objects (i.e. np.datetime64("NaT")). However, kartothek is allowing an update with a schema mismatch, as one can add a partition with Python None values for a datetime column. When reading back the data, problems occur when concatenating the data with different data types: eager reads the datetime column as an object, whereas dask raises an exception.
Example code (ideally copy-pastable)
Used versions
git master branch (3.5.2.dev13+g0bddd80)