JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License

kartothek allows updating a non-object column with Python `None` values #184

Open lr4d opened 4 years ago

lr4d commented 4 years ago

Problem description

When working with datetime columns and using dates_as_object=False, I expect null values to be numpy datetime64 values (i.e. np.datetime64("NaT")). However, kartothek allows an update with a schema mismatch: one can add a partition with Python None values for a datetime column. When reading the data back, problems occur when concatenating data with different dtypes: the eager backend reads the datetime column as object, whereas dask raises an exception.

Example code (ideally copy-pastable)

from functools import partial
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd
import storefact

from kartothek.io.dask.dataframe import read_dataset_as_ddf
from kartothek.io.eager import (
    store_dataframes_as_dataset,
    read_table,
    update_dataset_from_dataframes,
)

store_factory = partial(
    storefact.get_store_from_url, f"hfs://{TemporaryDirectory().name}"
)
dataset_uuid = "uuid"

# Write an initial partition whose datetime column contains only NaT,
# so the stored schema records "date" as a timestamp column.
dm = store_dataframes_as_dataset(
    dataset_uuid=dataset_uuid,
    store=store_factory,
    dfs=[pd.DataFrame({"date": [np.datetime64("NaT")]})],
)

initial_dtype = read_table(
    store=store_factory, dataset_uuid=dataset_uuid, table="table"
).dtypes  # datetime64

# Update the dataset with a partition where "date" is a Python None;
# kartothek accepts this despite the mismatch with the stored schema.
dm = update_dataset_from_dataframes(
    dataset_uuid=dataset_uuid,
    store=store_factory,
    df_list=[pd.DataFrame({"date": [None]})],
)

updated_dtype = read_table(
    store=store_factory, dataset_uuid=dataset_uuid, table="table", dates_as_object=False
).dtypes  # object

# Using dask
t = read_dataset_as_ddf(store=store_factory, dataset_uuid=dataset_uuid, table="table")
t.compute()  # raises exception

Used versions

git master branch (3.5.2.dev13+g0bddd80)

marco-neumann-by commented 4 years ago

This is a type system issue:

We currently allow users to write all-None (NULL for arrow) columns for every payload type, no matter whether the column on the pandas side is/was an object column or not. So the above bug also triggers if you replace np.datetime64("NaT") with 1.0 or even 1. The bug only surfaces with the DDF backend since it is the only one that actually checks the types during read. To solve this, let's discuss the possible fixes and their implications:

A: Convert NULL Columns During Read

Instead of reading NULL columns back as [None], we could convert them to the target dtype instead. However, this does not work if the target dtype is not even nullable (for example ints), so I would rule this out.
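
A quick sketch of why this is ruled out, using plain pandas casting (the exact exception text depends on the pandas version): a nullable target dtype can absorb the Nones, a non-nullable one cannot.

import pandas as pd

s = pd.Series([None, None], dtype=object)
s.astype("datetime64[ns]")  # works: every value becomes NaT
s.astype("int64")           # raises TypeError, since int64 has no null representation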

B: Stricter NULL check

The reasoning behind the acceptance of all-NULL columns was that you cannot detect the type properly if on the pandas side the column is an object column (e.g. you cannot know if it was meant to be string, date32, list[int], ...). I would propose we limit the check to actual object columns.
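
As a standalone illustration of that ambiguity (plain pandas/pyarrow, independent of kartothek): an all-None object column carries no type information for arrow to infer, so it comes out as the null type.

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [None, None]})
print(df.dtypes)                        # pandas side: x is object
print(pa.Table.from_pandas(df).schema)  # arrow side: x is inferred as null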

Now we run into one issue here: dates_as_object. Our API currently allows dates to be both object and pandas datetime64[ns] columns. This design choice has resulted in many issues, since you have to handle both representations (depending on the concrete flag the user provided) everywhere in the code: during writes, during index evaluation, during predicate pushdown, etc. I would therefore propose we drop this flag first (the new hard-wired default being dates_as_object=True) and then implement stricter NULL checks.
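
For concreteness, a rough sketch of what such a stricter NULL check could look like. This is purely hypothetical (the function name and the exact rule are assumptions, not kartothek's actual validation code): only accept an ambiguous all-None object column if the stored type is itself object-backed, otherwise require a properly typed null such as np.datetime64("NaT").

import pandas as pd
import pyarrow as pa

def validate_all_null_columns(df: pd.DataFrame, stored_schema: pa.Schema) -> None:
    for name in df.columns:
        col = df[name]
        # Only all-None *object* columns are ambiguous; typed null columns
        # (e.g. datetime64[ns] full of NaT) already carry their dtype.
        if col.dtype == object and col.isna().all():
            stored_type = stored_schema.field(name).type
            if not (pa.types.is_string(stored_type) or pa.types.is_binary(stored_type)):
                raise ValueError(
                    f"Column {name!r} is all-None with object dtype, but the stored "
                    f"schema expects {stored_type}; cast the column before writing."
                )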

CC @florian-jetter-jdas

lr4d commented 4 years ago

The reasoning behind the acceptance of all-NULL columns was that you cannot detect the type properly if on the pandas side the column is an object column (e.g. you cannot know if it was meant to be string, date32, list[int], ...). I would propose we limit the check to actual object columns.

Not sure if I get what you're trying to say in the last sentence. I've confirmed that the bug indeed also occurs for other non-object dtypes.

What I'd see as a solution, then, is a type check on write which disallows updating a column with an object dtype if the column is not defined as an object data type (e.g. int, float). However, this would not solve the case for datetime columns since, as you mention, we currently accept both object and datetime64[ns] types for these columns because of dates_as_object; hence, if we want to keep this flag, I would suggest we convert these columns to datetime64[ns] before writing.
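
A minimal sketch of that pre-write conversion (a hypothetical helper, not kartothek API; looking the type up in a pyarrow schema is an assumption): coerce every column the stored schema declares as a timestamp to datetime64[ns], so an all-None partition no longer ends up as an object column.

import pandas as pd
import pyarrow as pa

def coerce_datetime_columns(df: pd.DataFrame, stored_schema: pa.Schema) -> pd.DataFrame:
    df = df.copy()
    for name in df.columns:
        if pa.types.is_timestamp(stored_schema.field(name).type):
            # pd.to_datetime maps None to NaT and yields a datetime64[ns] column.
            df[name] = pd.to_datetime(df[name])
    return df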

fjetter commented 4 years ago

What I'd see then as a solution is having a type check on write which disallows updating a column with an object dtype if the column is not defined as an object data type (e.g. int, float).

This is already forbidden, but if the entire column of a partition is None we do not know which type it is and simply allow it.

I would suggest we convert these columns to datetime64[ns] before writing.

Honestly, I would like to stay away from type casting as far as possible, especially since we usually deal with four slightly different type systems here (pandas, numpy, arrow and parquet). I would prefer drawing a hard line here.
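
As a small standalone illustration of how these type systems disagree (plain pandas/pyarrow, independent of kartothek): the same logical date surfaces with a different dtype depending on the layer it passes through.

import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "ts": [pd.Timestamp("2020-01-01")],  # pandas dtype: datetime64[ns]
    "d": [datetime.date(2020, 1, 1)],    # pandas dtype: object
})
table = pa.Table.from_pandas(df)
print(table.schema)              # arrow: ts -> timestamp[ns], d -> date32[day]
print(table.to_pandas().dtypes)  # back in pandas (defaults): ts -> datetime64[ns], d -> object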

I would therefore propose we drop this flag first (new hard-wired default is dates_as_objects=True) and then we implement stricter NULL checks.

The cost of maintaining this outweighs the value of the feature, so this is fine by me.