Kartothek metadata does not distinguish int64 and Int64

mlondschien commented 3 years ago

In [1]: from functools import partial
   ...: 
   ...: import pandas as pd
   ...: import storefact
   ...: from kartothek.core.common_metadata import empty_dataframe_from_schema
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
   ...: 
   ...: store = partial(storefact.get_store_from_url, "hfs:///tmp")
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "i": [0, 1],
   ...:         "I": [0, pd.NA],
   ...:         "o": ["a", None],
   ...:         "s": ["b", pd.NA],
   ...:         "b": [True, False],
   ...:         "B": [True, pd.NA],
   ...:     }
   ...: )
   ...: df["I"] = df["I"].astype("Int64")
   ...: df["s"] = df["s"].astype("string")
   ...: df["B"] = df["B"].astype("boolean")
   ...: 
   ...: df.dtypes
Out[1]: 
i      int64
I      Int64
o     object
s     string
b       bool
B    boolean
dtype: object

In [2]: dm = store_dataframes_as_dataset(
   ...:     dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
   ...: )
   ...: 
   ...: df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
   ...: df_meta.dtypes
Out[2]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [3]: ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
   ...: ddf._meta.dtypes
Out[3]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [4]: ddf.compute().dtypes
Out[4]: 
B    boolean
I      Int64
b       bool
i      int64
o     object
s     string
dtype: object

Copy pastable below:

```python from functools import partial import pandas as pd import storefact from kartothek.core.common_metadata import empty_dataframe_from_schema from kartothek.io.dask.dataframe import read_dataset_as_ddf from kartothek.io.eager import read_table, store_dataframes_as_dataset store = partial(storefact.get_store_from_url, "hfs:///tmp") df = pd.DataFrame( { "i": [0, 1], "I": [0, pd.NA], "o": ["a", None], "s": ["b", pd.NA], "b": [True, False], "B": [True, pd.NA], } ) df["I"] = df["I"].astype("Int64") df["s"] = df["s"].astype("string") df["B"] = df["B"].astype("boolean") df.dtypes dm = store_dataframes_as_dataset( dfs=[df], dataset_uuid="dataset", store=store, overwrite=True ) df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"]) df_meta.dtypes ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store) ddf._meta.dtypes ddf.compute().dtypes ```

The dtype is incorrectly stored in the kartothek metadata:

In [12]: store().get(dataset/table/_common_metadata)
Out[12]: b'PAR1\x15\x04\x19|5\x00\x18\x06schema\x15\x0c\x00\x15\x00%\x02\x18\x01B\x00\x15\x04%\x02\x18\x01I\x00\x15\x00%\x02\x18\x01b\x00\x15\x04%\x02\x18\x01i\x00\x15\x0c%\x02\x18\x01o%\x00L\x1c\x00\x00\x00\x15\x0c%\x02\x18\x01s%\x00L\x1c\x00\x00\x00\x16\x00\x19\x0c\x19,\x18\x06pandas\x18\x99\x07{"column_indexes": [{"field_name": null, "metadata": {"encoding": "UTF-8"}, "name": null, "numpy_type": "object", "pandas_type": "unicode"}], "columns": [{"field_name": "B", "metadata": null, "name": "B", "numpy_type": "boolean", "pandas_type": "bool"}, {"field_name": "I", "metadata": null, "name": "I", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "bool", "pandas_type": "bool"}, {"field_name": "i", "metadata": null, "name": "i", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "o", "metadata": null, "name": "o", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "s", "metadata": null, "name": "s", "numpy_type": "string", "pandas_type": "unicode"}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 2}], "pandas_version": "1.1.3"}\x00\x18\x0cARROW:schema\x18\xe0\r/////yAFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAANADAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAJkDAAB7ImNvbHVtbl9pbmRleGVzIjogW3siZmllbGRfbmFtZSI6IG51bGwsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifSwgIm5hbWUiOiBudWxsLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNvbHVtbnMiOiBbeyJmaWVsZF9uYW1lIjogIkIiLCAibWV0YWRhdGEiOiBudWxsLCAibmFtZSI6ICJCIiwgIm51bXB5X3R5cGUiOiAiYm9vbGVhbiIsICJwYW5kYXNfdHlwZSI6ICJib29sIn0sIHsiZmllbGRfbmFtZSI6ICJJIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiSSIsICJudW1weV90eXBlIjogImludDY0IiwgInBhbmRhc190eXBlIjogImludDY0In0sIHsiZmllbGRfbmFtZSI6ICJiIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiYiIsICJudW1weV90eXBlIjogImJvb2wiLCAicGFuZGFzX3R5cGUiOiAiYm9vbCJ9LCB7ImZpZWxkX25hbWUiOiAiaSIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogImkiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCJ9LCB7ImZpZWxkX25hbWUiOiAibyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogIm8iLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9LCB7ImZpZWxkX25hbWUiOiAicyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogInMiLCAibnVtcHlfdHlwZSI6ICJzdHJpbmciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJpbmRleF9jb2x1bW5zIjogW3sia2luZCI6ICJyYW5nZSIsICJuYW1lIjogbnVsbCwgInN0YXJ0IjogMCwgInN0ZXAiOiAxLCAic3RvcCI6IDJ9XSwgInBhbmRhc192ZXJzaW9uIjogIjEuMS4zIn0AAAAGAAAA9AAAAKwAAACEAAAAVAAAACwAAAAEAAAANP///wAAAQUUAAAADAAAAAQAAAAAAAAAJP///wEAAABzAAAAWP///wAAAQUUAAAADAAAAAQAAAAAAAAASP///wEAAABvAAAAfP///wAAAQIcAAAADAAAAAQAAAAAAAAAsP///wAAAAFAAAAAAQAAAGkAAACo////AAABBhQAAAAMAAAABAAAAAAAAACY////AQAAAGIAAADM////AAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgAAAAAAAABQAAAAAEAAABJAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEGGAAAABAAAAAEAAAAAAAAAAQABAAEAAAAAQAAAEIAAAAAAAAA\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19l\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x00#\x0b\x00\x00PAR1'

fjetter commented 3 years ago

IIUC, this affects only the schema but we're able to read the data properly?

fjetter commented 3 years ago

Looks like this is because arrow doesn't distinguish between the two and we're defining the _common_metadata purely via the arrow types

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"Int": pd.Series([1, pd.NA], dtype="Int64")})

In [4]: schema = pa.Schema.from_pandas(df)

In [5]: schema
Out[5]:
Int: int64

I'd be curious if there are any practical implications for this discrepancy or if this is rather a 'formal' error

mlondschien commented 3 years ago

We use Kartothek datasets to cache computation results. For validation, we check if the Kartothek metadata matches the expected dtypes. We've not had any issues loading data with such columns yet. So this is mostly an inconvenience (or "formal" error).

fjetter commented 3 years ago

That's very interesting and I'm positively surprised that this works generally. afaik, we do not have any tests using the nullables types in kartothek, yet (but it's about time). If you want to contribute on that front, I suggest to start with adding some nullable ints/bools to https://github.com/JDASoftwareGroup/kartothek/blob/6514c1f06a8df2f8c0b23a643a4114f343d2ccf9/kartothek/serialization/testing.py#L27 and see what breaks

If I understand this correctly, it also only affects integers, correct? bool(eans) are correctly reconstructed.

I assume this is connected to us stripping the metadata from the schema. I'm just wondering why it works for bools https://github.com/JDASoftwareGroup/kartothek/blob/6514c1f06a8df2f8c0b23a643a4114f343d2ccf9/kartothek/core/common_metadata.py#L612-L614

 dm.table_meta["table"].internal().empty_table().to_pandas().dtypes
Out[5]:
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

mlondschien commented 3 years ago

We're not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?

fjetter commented 3 years ago

Is there a specific reason you are not testing for categoricals

We do have tests for categoricals but not systematically as part of the "all types dataframes"

JDASoftwareGroup / kartothek

Kartothek metadata does not distinguish int64 and Int64 #410