Open mlondschien opened 3 years ago
IIUC, this affects only the schema but we're able to read the data properly?
Looks like this is because arrow doesn't distinguish between the two (`Int64` and `int64`), and we're defining the _common_metadata purely via the arrow types:
In [1]: import pyarrow as pa
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"Int": pd.Series([1, pd.NA], dtype="Int64")})
In [4]: schema = pa.Schema.from_pandas(df)
In [5]: schema
Out[5]:
Int: int64
I'd be curious if there are any practical implications of this discrepancy, or if this is rather a 'formal' error.
We use Kartothek datasets to cache computation results. For validation, we check if the Kartothek metadata matches the expected dtypes. We've not had any issues loading data with such columns yet. So this is mostly an inconvenience (or "formal" error).
That's very interesting, and I'm positively surprised that this works in general. AFAIK, we do not have any tests using the nullable types in kartothek yet (but it's about time). If you want to contribute on that front, I suggest starting by adding some nullable ints/bools to https://github.com/JDASoftwareGroup/kartothek/blob/6514c1f06a8df2f8c0b23a643a4114f343d2ccf9/kartothek/serialization/testing.py#L27 and seeing what breaks.
If I understand this correctly, it only affects integers, correct? Booleans are correctly reconstructed.
I assume this is connected to us stripping the metadata from the schema. I'm just wondering why it works for bools https://github.com/JDASoftwareGroup/kartothek/blob/6514c1f06a8df2f8c0b23a643a4114f343d2ccf9/kartothek/core/common_metadata.py#L612-L614
dm.table_meta["table"].internal().empty_table().to_pandas().dtypes
Out[5]:
B boolean
I int64
b bool
i int64
o object
s string
dtype: object
We're not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?
> Is there a specific reason you are not testing for categoricals?
We do have tests for categoricals, but not systematically as part of the "all types dataframes".
Copy pastable below:
The dtype is incorrectly stored in the kartothek metadata: