Closed brl0 closed 4 years ago
Thanks for the report @brl0. I've confirmed this on master.
The NaNs are dropped in `kartothek.io_components.metapartition.MetaPartition._partition_data` because of pandas' handling of nulls during groupby (see: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby-missing).
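For illustration, a minimal sketch of the pandas behaviour referred to above. The frame and the column names `part` and `value` are made up for this example and are not taken from kartothek. By default, `groupby` skips rows whose key is NaN, so the groups together hold fewer rows than the input:

```python
import numpy as np
import pandas as pd

# Hypothetical example data; "part" plays the role of the partition column.
df = pd.DataFrame({"part": ["a", "b", np.nan, "a"], "value": [1, 2, 3, 4]})

# Default behaviour: the row with a NaN key does not belong to any group.
groups = {key: group for key, group in df.groupby("part")}
rows_after = sum(len(group) for group in groups.values())

# The two counts differ because the NaN-keyed row was skipped.
print(len(df), rows_after)
```

Newer pandas releases (1.1 and later) accept `groupby(..., dropna=False)`, which keeps the NaN group. Whether that option is usable here depends on the pandas versions kartothek has to support.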
@fjetter is this behavior something that should be expected? We might want to document that if it's the case.
Well, it is expected if you know the implementation, but from a user/API perspective we do not advertise that we have the same restrictions as pandas.groupby.
I believe this is a valid request, since the current behaviour silently drops data, which may be fine in analysis scenarios (pandas) but not for data storage. We should not silently drop data. The very least we should do is emit a warning or, even better, raise an exception.
Filling values is tricky, and honestly I don't know how to approach this reasonably without opening us up to rather extreme scope creep or weird APIs.
I would feel most comfortable with raising in this scenario, with the exception suggesting that users take care of the filling themselves. After all, the users know best what sentinel values are appropriate for their application, and we wouldn't need to break roundtrips.
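As a rough sketch of what "users take care of the filling themselves" could look like on the user side: the sentinel string, the frame, and the column names below are all hypothetical, and the store/read step is left out entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; "part" is the column used for partitioning.
df = pd.DataFrame({"part": ["a", "b", np.nan], "value": [1, 2, 3]})

# Hypothetical sentinel; the right value depends on the application.
SENTINEL = "__missing__"

# Before storing: replace missing partition keys with the sentinel.
filled = df.assign(part=df["part"].fillna(SENTINEL))

# With the sentinel in place, grouping no longer loses rows.
rows_after = sum(len(group) for _, group in filled.groupby("part"))

# After reading back: turn the sentinel into NaN again.
restored = filled.assign(part=filled["part"].where(filled["part"] != SENTINEL))
```

The application decides on the sentinel and on whether to map it back, so the library does not have to guess either one.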
I'd be curious how this is handled in arrow, since technically speaking `['a', 'b', 'c', np.nan]` is not a string column (although recognized as such) but a mixed type array.
A side note on the implementation: the invariant we would like to preserve is that the row count is the same before and after partitioning. Checking this instead of checking for nulls/NaNs is probably faster and more universal. If we detect a mismatch, we can of course suggest to the user (or even check once ourselves) that NaNs are a probable cause.
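A minimal sketch of that row-count check, assuming a plain pandas groupby as the partitioning step. `partition_checked` is a hypothetical helper written for this comment and not part of the kartothek API.

```python
import numpy as np
import pandas as pd


def partition_checked(df, column):
    """Split df by column and raise if rows were lost (hypothetical helper)."""
    parts = {key: group for key, group in df.groupby(column)}
    rows_after = sum(len(group) for group in parts.values())
    if rows_after != len(df):
        # Only look for nulls once the cheap row-count check has failed.
        n_null = int(df[column].isna().sum())
        hint = ""
        if n_null:
            hint = f" ({n_null} null value(s) in '{column}' are a probable cause)"
        raise ValueError(
            f"Partitioning on '{column}' lost {len(df) - rows_after} "
            f"of {len(df)} rows{hint}. Fill the missing values before storing."
        )
    return parts


# Usage with made-up data: the second frame has a NaN key and triggers the error.
ok = partition_checked(pd.DataFrame({"part": ["a", "b"], "value": [1, 2]}), "part")
try:
    partition_checked(pd.DataFrame({"part": ["a", np.nan], "value": [1, 2]}), "part")
except ValueError as exc:
    message = str(exc)
```

The count comparison catches any cause of lost rows, and the null check only runs on the failure path to produce a more helpful message.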
Problem description
When partitioning on a column, rows containing `NaN` in that column are dropped silently. It would be nice if there was some sort of warning.
IMO, an even better solution would be to fillna and provide a warning about the difference in the round-trip.
Example code (ideally copy-pastable)
Used versions