dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

AttributeError: 'SparseDtype' object has no attribute 'itemsize' (Support for Pandas SparseArray columns) #464

Open danielchalef opened 5 years ago

danielchalef commented 5 years ago

Fastparquet does not appear to support writing Dask dataframes with Pandas SparseArray columns. Doing so fails with:

AttributeError: 'SparseDtype' object has no attribute 'itemsize'

Pandas: 0.25.1 Dask: 2.4.0 Fastparquet: 0.3.2
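A workaround (not from the thread; a hedged sketch) is to densify any sparse columns before handing the frame to fastparquet, since the write path trips over `SparseDtype` lacking the `itemsize` attribute that plain numpy dtypes have:

```python
import numpy as np
import pandas as pd

# A frame with a sparse column -- the case the traceback comes from.
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([1.0, np.nan, 2.0]),
    "b": [10, 20, 30],
})

# Densify sparse columns so fastparquet only sees plain numpy dtypes.
dense = df.copy()
for col in dense.columns:
    if isinstance(dense[col].dtype, pd.SparseDtype):
        dense[col] = dense[col].sparse.to_dense()

# `dense` can now be written with fastparquet / dask's to_parquet as usual.
```

This loses the memory savings of the sparse representation at write time, but produces a file any reader can consume.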

martindurant commented 5 years ago

@TomAugspurger , can you think of a general way to deal with the various new types, or do we need a special case for each? Sparse should en/decode very well to parquet, which also stores an array of nulls and the non-null values.
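The point that sparse maps naturally onto parquet's nulls-plus-values layout can be seen directly from pandas' sparse internals (a sketch; `sp_values`/`sp_index` are pandas' names, not anything in fastparquet):

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.0, np.nan, 2.0, np.nan])

# The non-fill values, already separated out by pandas...
values = arr.sp_values

# ...and the positions where they sit in the dense array; every other
# position holds the fill value (NaN here). This is essentially the same
# split parquet makes between definition levels and the value stream.
positions = arr.sp_index.to_int_index().indices
```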

TomAugspurger commented 5 years ago

It's probably special cases for now. We have a general issue at https://github.com/pandas-dev/pandas/issues/20612.

Not relevant for fastparquet, but pyarrow has a __arrow_array__ protocol that arrays can implement to convert themselves to the correct format for serialization.

martindurant commented 5 years ago

Using fastparquet to convert from arrow data structures doesn't seem too useful :|

TomAugspurger commented 5 years ago

Yeah, I wasn't suggesting that. That's the approach pyarrow is taking, but I don't think fastparquet defining a protocol is reasonable.


martindurant commented 5 years ago

@danielchalef , would you like to have a stab at encoding the sparse type? I don't expect to have time to do this myself in the near future.

danielchalef commented 5 years ago

@martindurant I'd love to say "yes", but time and skill may limit how useful I'd be. Do you have a design approach in mind?

martindurant commented 5 years ago

It's been a little while since I've delved in the code, but in the column chunk writer, it basically already separates out the set of nulls and the (non-null) values, around here. In the sparse case, this is already done. The only tricky thing would be ensuring that the schema gets the right dtype before writing begins.
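The "right dtype" for the schema would presumably be the sparse column's dense subtype. A hypothetical helper (not existing fastparquet code) might look like:

```python
import numpy as np
import pandas as pd

def schema_dtype(dtype):
    """Map a column's pandas dtype to the dtype the parquet schema declares.

    Hypothetical sketch: a SparseDtype column would be declared with its
    dense subtype, since the missing entries are already covered by
    parquet's nulls/definition levels.
    """
    if isinstance(dtype, pd.SparseDtype):
        return dtype.subtype
    return np.dtype(dtype)
```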

When reading these, there would be no fundamental difference between columns created from sparse versus any column that happens to have any nulls. The information could be stored in the metadata, though; @TomAugspurger , do you happen to know if there is a pandas metadata update to specify the "logical" type for an extension column like this?

TomAugspurger commented 5 years ago

I don't really follow, but in the case of a SparseArray, the metadata would be on SparseDtype, which has

And the two arrays to write are

I don't know if there's any need for null handling. Can you write NaN in float arrays?
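The code snippets in the comment above were lost in extraction; presumably they referred to pandas' sparse attributes, roughly:

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.0, np.nan, 2.0])

# Metadata lives on the dtype:
dtype = arr.dtype
subtype = dtype.subtype      # the dense numpy dtype (float64 here)
fill = dtype.fill_value      # NaN by default for a float sparse array

# And the two arrays one would write:
values = arr.sp_values       # the stored (non-fill) values
index = arr.sp_index         # where those values sit in the dense array
```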

martindurant commented 5 years ago

By metadata, I mean the JSON representation that goes in the footer, because the column itself will be indistinguishable from the same dense array.
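For reference, the `pandas` key/value entry in the parquet footer records per-column `pandas_type`/`numpy_type` pairs; a sketch of how a sparse column's logical type could be recorded there (the exact payload is an assumption, not an existing spec):

```python
import json

# Sketch of the 'pandas' metadata blob stored in the parquet footer.
# The field names follow the pandas-metadata convention; using the
# string form of the SparseDtype as the "logical" type is hypothetical.
pandas_meta = {
    "columns": [
        {
            "name": "a",
            "field_name": "a",
            "pandas_type": "float64",              # how it reads back by default
            "numpy_type": "Sparse[float64, nan]",  # the logical pandas dtype
            "metadata": None,
        }
    ],
    "pandas_version": "0.25.1",
}

blob = json.dumps(pandas_meta)  # what would go into the footer key/value pairs
```

A reader aware of this entry could reconstruct the `SparseArray` on load; any other reader would simply get a dense float column with nulls.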