danielchalef opened this issue 5 years ago
@TomAugspurger, can you think of a general way to deal with the various new types, or do we need a special case for each? Sparse should en/decode very well to parquet, which also stores an array of nulls and the non-null values.
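To illustrate the correspondence: a pandas `SparseArray` already keeps the non-fill values and their positions separately, which lines up naturally with parquet's (definition levels, values) encoding. A minimal sketch, using only public pandas attributes:

```python
import numpy as np
import pandas as pd

# A mostly-null column, stored sparsely (default fill value for floats is NaN).
arr = pd.arrays.SparseArray([np.nan, 1.0, np.nan, 2.0, np.nan])

# The two pieces parquet would also store:
print(arr.sp_values)         # the non-fill values actually kept in memory
print(arr.sp_index.indices)  # the positions of those values in the column
```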
It's probably special cases for now. We have a general issue at https://github.com/pandas-dev/pandas/issues/20612.
Not relevant for fastparquet, but pyarrow has a `__arrow_array__` protocol that arrays can implement to convert themselves to the correct format for serialization.
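As a sketch of that protocol: when `pa.array(obj)` is given an object with an `__arrow_array__` method, pyarrow delegates the conversion to the object itself. The class name and contents below are hypothetical, purely for illustration:

```python
class IntervalSeconds:
    """Hypothetical container that knows how to convert itself for pyarrow."""

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # Deferred import so the class is usable without pyarrow installed;
        # pyarrow calls this hook when pa.array(obj) is invoked.
        import pyarrow as pa
        return pa.array(self.values, type=type or pa.float64())


ivals = IntervalSeconds([0.5, 1.25, 3.0])
print(hasattr(ivals, "__arrow_array__"))
```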
Using fastparquet to convert from arrow data structures doesn't seem too useful :|
Yeah, I wasn't suggesting that. That's the approach pyarrow is taking, but I don't think fastparquet defining a protocol is reasonable.
@danielchalef, would you like to have a stab at encoding the sparse type? I don't expect to have time to do this myself in the near future.
@martindurant I'd love to say "yes", but time and skill may limit how useful I'd be. Do you have a design approach in mind?
It's been a little while since I've delved into the code, but the column chunk writer basically already separates out the set of nulls and the (non-null) values, around here. In the sparse case, that separation is already done. The only tricky thing would be ensuring that the schema gets the right dtype before writing begins.
When reading these, there would be no fundamental difference between columns created from sparse versus any column that happens to have any nulls. The information could be stored in the metadata, though; @TomAugspurger , do you happen to know if there is a pandas metadata update to specify the "logical" type for an extension column like this?
I don't really follow, but in the case of a SparseArray, the metadata would be on the SparseDtype, which has [...]. And the two arrays to write are [...]. I don't know if there's any need for null handling. Can you write NaN in float arrays?
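Concretely, the dtype-level metadata in question lives on `SparseDtype` as two public attributes; a small sketch:

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([np.nan, 1.0, np.nan, 2.0])
dt = arr.dtype  # SparseDtype("float64", nan)

print(dt.subtype)     # dtype of the stored values: float64
print(dt.fill_value)  # what the gaps represent: nan here
```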
By metadata, I mean the JSON representation that goes in the footer, because the written column itself will be indistinguishable from the same data stored as a dense array with nulls.
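For illustration only: pandas renders a sparse dtype as the string `Sparse[float64, nan]`, so one option (an assumption about a possible scheme, not something any engine is confirmed to write) would be to record that logical dtype in the per-column entry of the `pandas` key-value metadata in the footer, e.g.:

```json
{
  "columns": [
    {
      "name": "a",
      "field_name": "a",
      "pandas_type": "float64",
      "numpy_type": "Sparse[float64, nan]",
      "metadata": null
    }
  ]
}
```

A reader could then reconstruct a `SparseDtype` from that string even though the physical column looks like any nullable float column.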
Fastparquet does not appear to support writing Dask dataframes with Pandas SparseArray columns. Doing so fails with:
Versions: pandas 0.25.1, dask 2.4.0, fastparquet 0.3.2
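A minimal construction of the failing input (dask and fastparquet are not invoked here, this only builds the DataFrame shape that triggered the report):

```python
import numpy as np
import pandas as pd

# A frame mixing a dense column with a pandas SparseArray column,
# the combination fastparquet 0.3.2 could not write.
df = pd.DataFrame({
    "dense": [1.0, 2.0, 3.0],
    "sparse": pd.arrays.SparseArray([np.nan, 2.0, np.nan]),
})
print(df.dtypes["sparse"])
```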