data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Sparse columns #55

Open ogrisel opened 2 years ago

ogrisel commented 2 years ago

Should a dedicated API/column metadata to efficiently support sparse columns be part of the spec?

Context

It can be the case that a given column has more than 99% of its values null or missing (or equal to some other repeated constant value). In that case we waste both memory and computation unless we use a dedicated memory representation that does not explicitly materialize these repeated values.
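
For illustration, a minimal sketch with pandas (one library that already ships sparse columns); the exact sizes are approximate and version-dependent:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dense_values = np.zeros(1_000_000)
nonzero_idx = rng.choice(dense_values.size, size=10_000, replace=False)
dense_values[nonzero_idx] = rng.standard_normal(10_000)  # ~1% non-zero

dense = pd.Series(dense_values)
# Sparse representation: only the non-fill values and their indices are stored.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(dense.memory_usage(deep=True))   # ~8 MB: every zero is materialized
print(sparse.memory_usage(deep=True))  # ~120 kB: 10k values + 10k int32 indices
```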

Use cases

Limitations

Survey of existing support

(incomplete, feel free to edit or comment)

Questions:

ogrisel commented 2 years ago

Note: there is a dedicated discussion for single-column categorical data representation in #41.

rgommers commented 2 years ago

Should sparse datastructures be allowed to represent both missingness and nullness or only one of those? (I assume both would be useful as pandas does with the fill_value param)

That's a really subtle question, which isn't even worked out in array/tensor libraries that provide sparse data structures. My first impression was to leave it undefined, because the interpretation does not necessarily depend on the memory layout. However, there is an interaction with the existing missing-data support, so that may not be feasible.

fill_value was looked at quite a bit for PyTorch, but it brings additional complexity and there seem to be very limited use cases for non-zero fill values.
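
For reference, a small sketch of how pandas' fill_value lets the same storage scheme cover both interpretations; only the non-fill entries are materialized:

```python
import numpy as np
import pandas as pd

# Zeroness: zeros are implicit; a NaN would be an explicitly stored value.
zeros = pd.arrays.SparseArray([0.0, 0.0, 1.5, 0.0], fill_value=0.0)

# Missingness: NaN is implicit; a zero would be an explicitly stored value.
missing = pd.arrays.SparseArray([np.nan, np.nan, 1.5, np.nan], fill_value=np.nan)

print(zeros.sp_values, missing.sp_values)    # [1.5] [1.5]: stored values only
print(zeros.fill_value, missing.fill_value)  # 0.0 nan
```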

rgommers commented 2 years ago

Should this be some kind of optional module / extension of the main dataframe API spec?

It seems like there are only a few libraries that support sparse columns. Perhaps a first step would be to use the metadata attribute to store a sparse column and see if two of those libraries can be made to work together. A concrete use case would help a lot.
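
A purely hypothetical sketch of what that could look like (the sparse.* metadata keys and both helper functions are invented for illustration and are not part of any spec): two libraries agree out of band to exchange only the indices and values of the non-fill entries, plus the logical column length.

```python
import numpy as np
from scipy import sparse

def export_sparse_column(length, indices, values, fill_value=0.0):
    """Producer side: pack a sparse column into an agreed-upon metadata dict."""
    return {
        "sparse.length": length,          # logical number of rows
        "sparse.indices": indices,        # positions of the non-fill entries
        "sparse.values": values,          # the non-fill values themselves
        "sparse.fill_value": fill_value,  # zeroness (0.0) or missingness (nan)
    }

def import_sparse_column(meta):
    """Consumer side: rebuild a scipy.sparse column without densifying."""
    rows = np.asarray(meta["sparse.indices"])
    cols = np.zeros(rows.size, dtype=np.intp)
    return sparse.coo_matrix(
        (meta["sparse.values"], (rows, cols)),
        shape=(meta["sparse.length"], 1),
    )

meta = export_sparse_column(1_000_000, np.array([3, 17]), np.array([1.5, -2.0]))
print(import_sparse_column(meta).nnz)  # 2
```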

Memory-layout-wise, sparse is a bit of a problem. pandas seems to use COO; scipy.sparse has many formats, of which CSR/CSC are the most performant. It would be nontrivial to write a clear memory layout description here that isn't overly complex.
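
To make the layout differences concrete, a small scipy.sparse illustration; converting between formats is cheap compared to densifying:

```python
from scipy import sparse

coo = sparse.random(1_000, 1_000, density=0.01, format="coo", random_state=0)
csr = coo.tocsr()  # row pointers + column indices: fast row slicing/products

print(coo.row[:5], coo.col[:5])         # COO: explicit (row, col) coordinates
print(csr.indptr[:5], csr.indices[:5])  # CSR: compressed row pointers
```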

ogrisel commented 2 years ago

A concrete use case would help a lot.

A concrete use case would be to do lossless round-trip conversions of very sparse data between libraries that implement sparse columns (for zeroness or missingness, or ideally both) without triggering an unexpectedly large memory allocation, a MemoryError, or the OOM killer.

For instance, we could have a dataframe storing the one-hot encoded representation of 6M Wikipedia abstracts, with 100,000 columns for the 100,000 most frequent words in Wikipedia. Assuming Wikipedia abstracts have much fewer than 1000 words on average, this should easily fit in memory using a sparse representation, but it would probably break (or be very inefficient) if the conversion silently tries to materialize the zeros.
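
A back-of-the-envelope estimate for that example (assuming float64 values and int32 column indices; the exact per-library overhead varies):

```python
n_rows, n_cols = 6_000_000, 100_000
dense_bytes = n_rows * n_cols * 8     # ~4.8 TB: densifying is a non-starter
nnz = n_rows * 1_000                  # upper bound: <1000 words per abstract
sparse_bytes = nnz * (8 + 4)          # non-zero values + column indices
print(f"{dense_bytes / 1e12:.1f} TB dense vs <{sparse_bytes / 1e9:.0f} GB sparse")
```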

ogrisel commented 2 years ago

That being said, I am not sure that dataframe libraries are often used for this kind of sparse data manipulation. Furthermore, text processing with one-hot encoding is less and less popular now that most interesting NLP tasks are handled with lower-dimensional dense embeddings from pre-trained neural networks.

rgommers commented 2 years ago

Thanks @ogrisel, the application makes a lot of sense.

That being said, I am not sure that dataframe libraries are used often for this kind of sparse data manipulation.

Indeed, by "use case" I also meant: can this actually be done today with two dataframe libraries? If no two libraries support the same format of sparse data, then adding the capability to the protocol may be a bit premature.

ogrisel commented 2 years ago

pandas and vaex both support sparse data (for zeroness) without materialization, although with different memory layouts: vaex uses a scipy.sparse CSR matrix, while pandas has individual sparse columns.

arrow has null chunks that do not store any values if a full chunk is null.
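
This is easy to verify with pyarrow (a chunk of the null type allocates no buffers at all):

```python
import pyarrow as pa

chunk = pa.nulls(1_000_000)  # an all-null array of type "null"
print(chunk.null_count)      # 1000000
print(chunk.nbytes)          # 0: no validity or value buffers are stored
```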

rgommers commented 2 years ago

So we should probably have a prototype that goes from one of pandas/Vaex/Arrow to another of those libraries without a densification step in between. That may result in something that can be generalized. Given that scipy.sparse can convert between CSR and COO efficiently, and pandas is based on COO (with df.sparse.to_coo() to export to the scipy.sparse format), that should be doable.
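
A minimal sketch of the pandas ↔ scipy.sparse leg of such a prototype, assuming the scipy COO/CSR formats as the common ground:

```python
import pandas as pd
from scipy import sparse

coo = sparse.random(10_000, 100, density=0.01, format="coo", random_state=0)

# scipy COO -> pandas sparse columns, with no dense intermediate.
df = pd.DataFrame.sparse.from_spmatrix(coo)

# pandas -> scipy COO -> CSR; the format change is a cheap re-layout.
csr = df.sparse.to_coo().tocsr()

assert (csr != coo.tocsr()).nnz == 0  # lossless round trip
```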