data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Add `slice_rows` to interchange protocol #349

Closed MarcoGorelli closed 4 months ago

MarcoGorelli commented 9 months ago

closes #204

kkraus14 commented 9 months ago

In the case of pandas, or any other dataframe library that doesn't use the Arrow memory layout under the hood, the library would presumably materialize Arrow-format memory on the __dataframe__ call and then have to slice that memory, which isn't free if the data contains strings or the slice has a step size. This is already potentially a problem when selecting columns as well, so I guess this inefficiency is nothing new?

Additionally, it makes it a bit hard to reason about when the producer and when the consumer should do row selection. For example, if Polars is consuming data from, say, PyArrow, I imagine Polars would rather handle row slicing itself (assuming you hit a situation where it's not pure pointer arithmetic). In the situation of pandas consuming data from, say, Polars, you'd probably want Polars to handle the row slicing.

Arrow interchange protocols handle the slicing case (ignoring step size) by allowing specifying an offset and a size. Maybe we can do something similar here?
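To illustrate the offset-and-size idea, here is a minimal pure-Python sketch. It assumes an Arrow-style string column made of one concatenated data buffer plus an offsets list. The class `StringColumnView` and its methods are hypothetical names invented for this illustration, not part of Arrow or the interchange protocol. It shows why a contiguous slice can share buffers while a strided selection has to rebuild them.

```python
class StringColumnView:
    """Toy Arrow-style string column (hypothetical, for illustration only)."""

    def __init__(self, data, offsets, offset, length):
        self.data = data        # all strings concatenated into one bytes object
        self.offsets = offsets  # offsets[i]:offsets[i + 1] delimits string i
        self.offset = offset    # index of the first logical row
        self.length = length    # number of logical rows

    @classmethod
    def from_strings(cls, values):
        encoded = [v.encode("utf-8") for v in values]
        offsets = [0]
        for chunk in encoded:
            offsets.append(offsets[-1] + len(chunk))
        return cls(b"".join(encoded), offsets, 0, len(values))

    def slice(self, offset, length):
        """Contiguous slice: buffers are shared, only two integers change."""
        if offset < 0 or length < 0 or offset + length > self.length:
            raise IndexError("slice out of bounds")
        return StringColumnView(self.data, self.offsets, self.offset + offset, length)

    def take_every(self, step):
        """Strided selection: the buffers are rebuilt, so this copies data."""
        return StringColumnView.from_strings(self.to_list()[::step])

    def to_list(self):
        rows = range(self.offset, self.offset + self.length)
        return [self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8") for i in rows]


col = StringColumnView.from_strings(["a", "bb", "ccc", "dddd", "eeeee"])
window = col.slice(1, 3)      # shares col.data and col.offsets
strided = col.take_every(2)   # allocates new buffers
```

In this sketch, `window` reuses the original buffers, whereas `strided` cannot, which is the cost difference between an offset/size slice and a slice with a step.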

MarcoGorelli commented 9 months ago

sounds good, thanks

kkraus14 commented 9 months ago

Do we expect / want to encourage developers using dataframe libraries to explicitly call __dataframe__ themselves as opposed to using libraryx.from_dataframe(...)? It feels a bit funky to me currently that we go from say:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df)

to:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.__dataframe__().select_columns(...).slice_rows(...))

My 2c is that this is just highlighting the lack of standard API here and that the experience should be something along the lines of (ignoring API names for column selection and row slicing):

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.cols(...).slice_rows(...))

kkraus14 commented 9 months ago

It would be good to have others chime in here: this interchange protocol is already being adopted, so we probably don't want to introduce something and later decide to change or remove it.

MarcoGorelli commented 9 months ago

It's what plotly already does to not have to convert the entire dataframe

MarcoGorelli commented 9 months ago

Any updates here please?

This is the only thing I plan to try adding to the interchange protocol, promised

I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:

- select columns (currently possible)
- select rows (not possible)

cc @rgommers @jorisvandenbossche

MarcoGorelli commented 9 months ago

gentle ping

(would really like to get this in for pandas 3.0 tbh, and this topic actually has a real world use case https://github.com/microsoft/vscode-jupyter/pull/13951)


> this is just highlighting the lack of standard API here

the "standard api" solution would be:

pandas.from_dataframe(pl_df.__dataframe_consortium_standard__().select(...).take(...))

does that really look any less clunky?

anmyachev commented 8 months ago

> I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:
>
> - select columns (currently possible)
> - select rows (not possible)

The ability to select a subset of rows in addition to selecting columns seems harmonious.
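As a rough sketch of how the two preselection steps could compose, the toy class below mimics an interchange object. `__dataframe__`, `column_names`, `num_rows` and `select_columns_by_name` follow names used by the existing protocol. `slice_rows(start, stop)` is the method proposed in this PR, and its signature here is an assumption. `ToyInterchangeFrame` and `to_dict` are hypothetical helpers, not part of any library.

```python
class ToyInterchangeFrame:
    """Toy stand-in for an interchange object (hypothetical, not a real class)."""

    def __init__(self, columns, start=0, stop=None):
        self._columns = columns  # dict mapping column name -> list of values
        n_rows = len(next(iter(columns.values()))) if columns else 0
        self._start = start
        self._stop = n_rows if stop is None else stop

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        return self

    def column_names(self):
        return list(self._columns)

    def num_rows(self):
        return self._stop - self._start

    def select_columns_by_name(self, names):
        # Column preselection: already possible in the protocol today.
        picked = {name: self._columns[name] for name in names}
        return ToyInterchangeFrame(picked, self._start, self._stop)

    def slice_rows(self, start, stop):
        # Row preselection: proposed in this PR; this signature is assumed.
        if start < 0 or stop < start:
            raise ValueError("expected 0 <= start <= stop")
        new_start = min(self._start + start, self._stop)
        new_stop = min(self._start + stop, self._stop)
        return ToyInterchangeFrame(self._columns, new_start, new_stop)

    def to_dict(self):
        # Hypothetical helper: materialize only the selected window.
        return {k: v[self._start:self._stop] for k, v in self._columns.items()}


frame = ToyInterchangeFrame(
    {"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"], "c": [0.1, 0.2, 0.3, 0.4]}
)
subset = frame.__dataframe__().select_columns_by_name(["a", "b"]).slice_rows(1, 3)
```

Nothing is copied until `to_dict` runs, so a consumer would only convert the columns and rows it asked for.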

Implementation in Modin should not be a problem.

+1

MarcoGorelli commented 5 months ago

Any updates please?

MarcoGorelli commented 4 months ago

closing due to lack of interest (this PR has been open for 5 months), thanks all for comments