Closed MarcoGorelli closed 4 months ago
In the case of something like pandas, or another dataframe library that doesn't use the Arrow memory layout under the hood, they'd presumably materialize Arrow data on the __dataframe__ call and then have to slice the Arrow-format memory, which isn't free if it contains strings or the slice has a step size. This is already potentially a problem when selecting columns as well, so I guess this inefficiency is nothing new?
Additionally, it makes it a bit hard to reason about when the producer versus the consumer should do row selection. For example, if Polars is consuming data from, say, PyArrow, I imagine Polars would rather handle row slicing itself (assuming you hit a situation where it's not pure pointer arithmetic). In the case of pandas consuming data from, say, Polars, you'd probably want Polars to handle the row slicing.
Arrow interchange protocols handle the slicing case (ignoring step size) by letting you specify an offset and a size. Maybe we can do something similar here?
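To make the offset-and-size idea concrete, here is a minimal pure-Python sketch. `ColumnView` and its `slice_rows` method are hypothetical names invented for illustration; they are not part of the interchange protocol or of any Arrow library. The sketch only shows that a step-free slice can be expressed as metadata over a shared buffer, with copying deferred until materialization.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ColumnView:
    """Hypothetical view over a shared buffer, described by offset + length."""

    buffer: Sequence  # underlying data, shared between views and never copied here
    offset: int
    length: int

    def slice_rows(self, start: int, stop: int) -> "ColumnView":
        # Clamp to the bounds of this view, similar to Python slicing.
        start = max(0, min(start, self.length))
        stop = max(start, min(stop, self.length))
        # Only the metadata changes; the buffer object is reused as-is.
        return ColumnView(self.buffer, self.offset + start, stop - start)

    def to_list(self) -> list:
        # Materialization is the only step that actually copies data.
        return list(self.buffer[self.offset:self.offset + self.length])


col = ColumnView(buffer=[10, 20, 30, 40, 50], offset=0, length=5)
# Slices compose: each one is relative to the view it is called on.
view = col.slice_rows(1, 4).slice_rows(1, 3)
```

A step size does not fit this representation, which matches the "ignoring step size" caveat above: a strided selection would need either extra metadata or a copy.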
sounds good, thanks
Do we expect / want to encourage developers using dataframe libraries to explicitly call __dataframe__ themselves as opposed to using libraryx.from_dataframe(...)? It feels a bit funky to me currently that we go from say:
pl_df = ... # My polars dataframe
pdf = pandas.from_dataframe(pl_df)
to:
pl_df = ... # My polars dataframe
pdf = pandas.from_dataframe(pl_df.__dataframe__().select_columns(...).slice_rows(...))
My 2c is that this is just highlighting the lack of standard API here and that the experience should be something along the lines of (ignoring API names for column selection and row slicing):
pl_df = ... # My polars dataframe
pdf = pandas.from_dataframe(pl_df.cols(...).slice_rows(...))
It would be good to have others chime in here: this interchange protocol is already being adopted, so we probably don't want to introduce something and later decide to change / remove it.
It's what plotly already does to not have to convert the entire dataframe
Any updates here please?
This is the only thing I plan to try adding to the interchange protocol, promised
I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:
- select columns (currently possible)
- select rows (not possible)
cc @rgommers @jorisvandenbossche
gentle ping
(would really like to get this in for pandas 3.0 tbh, and this topic actually has a real world use case https://github.com/microsoft/vscode-jupyter/pull/13951)
> this is just highlighting the lack of standard API here
the "standard api" solution would be:
pandas.from_dataframe(pl_df.__dataframe_consortium_standard__().select(...).take(...))
does that really look any less clunky?
> I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:
> - select columns (currently possible)
> - select rows (not possible)
The ability to select a subset of rows in addition to selecting columns seems harmonious.
Implementation in Modin should not be a problem.
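As a rough illustration of the column-then-row preselection discussed in this thread, the sketch below uses a toy stand-in for the object returned by `__dataframe__()`. `ToyInterchangeFrame` and `materialize` are hypothetical, and the `slice_rows(start, stop)` signature is an assumption, since this thread does not settle one. `select_columns_by_name` mirrors the name of the existing column-selection method in the interchange protocol, but this is not the protocol's implementation.

```python
from typing import Dict, List, Optional


class ToyInterchangeFrame:
    """Toy stand-in for an interchange object; not a real protocol implementation."""

    def __init__(self, columns: Dict[str, list], start: int = 0,
                 stop: Optional[int] = None):
        self._columns = columns
        n_rows = len(next(iter(columns.values()))) if columns else 0
        self._start = start
        self._stop = n_rows if stop is None else stop

    def select_columns_by_name(self, names: List[str]) -> "ToyInterchangeFrame":
        # Column preselection: keep references to the chosen columns only.
        kept = {name: self._columns[name] for name in names}
        return ToyInterchangeFrame(kept, self._start, self._stop)

    def slice_rows(self, start: int, stop: int) -> "ToyInterchangeFrame":
        # Row preselection relative to the current view; no data is copied here.
        n = self._stop - self._start
        start = max(0, min(start, n))
        stop = max(start, min(stop, n))
        return ToyInterchangeFrame(self._columns, self._start + start,
                                   self._start + stop)

    def materialize(self) -> Dict[str, list]:
        # Stand-in for a consumer's from_dataframe: only the selection is copied.
        return {k: v[self._start:self._stop] for k, v in self._columns.items()}


frame = ToyInterchangeFrame(
    {"a": [1, 2, 3, 4], "b": list("wxyz"), "c": [0.1, 0.2, 0.3, 0.4]}
)
# Preview use case: two columns, first two rows, without converting the rest.
preview = frame.select_columns_by_name(["a", "b"]).slice_rows(0, 2).materialize()
```

The point of the sketch is that a consumer such as a data viewer would only pay for the rows and columns it asked for, which is the motivation given for the plotly and vscode-jupyter cases mentioned earlier.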
+1
Any updates please?
closing due to lack of interest (this PR has been open for 5 months), thanks all for comments
closes #204