data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Remove DataFrame.take #347

Closed MarcoGorelli closed 8 months ago

MarcoGorelli commented 9 months ago

@kkraus14 https://github.com/data-apis/dataframe-api/issues/344#issuecomment-1906851412

dataframe with arbitrary / undefined order

If we only have one DataFrame class, and its order is undefined, then DataFrame.take isn't a well-defined operation

Alternatives

Accept some level of re-design, even if it means extra work. But with the current design, DataFrame.take is undefined, so I suggest we remove it first

jorisvandenbossche commented 9 months ago

But then how do you use the methods we have that return indices? (sorted_indices, unique_indices)
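To make the dependency concrete, here is a minimal sketch in plain Python. The `sorted_indices` and `take` helpers below are hypothetical stand-ins that operate on lists; they are not the standard's actual signatures. The point is only that the indices are positions into the current row order, so something like `take` is needed to consume them.

```python
# Hypothetical stand-ins on plain lists; NOT the dataframe API standard's methods.
def sorted_indices(values):
    """Return the positions that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


def take(values, indices):
    """Select elements of `values` by position."""
    return [values[i] for i in indices]


ages = [31, 25, 47]
names = ["carol", "alice", "bob"]

# The indices refer to positions in the current row order...
order = sorted_indices(ages)
# ...so they are only useful if a positional selection exists to consume them.
names_by_age = take(names, order)
```

Without a positional selection, the result of `sorted_indices` has no consumer, which is the concern raised here.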

MarcoGorelli commented 9 months ago

Exactly, you don't

Unless we accept some level of redesign, starting with https://github.com/data-apis/dataframe-api/issues/346

kkraus14 commented 9 months ago

A DataFrame can have an arbitrary or undefined order, but that doesn't mean it has to. If it has a defined order or an arbitrary order, e.g. someone ran a sort operation against it, or the operations run thus far are defined to be order-maintaining, then take is well defined. If someone ran something that makes no ordering guarantees, then the order could be undefined, in which case calling take against it should be able to return an undefined order as well.

The only situation where take is arguably undefined is when the input order is undefined, and that feels like perfectly reasonable behavior to me.
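As a rough illustration in plain Python (an assumption-laden sketch, not the standard's API): the two lists below hold the same rows in different physical orders, standing in for a backend that makes no ordering guarantee. The `take` helper is hypothetical.

```python
# Hypothetical positional selection on a list of rows; not the standard's API.
def take(rows, indices):
    return [rows[i] for i in indices]


# Same logical rows, two possible physical orders (simulating "undefined order").
rows_a = [("x", 3), ("y", 1), ("z", 2)]
rows_b = [("z", 2), ("x", 3), ("y", 1)]

# With undefined input order, take may return different rows.
undefined_a = take(rows_a, [0, 1])
undefined_b = take(rows_b, [0, 1])

# After a sort on the second field, the order is defined and take agrees.
sorted_a = take(sorted(rows_a, key=lambda r: r[1]), [0, 1])
sorted_b = take(sorted(rows_b, key=lambda r: r[1]), [0, 1])
```

Here `undefined_a` and `undefined_b` differ, while `sorted_a` and `sorted_b` are identical, which matches the distinction drawn above.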

Let's continue discussion in #346 regarding Expressions, but I don't think take is a problematic operation.

MarcoGorelli commented 9 months ago

How does a user know if a dataframe has input order defined or not?

shwina commented 9 months ago

How does a user know if a dataframe has input order defined or not?

Some examples:

kkraus14 commented 9 months ago

I think we could also generally specify that operations maintain the input order of the DataFrame unless otherwise noted. I believe we've made sure to add that into the docstrings where appropriate, e.g. joins, groupbys, getting unique values, etc. are documented to not guarantee a specific output order.