data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License
102 stars 20 forks source link

How to implement `get_rows_by_mask`? #118

Closed MarcoGorelli closed 1 year ago

MarcoGorelli commented 1 year ago

The current signature is

    def get_rows_by_mask(self, mask: "Column[bool]") -> DataFrame:

Let's take pandas as an example: I have a standard DataFrame, which is backed by df = pd.DataFrame{'a': [1,2,3]}). I'd like to get rows by the mask [True, False, True], which is inside a Column mask.

How do I get the values out of mask such that I can pass them to the underlying pandas DataFrame and call df.loc[mask]?

dlpack was mentioned in the last call, although that doesn't support booleans. So, even if there was a Column.__dlpack__ method, then something like this wouldn't work:

def get_rows_by_mask(self, mask):
    mask_array = np.from_dlpack(mask)
    return self.dataframe.loc[mask_array].__dataframe_standard__()

as the first line would error with

BufferError: DLPack only supports signed/unsigned integers, float and complex dtypes.
jbrockmendel commented 1 year ago

I'd expect the Column to by library-specific, so the extraction would be library-specific too

MarcoGorelli commented 1 year ago

Current Column is part of the standard though:

https://github.com/data-apis/dataframe-api/blob/main/spec/API_specification/column_object.rst

It's also part of several Standards' signatures

https://github.com/data-apis/dataframe-api/blob/399b5c40c954b55ecc6dfe5aaedaea542007b53a/spec/API_specification/dataframe_api/dataframe_object.py#L39

https://github.com/data-apis/dataframe-api/blob/399b5c40c954b55ecc6dfe5aaedaea542007b53a/spec/API_specification/dataframe_api/dataframe_object.py#L135

https://github.com/data-apis/dataframe-api/blob/399b5c40c954b55ecc6dfe5aaedaea542007b53a/spec/API_specification/dataframe_api/dataframe_object.py#L116

If we don't agree on what's in a column, or on how to use it, then I don't see how one can hope to write

mask = ...  # create a mask
df_standard.get_rows_by_mask(mask)

in a portable manner

jbrockmendel commented 1 year ago

Poor wording on my part. I mean the implementation of Column is library-specific, so i expect the implementation of get_rows_by_mask to also be library-specific.

MarcoGorelli commented 1 year ago

Right, so say we have

class PandasColumn:
    def __init__(self, column):
        self._series = column

and

class PandasDataFrame:
    def get_column_by_name(self, label):
        if not isinstance(label, str):
            raise TypeError()
        return PandasColumn(self.dataframe.loc[:, label])

and then

df  # pandas dataframe
df_std = df.__consortium_dataframe__()
mask = df_std.get_column_by_name('a') > 1
df_std.get_rows_by_mask(mask)

How would we implement get_rows_by_mask? Would we just use private method, like ._series in the above example, which we know exists for PandasColumn by might not exist in general?

class PandasDataFrame:
    def get_rows_by_mask(self, mask):
         return PandasDataFrame(self.dataframe.loc[mask._series])

If so, OK, this sounds fine to me

jbrockmendel commented 1 year ago

Is the concern about getting e.g. a VaexColumn instead of a PandasColumn? id be fine ruling out mixing-and-matching like that

rgommers commented 1 year ago

This seems addressed, the last implementation in @MarcoGorelli's comment above works, and I agree with @jbrockmendel's answer. Can we close this?

MarcoGorelli commented 1 year ago

yup! thanks