data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Rename entrypoint to `__consortium_api__`? #323

Closed: MarcoGorelli closed this 9 months ago

MarcoGorelli commented 10 months ago

If https://github.com/data-apis/dataframe-api/pull/308 goes in, then the return value of Column.get_value will change: it will no longer be a Python scalar, but a Scalar.

This means I'll have to update the tests in pandas/Polars:

https://github.com/pandas-dev/pandas/blob/f777e67d2b29cda5b835d30c855b633269f5e8e8/pandas/tests/test_downstream.py#L340-L344

I'll change it to something much simpler that realistically will never break, like asserting something about result.name

If I'm going to have to change things upstream, I'd like to take the chance to rename the entrypoint

__dataframe_consortium_standard__ is just...long. Originally we'd suggested __dataframe_standard__, but Brock correctly pointed out that this has normative connotations

We're starting to get positive responses (see https://github.com/koaning/scikit-lego/pull/597, https://github.com/skrub-data/skrub/pull/786), so the time to make changes is running out

My hope is that this would then need to be the last upstream update. The rest, we can handle here / in dataframe-api-compat

MarcoGorelli commented 10 months ago

Slightly dreading starting the conversation though, and the downside is that the minimum pandas version supported by the standard would have to rise to 2.2

An alternative could be that in dataframe-api-compat I just make a decorator, so people can write df-agnostic functions like this:

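from __future__ import annotations  # keeps the `DataFrame` annotation below unevaluated at runtime

from typing import Any

# Decorator proposed in this comment; not an existing dataframe-api-compat API
from dataframe_api_compat import dataframe_api

@dataframe_api(api_version='2023.11-beta')
def my_dataframe_agnostic_function(df: DataFrame) -> Any:
    # Inside the function, `df` is a standard-compliant DataFrame
    for column_name in df.column_names:
        new_column = df.col(column_name)
        # Standardize: subtract the mean, divide by the standard deviation
        new_column = (new_column - new_column.mean()) / new_column.std()
        df = df.assign(new_column.rename(f'{column_name}_scaled'))

    # Hand back the underlying native dataframe
    return df.dataframe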

Then we don't need to bother pandas, and this looks pretty clean anyway
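
For illustration only, here is a minimal sketch of how a decorator like that could work. Everything in it is an assumption rather than existing dataframe-api-compat code: the `dataframe_api` function is defined locally, and `ToyNativeFrame` / `ToyStandardFrame` are hypothetical stand-ins so the example runs without pandas or Polars. The only thing it relies on from the discussion is that the incoming object exposes `__dataframe_consortium_standard__` with a keyword-only `api_version`.

```python
import functools
from typing import Any, Callable


def dataframe_api(api_version: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator: convert the first argument to a
    standard-compliant dataframe before calling the wrapped function."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(df: Any, *args: Any, **kwargs: Any) -> Any:
            # Go through the entry point discussed in this issue.
            compliant = df.__dataframe_consortium_standard__(api_version=api_version)
            return func(compliant, *args, **kwargs)

        return wrapper

    return decorator


class ToyStandardFrame:
    """Stand-in for a standard-compliant dataframe (not a real implementation)."""

    def __init__(self, data: dict, api_version: Any) -> None:
        self.dataframe = data  # the native object, as in `df.dataframe` above
        self.api_version = api_version
        self.column_names = list(data)


class ToyNativeFrame:
    """Stand-in for a library dataframe that only offers the entry point."""

    def __init__(self, data: dict) -> None:
        self.data = data

    def __dataframe_consortium_standard__(self, *, api_version: Any = None) -> ToyStandardFrame:
        return ToyStandardFrame(self.data, api_version)


@dataframe_api(api_version="2023.11-beta")
def list_columns(df: Any) -> list:
    # `df` is already the standard-compliant wrapper here.
    return df.column_names
```

With this shape, callers pass their native dataframe and never touch the entry point themselves; the cost is that the decorator lives in a separate package rather than in the dataframe library.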

kkraus14 commented 10 months ago

Folks may not want to take on the dataframe-api-compat package as a dependency, even given it's small, pure Python, and vendorable.

I have no objections to the name change other than it may be a bit confusing when working across arrays, dataframes, and other future types that may have efforts to standardize APIs.

We should probably also have our spec include this dunder method as part of the DataFrame, Column, and maybe Scalar classes?

MarcoGorelli commented 10 months ago

It's already mentioned here:

https://github.com/data-apis/dataframe-api/blob/7be00b6082f287817853c5b16e0dd12baded7763/spec/purpose_and_scope.md#L261-L276

I don't think DataFrame / Column / Scalar need it, this is just the entry-point for going from "not-necessarily-standard-compliant" to "standard-compliant"

If you have a DataFrame as defined in our spec, it's already standard-compliant, and you'd have no need to call __dataframe_consortium_standard__ on it

kkraus14 commented 10 months ago

> I don't think DataFrame / Column / Scalar need it, this is just the entry-point for going from "not-necessarily-standard-compliant" to "standard-compliant"
>
> If you have a DataFrame as defined in our spec, it's already standard-compliant, and you'd have no need to call __dataframe_consortium_standard__ on it

If I get an arbitrary dataframe as input and I want to confirm it's standard-compliant, how do I do that today? In my mind the easiest way would be to have standard-compliant classes implement __dataframe_consortium_standard__ that return self.
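
A rough sketch of that suggestion, using hypothetical toy classes (`NativeFrame`, `CompliantFrame`) and a hypothetical helper `to_compliant` rather than anything from the spec: if the compliant class also implements the entry point and returns itself, callers can convert unconditionally without first checking what they were given.

```python
from typing import Any, Optional


class CompliantFrame:
    """Toy stand-in for a standard-compliant dataframe (hypothetical)."""

    def __init__(self, data: dict) -> None:
        self.data = data

    def __dataframe_consortium_standard__(
        self, *, api_version: Optional[str] = None
    ) -> "CompliantFrame":
        # Already compliant, so hand back the same object.
        return self


class NativeFrame:
    """Toy stand-in for a library dataframe that is not itself compliant."""

    def __init__(self, data: dict) -> None:
        self.data = data

    def __dataframe_consortium_standard__(
        self, *, api_version: Optional[str] = None
    ) -> CompliantFrame:
        # Wrap the native data in the compliant type.
        return CompliantFrame(self.data)


def to_compliant(df: Any, api_version: Optional[str] = None) -> CompliantFrame:
    # One code path for both cases: no hasattr check needed.
    return df.__dataframe_consortium_standard__(api_version=api_version)
```

Calling `to_compliant` on a `NativeFrame` yields a `CompliantFrame`, and calling it again on the result returns the very same object.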

MarcoGorelli commented 10 months ago

there's __dataframe_namespace__ for that

kkraus14 commented 10 months ago

> there's __dataframe_namespace__ for that

That returns the namespace and not a compliant dataframe object. So the code would end up looking like:

def get_compliant_dataframe(df):
    if hasattr(df, "__dataframe_namespace__"):
        return df
    else:
        return df.__dataframe_consortium_standard__(...)

It feels a bit clunky but I guess it's not too bad?

MarcoGorelli commented 9 months ago

> It feels a bit clunky but I guess it's not too bad?

yeah, and as Ralf said, in the end, people will probably just write their own helper functions

might as well close then, this isn't too bad