Closed: MarcoGorelli closed this issue 9 months ago
Slightly dreading starting the conversation though, and the downside is that the minimum pandas version supported by the standard would have to rise to 2.2
An alternative could be that in dataframe-api-compat I just make a decorator, so people can write df-agnostic functions like this:
```python
from typing import Any

from dataframe_api_compat import dataframe_api

@dataframe_api(api_version='2023.11-beta')
def my_dataframe_agnostic_function(df: DataFrame) -> Any:
    for column_name in df.column_names:
        new_column = df.col(column_name)
        new_column = (new_column - new_column.mean()) / new_column.std()
        df = df.assign(new_column.rename(f'{column_name}_scaled'))
    return df.dataframe
```
Then we don't need to bother pandas, and this looks pretty clean anyway
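For illustration, here is a minimal sketch of what such a decorator could do, assuming the input object exposes the `__dataframe_consortium_standard__` entrypoint discussed in this thread. The toy classes and the body of `dataframe_api` are purely hypothetical, not the real dataframe-api-compat implementation:

```python
# Hypothetical sketch -- NOT the actual dataframe-api-compat code.
# The decorator converts the incoming object to its standard-compliant
# wrapper (via __dataframe_consortium_standard__) before calling the
# user's function.
from functools import wraps

def dataframe_api(api_version):
    def decorator(func):
        @wraps(func)
        def wrapper(df, *args, **kwargs):
            if hasattr(df, "__dataframe_consortium_standard__"):
                df = df.__dataframe_consortium_standard__(api_version=api_version)
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

# Toy stand-ins to exercise the decorator:
class ToyStandardFrame:
    def __init__(self, native):
        self.dataframe = native  # .dataframe unwraps to the native object

class ToyNativeFrame:
    def __dataframe_consortium_standard__(self, api_version=None):
        return ToyStandardFrame(self)

@dataframe_api(api_version='2023.11-beta')
def passthrough(df):
    return df.dataframe  # unwrap back to the native dataframe

native = ToyNativeFrame()
assert passthrough(native) is native
```

The point is only that the conversion happens once, at the boundary, so the decorated function body can be written purely against the standard API.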
Folks may not want to take on the dataframe-api-compat package as a dependency, even given that it's small, pure Python, and vendorable.
I have no objections to the name change other than it may be a bit confusing when working across arrays, dataframes, and other future types that may have efforts to standardize APIs.
We should probably also have our spec include this dunder method as part of the `DataFrame`, `Column`, and maybe `Scalar` classes?
It's already mentioned here:
I don't think `DataFrame` / `Column` / `Scalar` need it; this is just the entrypoint for going from "non-necessarily-standard-compliant" to "standard-compliant". If you have a `DataFrame` as defined in our spec, it's already standard-compliant, and you'd have no need to call `__dataframe_consortium_standard__` on it.
If I get an arbitrary dataframe as input and I want to confirm it's standard-compliant, how do I do that today? In my mind the easiest way would be to have standard-compliant classes implement `__dataframe_consortium_standard__` so that it returns `self`.
there's `__dataframe_namespace__` for that
That returns the namespace and not a compliant dataframe object. So the code would end up looking like:

```python
def get_compliant_dataframe(df):
    if hasattr(df, "__dataframe_namespace__"):
        return df
    else:
        return df.__dataframe_consortium_standard__(...)
```

It feels a bit clunky but I guess it's not too bad?
yeah, and as Ralf said, in the end, people will probably just write their own helper functions
might as well close then, this isn't too bad
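As a sketch of the kind of helper people might write themselves: the toy classes below are purely illustrative stand-ins (a real compliant object would come from an implementation such as pandas or Polars), but they show how the `__dataframe_namespace__` check and the `__dataframe_consortium_standard__` conversion entrypoint fit together:

```python
# Toy illustration of the helper-function pattern. A standard-compliant
# object advertises __dataframe_namespace__, while a native (non-compliant)
# one offers __dataframe_consortium_standard__ as the conversion entrypoint.
class ToyCompliantFrame:
    def __dataframe_namespace__(self):
        return object()  # would return the standard namespace

class ToyPandasLikeFrame:
    def __dataframe_consortium_standard__(self, api_version=None):
        return ToyCompliantFrame()

def get_compliant_dataframe(df):
    if hasattr(df, "__dataframe_namespace__"):
        return df  # already standard-compliant: return it unchanged
    return df.__dataframe_consortium_standard__(api_version="2023.11-beta")

assert isinstance(get_compliant_dataframe(ToyPandasLikeFrame()), ToyCompliantFrame)
compliant = ToyCompliantFrame()
assert get_compliant_dataframe(compliant) is compliant
```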
If https://github.com/data-apis/dataframe-api/pull/308 goes in, then the return value of `Column.get_value` will change. It will no longer be a Python scalar, but a `Scalar`.
This means I'll have to update the tests in pandas/Polars:
https://github.com/pandas-dev/pandas/blob/f777e67d2b29cda5b835d30c855b633269f5e8e8/pandas/tests/test_downstream.py#L340-L344
I'll change it to something much simpler that realistically will never break, like asserting something about `result.name`.
If I'm going to have to change things upstream, I'd like to take the chance to rename the entrypoint. `__dataframe_consortium_standard__` is just... long. Originally we'd suggested `__dataframe_standard__`, but Brock correctly pointed out that this has normative connotations. We're starting to get positive responses (see https://github.com/koaning/scikit-lego/pull/597, https://github.com/skrub-data/skrub/pull/786), so the time to make changes is running out.
My hope is that this would then need to be the last upstream update. The rest, we can handle here / in `dataframe-api-compat`.