data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Dataframe namespaces #23

Open datapythonista opened 3 years ago

datapythonista commented 3 years ago

In #10, it was discussed that it would be convenient if the dataframe API allowed method chaining. For example:

import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1e6)  # 1 km2 = 1e6 m2
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)'))

This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.

Approaches

Top-level methods

df.sum()
df.astype()

Many of the methods are simply implemented directly as methods of dataframe.

Prefixed methods

df.to_csv()
df.to_parquet()

Some of the methods are grouped with a common prefix.

Accessors

df.str.lower()
df.dt.hour()

An accessor is a property of the dataframe (or series, but assuming only one dataframe class for simplicity) that groups related methods under it.
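In pandas these accessors actually live on Series rather than DataFrame, and some of the grouped names are attributes rather than methods. For example:

import pandas as pd

s = pd.Series(['Antarctica', 'Chile'])
s.str.lower()   # string methods grouped under the .str accessor

ts = pd.Series(pd.to_datetime(['2020-07-30 06:28']))
ts.dt.hour      # datetime properties grouped under the .dt accessor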

Functions

pandas.wide_to_long(df)
pandas.melt(df)

In some cases, functions are used instead of methods.

Functional API

df.apply(func)
df.applymap(func)

pandas also provides a more functional API, where functions can be passed as parameters.
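For instance, with plain pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.apply(np.sum)               # column-wise reduction: a -> 3, b -> 7
df.applymap(lambda x: x * 2)   # element-wise transformation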

Standard API

I guess we will agree that a uniform and consistent API would be better for the standard. That should make things easier to implement, and also provide a more intuitive experience for the user.

Also, I think it would be good if the API could be extended easily. A couple of examples of how pandas can be extended with custom functions:

import pandas as pd

@pd.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def __init__(self, pandas_obj):
        # pandas instantiates the accessor with the dataframe it is accessed on
        self._obj = pandas_obj

    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()
df.apply(my_custom_function)
df.apply(numpy.sum)

Conceptually, I think there are some methods that should be grouped together, not so much by topic as by the API they follow. The clearest example is reductions, and there was some discussion in https://github.com/pydata-apis/dataframe-api/issues/11#issuecomment-644115670.

I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):

Top-level methods

df.sum()

Prefixed methods

df.reduce_sum()

Accessors

df.reduce.sum()

Functions

mod.reductions.sum(df)

mod represents the implementation module (e.g. pandas)

Functional API

df.reduce(mod.reductions.sum)

Personally, my preference is the functional API. I think it's the simplest option that keeps things organized, and the simplest to extend. The main drawback is readability; it may be too verbose. There is the option of allowing a string instead of the function for known functions (e.g. df.reduce('sum')).
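To make that last idea concrete, here is a minimal, purely hypothetical sketch (this toy DataFrame class and its reduce method are not part of pandas or the standard) of a reduce that accepts either a callable or the name of a known reduction:

class DataFrame:
    # Registry of "known" reductions, enabling the string shorthand.
    _known_reductions = {'sum': sum, 'max': max}

    def __init__(self, columns):
        self._columns = columns  # dict of column name -> list of values

    def reduce(self, func, **kwargs):
        # Accept either a callable or a registered name,
        # so df.reduce('sum') is shorthand for passing the function itself.
        if isinstance(func, str):
            func = self._known_reductions[func]
        return {name: func(values, **kwargs)
                for name, values in self._columns.items()}

df = DataFrame({'population': [10, 20, 30], 'area_km2': [1.5, 2.0, 0.5]})
df.reduce('sum')   # {'population': 60, 'area_km2': 4.0}
df.reduce(max)     # {'population': 30, 'area_km2': 2.0}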

Thoughts? Other ideas?

devin-petersohn commented 3 years ago

@datapythonista Great writeup. What do you see as the biggest benefit of defining things as df.reduce(mod.reductions.sum)?

From my perspective, it feels a bit cumbersome, but I can see value in seeing that the shape (dimension) of the data will change (in this example) without needing to understand the details of sum.

I don't think I am leaning in any particular way yet, but I want to understand your thoughts.

datapythonista commented 3 years ago

Thanks @devin-petersohn, I 100% agree with your comments. The syntax is trickier than some of the other options, and we also need to consider extra parameters, for example: df.reduce(mod.reductions.var, ddof=1).

The main advantages I see are:

I think similar things can be achieved with other syntax, for example df.reduce.sum(). But IMHO everything becomes trickier. For users it's somewhat magical what's going on under the hood. Reductions need to be registered, instead of just being used directly. And the syntax itself doesn't imply that all the reductions follow the same signature, so validating and enforcing that feels somewhat magical too.

But to me, the main thing would be to use a divide and conquer approach to reduce the complexity. I see this as similar to what Linux achieves with the X server (and many other components). You're building a complex piece of software, and instead of dealing with all the complexity of desktops yourself, you just create an interface to build on top of. Besides allowing a free market of software to interact with yours, the complexity is reduced dramatically, and the modularity makes changes much simpler. Also, I think we could end up having an independent project for reductions (and maps, like string functions...). If every dataframe library needs the same reductions, and all of them are implemented on top of the buffer protocol, the array API or whatever, it feels like it could be healthier for the ecosystem to have a common dependency (better maintained, fewer bugs, more optimized...).

So, in summary, even if I also find the functional API somewhat cumbersome, I think it's the one that best captures all these goals and advantages, and makes things simpler. Certainly for the implementation, I'd say, but also for users, who are presented with a more structured interface.

rgommers commented 3 years ago

Small comment: the reduce + sum example is a bit odd; it should be reduce + add. E.g. np.sum is np.add.reduce; summing is an aggregation op that implies reduction. reduce should be combined with the element-wise operations (add, multiply, divide, subtract, etc.).
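In plain NumPy, for illustration:

import numpy as np

x = np.array([1, 2, 3, 4])

np.add.reduce(x)   # 10: summing is the reduction of the element-wise add op
np.sum(x)          # 10: same result under the aggregation-style name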

rgommers commented 3 years ago

Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace

A quick count with [s for s in dir(df) if not s.startswith('_')] says there are currently 217 methods + attributes. 300 would probably still be fine, but I agree way more will start to become messy.
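For reference, a runnable version of that count (the exact figure depends on the pandas version installed):

import pandas as pd

df = pd.DataFrame()
# Public (non-underscore) methods and attributes of a DataFrame instance
public = [s for s in dir(df) if not s.startswith('_')]
print(len(public))  # around 217 at the time of this thread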

datapythonista commented 3 years ago

A quick count with [s for s in dir(df) if not s.startswith('_')] says there are currently 217 methods + attributes. 300 would probably still be fine, but I agree way more will start to become messy.

To be more specific about the figure I gave: there are around 200 methods for dataframe, and somewhat more than that for series (many are the same, but I guess the union is probably between 250 and 300). And then there are the accessors: around 50 string methods, and around 50 more for datetime. So, if we implement most of what pandas has, merge series functionality into a single dataframe class, and don't use accessors (everything is a direct method of dataframe), I guess we're talking about that order of magnitude (300 or more).

rgommers commented 3 years ago

Also, I think it would be good that the API can be extended easily. Couple of example of how pandas can be extended with custom functions:

df.apply(my_custom_function): this is not API-extending, I'd say; it's regular use of the apply method.

@pd.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()

This one seems pretty horrifying to me. This is giving end users and consumer libraries the ability to basically monkeypatch the DataFrame object. This is Python so anyone can monkeypatch things anyway (unless things are implemented as a compiled extension, then the language won't let you), but providing an API to do this seems very odd. If Pandas would like to do that it's of course free to do so, but I'd much prefer to not standardize such a pattern.

TomAugspurger commented 3 years ago

pandas' accessors are a bit more structured than monkeypatching since we don't let you overwrite existing methods :)

But agreed that they are not appropriate or necessary for the API standard.


datapythonista commented 3 years ago

Some comments made in the meeting:

maurosilber commented 9 months ago

Is this still being considered?

I'd love an API where only a few methods appear in the autocomplete (df.<TAB>), which would imply hiding most of them from the "main namespace" (that is, using neither the top-level nor the prefixed-methods approach).

I'd vote for the accessor approach:

df.reduce.sum() # or add

as it can include the functional one by creating an accessor with a __call__ method, as pandas.DataFrame.plot does (see the sketch below).

df.reduce(np.sum)

Then, when using autocomplete, we would only see a short list of primitive actions to perform on a DataFrame (reduce, transform, plot, export, etc.).
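A minimal sketch of what such an accessor could look like (the ReduceAccessor and DataFrame classes here are hypothetical, written only to illustrate combining the two styles in the spirit of pandas.DataFrame.plot):

import numpy as np

class ReduceAccessor:
    def __init__(self, df):
        self._df = df

    def __call__(self, func, **kwargs):
        # Functional style: df.reduce(np.sum)
        return {name: func(col, **kwargs) for name, col in self._df._columns.items()}

    def sum(self):
        # Method style: df.reduce.sum()
        return self(np.sum)

    def max(self):
        return self(np.max)

class DataFrame:
    def __init__(self, columns):
        self._columns = {name: np.asarray(values) for name, values in columns.items()}

    @property
    def reduce(self):
        return ReduceAccessor(self)

df = DataFrame({'population': [10, 20, 30]})
df.reduce.sum()      # {'population': 60}
df.reduce(np.max)    # {'population': 30}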

A drawback of this would be when using auto-formatters, which can split the chosen method from its accessor onto a separate line.

Instead of this:

(
    df
    .fill_nan(0)
    .max()
    .sort()
)

we would write this:

(
    df
    .transform.fill_nan(0)
    .reduce.max()
    .transform.sort()
)

but it would be auto-formatted to this:

(
    df
    .transform
    .fill_nan(0)
    .reduce
    .max()
    .transform
    .sort()
)