data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Expressions - another attempt #346

Closed MarcoGorelli closed 8 months ago

MarcoGorelli commented 9 months ago

I'd just like to make another case for the expressions syntax, and to see if a simpler version of the proposal I'd previously put together might be acceptable.

The current syntax isn't sufficient for new libraries

Great Tables (a new project by Posit) has recently introduced native Polars support - the way they did it really caught my attention. They didn't just replicate their pandas support with Polars functions - instead, they made expressions a main focus: https://posit-dev.github.io/great-tables/blog/polars-styling/ . The whole thing's worth reading, but I really want to draw your attention to:

As it turns out, polars expressions make styling tables very straightforward. The same polars code that you would use to select or filter combines with Great Tables to highlight, circle, or bolden text.

In this post, I’ll show how Great Tables uses polars expressions to make delightful tables, like the one below.

I was expecting this to happen, and I expect it'll happen a whole lot more. If new libraries lean into the expressions syntax, then the Standard will be dead on arrival.

If we want to promote the Standard, we need to keep up with the times. This requires extra work, but so does anything worthwhile.

The current rules break method chaining

Let's take the following task: starting from lineitem, left-join supplier on "a", keep the rows where the sum of columns "a" and "b" is positive, double column "a", and then select columns "a" and "d".

You might expect to be able to do this with:

(
    lineitem.join(supplier, on="a", how="left")
    .filter((lineitem.col("a") + lineitem.col("b")) > 0)
    .assign(lineitem.col("a") * 2)
    .select("a", "d")
)

However, it will raise, because lineitem.col('a') was derived from a different dataframe than lineitem.join(supplier, on='a', how='left'), and that's not allowed. (Yes, I'm aware that you can work around this with temporary variables - see the sketch below - but my point is: method chaining is very popular among dataframe users and devs - are we sure we don't want to support it?)
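
For reference, the temporary-variable workaround under the current rules would look something like this (a sketch, reusing the names from the example above):

joined = lineitem.join(supplier, on="a", how="left")
filtered = joined.filter((joined.col("a") + joined.col("b")) > 0)
result = filtered.assign(filtered.col("a") * 2).select("a", "d")

Every intermediate dataframe needs its own name so that columns can be derived from it - which is exactly what breaks the chain.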

With expressions, though, there's no issue:

(
    lineitem.join(supplier, on="a", how="left")
    .filter((pdx.col("a") + pdx.col("b")) > 0)
    .select(pdx.col("a") * 2, pdx.col("d"))
)

You also don't need the extra assign call.

It's not a zero-cost abstraction

The current syntax is also not a zero-cost abstraction on Polars - trying to use only two objects (Column, DataFrame) to represent four (Series, Expr, DataFrame, LazyFrame) means that the resulting code isn't going to be as efficient as it could be:

df: pl.DataFrame
df.filter((df['a'] + df['b']) > 0)

is less efficient than

df: pl.DataFrame
df.filter((pl.col('a') + pl.col('b')) > 0)

and the current API, in the persisted case, resolves to the first one: the mask is computed eagerly as a standalone column, so Polars can't optimise the whole query. I don't see a way around this, unfortunately.

Telling people "you can use the standard if you want, but it'll be more efficient to use the Polars API directly" is a recipe for people just using Polars and forgetting about the Standard. I'm calling it.

The way forwards

We don't necessarily need to separate DataFrame from LazyFrame. But I'm once again making the case for Expr being separate from Column.

@shwina @kkraus14 if I made a simpler version of this summer's proposal, would you be open to reconsidering this? I'm tagging you two specifically because, as far as I remember, everyone else was positive about it.

Alternatives

We need to do something here - I don't want my name on a standard which is just a "pandas minus"

shwina commented 9 months ago

I agree with you: column expressions feel very natural to use here - although I maintain they are seriously limiting when you want to operate with columns across tables.

I also agree that the direction the standard is evolving is making it increasingly difficult and awkward to use.

I don't wish to be the sole voice of dissent here. If the rest of the team feels column expressions are the way to go, I'm on board. I just observe that, the way things are going, the standard is now focused more on making it easy and efficient for lazy APIs like Polars to adopt it, and increasingly more difficult for eager APIs like cuDF to do so. I'm really just hoping we can borrow/steal most of what pandas builds here to support the standard.


I don't agree that this part is true though:

You might expect to be able to do this with:

(
    lineitem.join(supplier, on="a", how="left")
    .filter((lineitem.col("a") + lineitem.col("b")) > 0)
    .assign(lineitem.col("a") * 2)
    .select("a", "d")
)

In pandas, I'd use something like .pipe() - which does allow method chaining here.
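
A hypothetical sketch of what that could look like here - assuming the standard's dataframes grew a pandas-style .pipe(func) that simply calls func on the intermediate dataframe:

(
    lineitem.join(supplier, on="a", how="left")
    .pipe(lambda df: df.filter((df.col("a") + df.col("b")) > 0))
    .pipe(lambda df: df.assign(df.col("a") * 2))
    .select("a", "d")
)

Inside each lambda, columns are derived from the intermediate dataframe itself, so the "different dataframe" restriction doesn't apply.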

MarcoGorelli commented 9 months ago

A Standard which doesn't allow for maximally efficient Polars code will end up being just another abandoned github project which nobody will use

If this means we have to put in some extra work, then we put in some extra work

I'm really just hoping we can borrow/steal most of what pandas builds here to support the standard.

Assuming you implement all the Column methods, then this will be relatively straightforward. You'd just need to add some definitions which defer to the respective Column method, e.g.

    def __add__(self, other: Expr | Any) -> Expr:
        # Defer to Column.__add__: nothing is evaluated until the
        # expression is called with a concrete dataframe.
        if isinstance(other, Expr):
            return Expr(lambda df: self.call(df) + other.call(df))
        return Expr(lambda df: self.call(df) + other)
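
To illustrate (with pdx as the standard-compliant namespace from the earlier examples), such deferred definitions compose into nested closures:

expr = pdx.col("a") + pdx.col("b") + 1  # builds an Expr; nothing is evaluated yet
column = expr.call(df)                  # only here does a concrete Column materialize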

kkraus14 commented 9 months ago

I agree that using namespace level column expressions as opposed to dataframe specific column expressions for interacting with intermediate tables is a much cleaner experience. My concern has always been that it's an implementation change for many libraries and we had previously pushed back against doing that where the goal was to be compatible with all relevant DataFrame libraries as they stood.

The gaps for these expressions that I'm aware of are that the Pandas implementation is a WIP, Dask doesn't support them, and cuDF doesn't support them.

Telling people "you can use the standard if you want, but it'll be more efficient to use the Polars API directly" is a recipe for people just using Polars and forgetting about the Standard. I'm calling it.

A Standard which doesn't allow for maximally efficient Polars code will end up being just another abandoned github project which nobody will use

I 100% disagree with these statements and they're quite frustrating to read. Polars is one dataframe project out of many. You could make the same argument for every dataframe library that this standard is trying to support, and it's not possible to have maximally efficient code for every library, due to API and design differences between them. Some food for thought from PyPI stats:

MarcoGorelli commented 9 months ago

I'm sorry to have frustrated you

Let's go over which projects have shown interest in the Standard:

  • scikit-learn: Polars and pandas are currently the only two dataframe libraries they support in set_output
  • skrub: primarily interested in Polars support
  • scikit-lego: primarily interested in Polars support
  • temporian: primarily interested in Polars support
  • buckeroo: primarily interested in Polars support

I'm not aware of libraries showing interest in the Standard without being primarily based on interest in Polars. Counterexamples are welcome - happy to admit I'm wrong here.

The standard being a zero-cost abstraction on all major dataframe libraries including Polars is a necessary condition for adoption. I'm not interested in authoring a "pandas minus" standard.

Seeing as you bring download stats up, you may be interested in looking at these numbers over time:

[screenshot: download counts over time]

kkraus14 commented 9 months ago

The standard being a zero-cost abstraction on all major dataframe libraries including Polars is a necessary condition for adoption.

Based on the discussion and iteration thus far, I'm not sure this is possible. There are fundamental design differences between the different dataframe libraries (lazy vs eager, CPU vs GPU, supported APIs, etc.) that may make it impossible to be zero-cost for all of them.

I'm not interested in authoring a "pandas minus" standard.

Neither am I, but I am interested in authoring something that as many dataframe libraries as possible, both current and future, can adopt as easily as possible.

Let's go over which projects have shown interest in the Standard:

  • scikit-learn: Polars and pandas are currently the only two dataframe libraries they support in set_output
  • skrub: primarily interested in Polars support
  • scikit-lego: primarily interested in Polars support
  • temporian: primarily interested in Polars support
  • buckeroo: primarily interested in Polars support

I don't think anyone is arguing against Polars being a major dataframe library, and we should strive to make sure that the work we do in designing a new API is as friendly to it as possible. But from the public discussions it seems like the effort has always pointed towards supporting dataframe libraries more generally, and the places where it's primarily about Polars have come after connecting with you on Discord or elsewhere?

If we don't wish to more or less treat different dataframe libraries on equal footing then I don't see how my continued involvement in this work is an effective use of my time or efforts. I'll happily step away from this effort and evaluate it when a specification is published at some point in the future.

MarcoGorelli commented 9 months ago

If you want efficient Polars code, you write expressions - see the user guide

Expressions detract nothing from pandas/cudf and enable things for polars (and probably pyspark too)

If we truly want to treat different dataframe libraries on equal footing, we introduce expressions as a distinct concept from columns

MarcoGorelli commented 9 months ago

Signing off for the week

These discussions are affecting my mental health and I may need to drop off from the project

rgommers commented 9 months ago

Hi all, this is an important discussion. I'll add a few thoughts on this now. We should aim to have a higher-bandwidth conversation about it this week, I think.

Some of my thoughts on the topic and conversation above:

  1. No one wants to simply copy pandas or a "pandas minus", that's been clear from the start (all the way back to gh-4).
  2. The download stats are probably more distracting than helpful. There are a number of widely used libraries with different APIs and internals; some are currently more popular, some have more momentum, etc. I think we'd like a design that can work well for as many of them as possible.
  3. Regarding overhead, the statements above mostly seem a bit too absolute. This is the situation I believe:

    • (a) on the one hand it's clearly not possible to be strictly "maximally efficient" or "zero overhead", since anywhere a new layer is introduced to bridge differences in APIs or semantics, there will always be at least some extra method call or dispatch overhead.
    • (b) on the other hand, if the overhead is not minor at the application level, but leads to different performance scaling compared to the native performance of the implementing library, or makes certain code multiple times slower by preventing the library from optimizations that it would otherwise be able to do, then that clearly is a problem.
    • (c) we need something that is a clear win for dataframe-consuming code authors. If they see a single code path with only negligible/minor performance overhead, and they can avoid two or more separate code paths to support multiple dataframe libraries - great. If they'd vote with their feet and implement separate code paths anyway, then we've done something not so useful.

My interpretation of some of the discussion above is that @MarcoGorelli is worried that the current design is situation 3(b) for Polars (and that this would not be the case for any library if we went with expressions), while @kkraus14 is emphasizing point 3(a).

I also agree that the direction the standard is evolving is making it increasingly difficult and awkward to use.

That is a concern indeed, the rules for what can and cannot be done with columns seem to get more awkward as we learn more.

I agree that using namespace level column expressions as opposed to dataframe specific column expressions for interacting with intermediate tables is a much cleaner experience. My concern has always been that it's an implementation change for many libraries and we had previously pushed back against doing that where the goal was to be compatible with all relevant DataFrame libraries as they stood.

This is the important part in a nutshell I think. It seems everyone is in agreement that expressions are a nicer design (glossing over the details here). It's the implications for libraries that are the concern. The question I have is how real a problem that really is, given that:

  1. It's been clear for quite a while that this API is not going to end up as the main user-facing API in the main namespace of each library (that would either mean large bc-breaking changes or be a "pandas minus", and those have been ruled out).
  2. If it's a separate API/namespace, I think Marco is making the argument that it's actually pretty easy to implement as thin wrappers around the current API, and he's willing to do a lot of the work (partially already done during prototyping).

Re the "compatible with all relevant DataFrame libraries", that's not quite accurate, since there is no common denominator that can be targeted. "easy to implement for all relevant libraries" is more the point I think.

The gaps for these expressions that I'm aware of are that the Pandas implementation is a WIP, Dask doesn't support them, and cuDF doesn't support them.

I think "the Pandas implementation" doesn't exist; @jorisvandenbossche did a prototype demo last year, and @MarcoGorelli did some prototyping more recently. However, I'd also say that that's probably a chicken-and-egg type of problem - it will never exist if it's not worked on because it doesn't yet exist. The more interesting questions are whether pandas & other libraries would add it, and how much work it would be. Both are relevant questions; even if it's easy to implement, it's possible maintainers of a library would react like "expressions are a concept that make our library harder to understand without having any added value, and hence we don't like it". On the other hand, if it does lead to a nicer API or new performance optimization opportunities, or something else like that, it's a much better story. So that's probably something to get a wider range of opinions on.

MarcoGorelli commented 9 months ago

Right - new week, new discussions. Let's try to move things forwards rather than giving up 💪 Thanks Ralf for your comments.

First, let's clear up what we're talking about. An expression is a function from a dataframe to a column. I've heard people here refer to them as "lazy columns", but I don't think that's right. We wouldn't call

def add_one(v: int) -> int:
    return v + 1

a lazy integer. It's a function. Likewise, an expression is a function. A Callable[[DataFrame], Column].
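
To make that concrete, here's an expression written as a plain function (a sketch; a_plus_b is a hypothetical name, and df.col is the current standard API):

def a_plus_b(df: DataFrame) -> Column:
    # This is all an expression is: a recipe that turns a
    # dataframe into a column, evaluated on demand.
    return df.col("a") + df.col("b")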

What's the effort involved?

Expr class

Here's an example of how to define it:

def col(self, column_name: str) -> Expr:
    # Namespace-level entry point: build an expression from a column name.
    return Expr.from_column_name(column_name)

class Expr:
    def __init__(self, call: Callable[[DataFrame], Column]) -> None:
        # An expression wraps a single function from DataFrame to Column.
        self.call = call

    @classmethod
    def from_column_name(cls: type[Expr], column_name: str) -> Expr:
        # Evaluating the expression extracts the named column from
        # whichever dataframe it is eventually applied to.
        return cls(
            lambda df: Column(
                df.dataframe.loc[:, column_name],
                api_version=df._api_version,
            ),
        )

Then, for each method you want to define, you just defer to the Column implementation. For example, for __ge__:

    def __ge__(self, other: Expr | AnyScalar) -> Expr:
        if isinstance(other, Expr):
            return Expr(lambda df: self.call(df).__ge__(other.call(df)))
        return Expr(lambda df: self.call(df).__ge__(other))

It's a bit laborious, but nothing hard. It can be done in an afternoon whilst listening to a good podcast.

Expr-consuming function

Let's use select as an example. Suppose you currently have it as:

    def select(self, *columns: str) -> DataFrame:
        if not columns:
            msg = "must select at least one column"
            raise ValueError(msg)
        new_cols = [
            self.dataframe.loc[:, col]
            for col in columns
        ]
        df = pd.concat(new_cols, axis=1, copy=False)
        return self._from_dataframe(df)

Here's the change you'd need to make in order for it to take expressions (instead of just strings):

-    def select(self, *columns: str) -> DataFrame:
+    def select(self, *columns: str | Expr) -> DataFrame:
         if not columns:
             msg = "must select at least one column"
             raise ValueError(msg)
         new_cols = [
-            self.dataframe.loc[:, col]
+            col.call(self).column if isinstance(col, Expr) else self.dataframe.loc[:, col]
             for col in columns
         ]
         df = pd.concat(new_cols, axis=1, copy=False)
         return self._from_dataframe(df)

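With that change, both call styles work (a usage sketch, with pdx as the standard-compliant namespace from the earlier examples):

df.select("a", "d")                        # plain column names, as before
df.select(pdx.col("a") * 2, pdx.col("d"))  # expressions, evaluated against df
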
Note that this is a very simplified version of Polars expressions (which can have multiple roots and multiple outputs). I'm not suggesting that everyone has to reimplement Polars. I'm only suggesting a simplified definition of "expression" which, if limited to the standard's API, is pretty simple to implement. There are some subtleties to discuss, but I'm purposefully leaving them out of the present discussion.

The purpose of this post is to check whether people would at least be open to discussing it. Based on the comments above, the answer appears to be "no", but the benefits of what I'm suggesting seem clear enough to me that I won't be deterred.

Simplifications

The current design is pretty awkward because we're trying to squash different concepts into just Column.

Personally, I'd be happy to get rid of a lot of the current awkwardness, such as "you can't compare columns which originated from different dataframes", if we could just separate the concepts. It would be a lot simpler to teach then. If it's simpler to teach, it's more appealing to developers, and ends up benefitting the ecosystem more.

MarcoGorelli commented 9 months ago

Hi all,

I've decided to leave the project. This isn't a decision taken lightly - I've been considering it for some time. After yesterday's call, I slept on it, and have finally resolved to move on. Unfortunately, I'm past my tipping point.

This is nothing personal. I'm just physically and mentally unable to continue.

I wish you all the best and hope we can stay on good terms