I agree with you: column expressions feel very natural to use here - although I maintain they are seriously limiting when you want to operate with columns across tables.
I also agree that the direction the standard is evolving is making it increasingly difficult and awkward to use.
I don't wish to be the sole voice of dissent here. If the rest of the team feels column expressions are the way to go, I'm on board. I just observe that, the way things are going, the standard is now focused more on making it easy and efficient for lazy APIs like Polars to adopt it, and increasingly difficult for eager APIs like cuDF to do so. I'm really just hoping we can borrow/steal most of what pandas builds here to support the standard.
I don't agree that this part is true though:
You might expect to be able to do this with:

```python
(
    lineitem.join(supplier, on="a", how="left")
    .filter((lineitem.col("a") + lineitem.col("b")) > 0)
    .assign(lineitem.col("a") * 2)
    .select("a", "d")
)
```
In pandas, I'd use something like `.pipe()`, which does allow method chaining here.
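For example, something like this (a rough sketch with made-up data, just to show the shape of it):

```python
import pandas as pd

lineitem = pd.DataFrame({"a": [1, 2], "b": [-5, 3], "d": [10, 20]})
supplier = pd.DataFrame({"a": [1, 2], "c": [6, 7]})

result = (
    lineitem.merge(supplier, on="a", how="left")
    # .pipe hands the intermediate frame to a callable, so we can refer
    # to its columns without binding it to a temporary variable first
    .pipe(lambda df: df[df["a"] + df["b"] > 0])
    .assign(a=lambda df: df["a"] * 2)
    .loc[:, ["a", "d"]]
)
```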
A Standard which doesn't allow for maximally efficient Polars code will end up being just another abandoned github project which nobody will use
If this means we have to put in some extra work, then we put in some extra work.
I'm really just hoping we can borrow/steal most of what pandas builds here to support the standard.
Assuming you implement all the `Column` methods, then this will be relatively straightforward. You'd just need to add some definitions which defer to the respective `Column` method, e.g.:
```python
def __add__(self, other: Expr | Any) -> Expr:
    # If `other` is also an expression, evaluate both against the same
    # dataframe; otherwise, treat it as a scalar.
    if isinstance(other, Expr):
        return Expr(lambda df: self.call(df) + other.call(df))
    return Expr(lambda df: self.call(df) + other)
```
I agree that using namespace level column expressions as opposed to dataframe specific column expressions for interacting with intermediate tables is a much cleaner experience. My concern has always been that it's an implementation change for many libraries and we had previously pushed back against doing that where the goal was to be compatible with all relevant DataFrame libraries as they stood.
The gaps for these expressions that I'm aware of are that the Pandas implementation is a WIP, Dask doesn't support them, and cuDF doesn't support them.
Telling people "you can use the standard if you want, but it'll be more efficient to use the Polars API directly" is a recipe for people just using Polars and forgetting about the Standard. I'm calling it.
A Standard which doesn't allow for maximally efficient Polars code will end up being just another abandoned github project which nobody will use
I 100% disagree with these statements and they're quite frustrating to read. Polars is one dataframe project out of many. You could make the same argument for every dataframe library that is trying to be supported by this standard and it's not possible to have maximally efficient code for every library due to API and design differences between them. Some food for thought from pypi stats:
I'm sorry to have frustrated you
Let's go over which projects have shown interest in the Standard:
I'm not aware of libraries showing interest in the Standard without being primarily based on interest in Polars. Counterexamples are welcome, happy to admit I'm wrong here.
The standard being a zero-cost abstraction on all major dataframe libraries including Polars is a necessary condition for adoption. I'm not interested in authoring a "pandas minus" standard.
Seeing as you bring download stats up, you may be interested in looking at these numbers over time:
The standard being a zero-cost abstraction on all major dataframe libraries including Polars is a necessary condition for adoption.
Based on the discussion and iteration thus far I'm not sure if this is possible. There are fundamental design differences between the dataframe libraries (lazy vs eager, CPU vs GPU, supported APIs, etc.) that may make it impossible to be zero-cost for all of them.
I'm not interested in authoring a "pandas minus" standard.
Neither am I, but I am interested in authoring something that as many dataframe libraries as possible, both current and future, can adopt as easily as possible.
Let's go over which projects have shown interest in the Standard:
- scikit-learn: Polars and pandas are currently the only two dataframe libraries they support in set_output
- skrub: primarily interested in Polars support
- scikit-lego: primarily interested in Polars support
- temporian: primarily interested in Polars support
- buckeroo: primarily interested in Polars support
I don't think anyone is arguing against Polars being a major dataframe library, or against striving to make the API we design as friendly to it as possible. But the public discussions have always pointed towards supporting dataframe libraries more generally, and the places where the focus has been primarily on Polars seem to have come after connecting with you on Discord or elsewhere?
If we don't wish to more or less treat different dataframe libraries on equal footing then I don't see how my continued involvement in this work is an effective use of my time or efforts. I'll happily step away from this effort and evaluate it when a specification is published at some point in the future.
If you want efficient Polars code, you write expressions, see the user guide
Expressions detract nothing from pandas/cudf and enable things for polars (and probably pyspark too)
If we truly want to treat different dataframe libraries on equal footing, we introduce expressions as a distinct concept from columns
Signing off for the week
These discussions are affecting my mental health and I may need to drop off from the project
Hi all, this is an important discussion. I'll add a few thoughts on this now. We should aim to have a higher-bandwidth conversation about it this week, I think.
Some of my thoughts on the topic and conversation above:
Regarding overhead, the statements above mostly seem a bit too absolute. This is the situation I believe:
My interpretation of some of the discussion above is that @MarcoGorelli is worried that the current design is situation 3(b) for Polars (and that not being the case for any library if we'd go with expressions), while @kkraus14 is emphasizing point 3(a).
I also agree that the direction the standard is evolving is making it increasingly difficult and awkward to use.
That is a concern indeed, the rules for what can and cannot be done with columns seem to get more awkward as we learn more.
I agree that using namespace level column expressions as opposed to dataframe specific column expressions for interacting with intermediate tables is a much cleaner experience. My concern has always been that it's an implementation change for many libraries and we had previously pushed back against doing that where the goal was to be compatible with all relevant DataFrame libraries as they stood.
This is the important part in a nutshell I think. It seems everyone is in agreement that expressions are a nicer design (glossing over the details here). It's the implications for libraries that are the concern. The question I have is how real a problem that really is, given that:
Re the "compatible with all relevant DataFrame libraries", that's not quite accurate, since there is no common denominator that can be targeted. "easy to implement for all relevant libraries" is more the point I think.
The gaps for these expressions that I'm aware of are that the Pandas implementation is a WIP, Dask doesn't support them, and cuDF doesn't support them.
I think "the Pandas implementation" doesn't exist; @jorisvandenbossche did a prototype demo last year, and @MarcoGorelli did some prototyping more recently. However, I'd also say that that's probably a chicken-and-egg type of problem - it will never exist if it's not worked on because it doesn't yet exist. The more interesting questions are whether pandas & other libraries would add it, and how much work it would be. Both are relevant questions; even if it's easy to implement, it's possible maintainers of a library would react like "expressions are a concept that make our library harder to understand without having any added value, and hence we don't like it". On the other hand, if it does lead to a nicer API or new performance optimization opportunities, or something else like that, it's a much better story. So that's probably something to get a wider range of opinions on.
Right - new week, new discussions. Let's try to move things forwards rather than giving up 💪 Thanks Ralf for your comments.
First, let's clear up what we're talking about. An expression is a function from a dataframe to a column. I've heard people here refer to them as "lazy columns", but I don't think that's right. We wouldn't call

```python
def add_one(v: int) -> int:
    return v + 1
```

a lazy integer. It's a function. Likewise, an expression is a function: a `Callable[[DataFrame], Column]`.
What's the effort involved?
Here's an example of how to define it:
```python
def col(self, column_name: str) -> Expr:
    return Expr.from_column_name(column_name)


class Expr:
    def __init__(self, call: Callable[[DataFrame], Column]) -> None:
        # `call` maps a DataFrame to a Column
        self.call = call

    @classmethod
    def from_column_name(cls: type[Expr], column_name: str) -> Expr:
        # The simplest expression: look a column up by name.
        return cls(
            lambda df: Column(
                df.dataframe.loc[:, column_name],
                api_version=df._api_version,
            ),
        )
```
Then, for each method you want to define, you just defer to the `Column` implementation. For example, for `__ge__`:
```python
def __ge__(self, other: Expr | AnyScalar) -> Expr:
    if isinstance(other, Expr):
        return Expr(lambda df: self.call(df).__ge__(other.call(df)))
    return Expr(lambda df: self.call(df).__ge__(other))
```
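Putting the pieces together, usage would look something like this (a sketch assuming the `col` and `Expr` definitions above, with `namespace` standing in for whatever object exposes the `col` method):

```python
expr = namespace.col("a") >= 3   # builds a new Expr; nothing is evaluated yet
result = expr.call(df)           # only now does the comparison run against df
```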
It's a bit laborious, but nothing hard. It can be done in an afternoon whilst listening to a good podcast.
Let's use `select` as an example. Suppose you currently have it as:
```python
def select(self, *columns: str) -> DataFrame:
    if not columns:
        msg = "must select at least one column"
        raise ValueError(msg)
    new_cols = [
        self.dataframe.loc[:, col]
        for col in columns
    ]
    df = pd.concat(new_cols, axis=1, copy=False)
    return self._from_dataframe(df)
```
Here's the change you'd need to make in order for it to take expressions (instead of just strings):
```diff
-def select(self, *columns: str) -> DataFrame:
+def select(self, *columns: str | Expr) -> DataFrame:
     if not columns:
         msg = "must select at least one column"
         raise ValueError(msg)
     new_cols = [
-        self.dataframe.loc[:, col]
+        col.call(self).column if isinstance(col, Expr) else self.dataframe.loc[:, col]
         for col in columns
     ]
     df = pd.concat(new_cols, axis=1, copy=False)
     return self._from_dataframe(df)
```
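With that change, a call can mix plain strings and expressions freely. For example (hypothetical, and assuming `Expr.__mul__` is defined analogously to `__add__` above):

```python
# "d" is passed through by name; the expression is evaluated against df
df.select("d", namespace.col("a") * 2)
```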
Note that this is a very simplified version of Polars expressions (which can have multiple roots and multiple outputs). I'm not suggesting that everyone has to reimplement Polars. I'm only suggesting a simplified definition of "expression" which, if limited to the standard's API, is pretty simple to implement. There are some subtleties to discuss, but I'm purposefully leaving them out of the present discussion.
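For a sense of what the simplified version leaves out, here is the kind of Polars expression it wouldn't cover (a sketch assuming a recent Polars version):

```python
import polars as pl

pl.col(pl.Int64, pl.Float64) * 2    # one expression, multiple output columns
pl.sum_horizontal("a", "b", "c")    # multiple roots feeding a single output
```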
The purpose of this post is to check whether people would at least be open to discussing it. Based on comments above, the answer appears to be "no", but the benefits of what I'm suggesting seem clear enough to me that I won't be deterred.
The current design is pretty awkward because we're trying to squash different concepts into just `Column`.
Personally, I'd be happy to get rid of a lot of the current awkwardness, such as "you can't compare columns which originated from different dataframes", if we could just separate the concepts. It would be a lot simpler to teach then. If it's simpler to teach, it's more appealing to developers, and ends up benefitting the ecosystem more.
Hi all,
I've decided to leave the project. This isn't a decision taken lightly, I've been considering it for some time - after yesterday's call, I slept on it, and finally have resolved to move on. Unfortunately I'm past my tipping point
This is nothing personal. I'm just physically and mentally unable to continue.
I wish you all the best and hope we can stay on good terms
I'd just like to make another case for the expressions syntax, and to see if a simpler version of the proposal I'd previously put together might be acceptable.
The current syntax isn't sufficient for new libraries
Great Tables (a new project by Posit) has recently introduced native Polars support - the way they did it really caught my attention. They didn't repeat their pandas support but with Polars functions - instead, they really make expressions a main focus: https://posit-dev.github.io/great-tables/blog/polars-styling/ . The whole thing's worth reading, but I really want to draw your attention to
I was expecting this to happen, and I expect it'll happen a whole load more. If new libraries lean in to the expressions syntax, then the Standard will be dead on arrival.
If we want to promote the Standard, we need to keep up with the times. This requires extra work, but so does anything worthwhile.
The current rules break method chaining
Let's take the following:

- join `lineitem` and `supplier` on `'a'` (left join)
- keep rows where column `'a'` plus column `'b'` is greater than 0
- multiply column `'a'` by 2, and only keep that and column `'d'`
You might expect to be able to do this with:

```python
(
    lineitem.join(supplier, on="a", how="left")
    .filter((lineitem.col("a") + lineitem.col("b")) > 0)
    .assign(lineitem.col("a") * 2)
    .select("a", "d")
)
```

However, it will raise, because `lineitem.col('a')` was derived from a different dataframe than `lineitem.join(supplier, on='a', how='left')`, and that's not allowed. (Yes, I'm aware that you can work around this with temporary variables, but my point is: method chaining is very popular among dataframe users and devs - are we sure we don't want to support it?)

With expressions, though, there's no issue:
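A sketch of how that could look (assuming a namespace-level `col` function, as in Polars):

```python
(
    lineitem.join(supplier, on="a", how="left")
    .filter((col("a") + col("b")) > 0)
    .select(col("a") * 2, "d")
)
```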
You also don't need the extra `assign` statement.

It's not a zero-cost abstraction
The current syntax is also not a zero-cost abstraction on Polars - trying to use only two objects (`Column`, `DataFrame`) to represent four (`Series`, `Expr`, `DataFrame`, `LazyFrame`) means that the resulting code isn't going to be as efficient as it could be: evaluating columns one at a time against a materialised dataframe is less efficient than handing Polars a single lazy query which it can optimise as a whole, and the current API, in the `persist`ed case, resolves to the former (see the sketch below). I don't see a way out of this, unfortunately.

Telling people "you can use the standard if you want, but it'll be more efficient to use the Polars API directly" is a recipe for people just using Polars and forgetting about the Standard. I'm calling it.
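To illustrate the kind of difference meant above, in plain Polars (a rough sketch, not the exact snippets):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})

# column-at-a-time: each step eagerly materialises an intermediate Series
intermediate = df["a"] + df["b"]
eager = df.with_columns(intermediate.alias("x"))

# expressions: Polars sees the whole query and can optimise it as one plan
lazy = df.lazy().with_columns((pl.col("a") + pl.col("b")).alias("x")).collect()
```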
The way forwards

We don't necessarily need to separate `DataFrame` from `LazyFrame`. But I'm once again making the case for `Expr` being separate from `Column`.

@shwina @kkraus14 if I made a simpler version of this summer's proposal, would you be open to reconsidering this? I'm tagging you two specifically because, as far as I remember, everyone else was positive about it.
Alternatives
We need to do something here - I don't want my name on a standard which is just a "pandas minus".