Open ritchie46 opened 1 month ago
Thanks for opening this issue and apologies for the lack of engagement on https://github.com/coiled/benchmarks/issues/1498 so far.
Our intention is not to publish any unfair results and we are interested in comparing engines on equal footing to the best of our ability.
You are mentioning a couple of different topics and some of them require further discussion. From what I can tell, you are raising these problems
I will grant you that 1.) is indeed not correct in some situations. This wasn't done intentionally and we still have to address this. I think https://github.com/coiled/benchmarks/issues/1498 describes this problem already such that I will not open a separate issue.
I think the two other issues require a bit further debate.
Dask is using a leftsemi
statement in three queries, namely query 4, 18 and 20
Let's talk about Q4 in detail since it's the simplest example. The functional definition of the query according to the TPC-H specification is
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '[DATE]'
and o_orderdate < date '[DATE]' + interval '3' month
and exists (
select
*
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
The leftsemi
join method is used to implement the exist
statement on a correlated subquery (see the where
clause l_orderkey = o_orderkey
) in the above SQL expression. Translating this to a DataFrame is a little difficult since SQL is a declarative language and DataFrames are typically imperative.
My attempt to put into words what the above query is describing...
"Take every row of left
(i.e. order
) for which there exists a matching (l_orderkey = o_orderkey) row in right
(i.e. lineitem
) which also satisfies the condition l_commitdate < l_receiptdate
".
This is pretty much the definition of a leftsemi
join with a filter applied, isn't it?
Comparing this to how polars is implementing this query, namely joining the table with a simple left
join and subsequently dropping duplicates on ["o_orderpriority", "l_orderkey"]
is semantically a different operation. If the left table already included duplicates, this operation would yield a different result. The left table does not include duplicates and it is therefore yielding the same result. However, that approach includes knowledge about the data while the leftsemi
join does not.
The cases for query 18 and 20 are more complex and instead of exists
they describe subqueries with in
and >
operators but the argument is pretty much the same.
I would actually encourage polars to use a semi join as well. Thoughts?
As an example, you are pointing here to Q1 and especially how the following line is being translated.
select
...
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
...
from
lineitem
...
In dask and pandas we currently don't have a way to express more complex columnar expressions like this. Instead, we do...
lineitem_filtered["sum_charge"] = (
lineitem_filtered.l_extendedprice
* (1 - lineitem_filtered.l_discount)
* (1 + lineitem_filtered.l_tax)
)
...
gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"])
total = gb.agg(
{
...
"sum_charge": "sum",
...
}
)
I fully agree that the way polars can describe this operation using expressions is more elegant. We currently don't support this but have it on our roadmap (see https://github.com/dask/dask-expr/issues/386). I see how optimizers could produce suboptimal results or how time spent on optimization could skew the results but I would argue that this is negligible. On the other hand, this does unlock a whole range of optimizations that we currently don't have access to so I would likely argue that engines like polars or spark have an advantage over dask and pandas because of this. I think we have to acknowledge this bias but I wouldn't want to stop comparing engines because of this. There is no DataFrame standard similar to the SQL ISO definition so we have to make the best out of what we have.
Thanks for engaging on this @fjetter. I think what we should try to achieve is to come up with a way for the dataframe queries that fit the SQL best, for both Polars and Dask in that matter.
Given your explanation of left_semi
as a tranlation to exist
. I understant and agree that that is a fair translation in that sense. I think the Polars query must be updated then to do the same. We haven't come to this translation and it might be wrong (as you mentioned on the duplicates).
I see how optimizers could produce suboptimal results or how time spent on optimization could skew the results but I would argue that this is negligible
Sorry, I don't think I follow. What do you think is neglible? If an optimizer doesn't come to certain conclusion a query can explode in runtime. For instance q5 doesn't even finish if we turn off optimizations.
On the other hand, we also do optimizations that are fairly expensive. For instance CSE. And coming to a correct conclusion that Dask applied to the group-by pre-aggregations is not per-se trivial. I understand that might be a current limitation in the API, but I do think these can have significant runtime effects if an optimizer cannot prove CSE can be applied earlier.
I think we agree on point 1. and 2. and can align the queries on those fronts. As I think it is important that we start from the same starting blocks (as much as possible).
For the filtering locations: these queries are all checked with narwhals to ensure they both compile down to the same eager queries with regard to filters:
Sorry, I don't think I follow. What do you think is neglible? If an optimizer doesn't come to certain conclusion a query can explode in runtime. For instance q5 doesn't even finish if we turn off optimizations.
sorry if I haven't made myself clear. I was specifically and exclusively referring to the introduction of column expressions. The introduction of column expressions would have to be interpreted by the optimizer which could yield suboptimal results and additional computation overhead compared to what we are doing right now since we are already writing down the aggregation in a different, more explicit form. However, I believe that the optimization overhead introduced by using columnar expressions opposed to a version without is negligible since this should not be the cost driving part of the optimization. I might be wrong, of course. I'm not familiar with how this is implemented in polars and we haven't invested much time to think about how it will be done in dask-expr so this may not be true.
When I'm talking about whether things are negligible, I'm also thinking about the very large runs on scale 1k or 10k which all run for minutes so if the optimizer is done in less than a second, I don't care much about how fast it is.
FWIW If we disabled query optimizations, dask wouldn't be able to run most of the queries at all. (We've been writing a bit about this over here)
I think they should be another point added as well. Join reordering.
This has huge impact over performance. Polars doesn't have join re-ordering optimizer yet, but still we choose to keep true to the SQL translation. However I see Dask queries having a manually chosen better ordering. This is not fair to Polars and SQL solutions as those need to figure out the join ordering themselves and if they don't the memory/runtime can explode.
I think that the current benchmarks are completely apples to peaches as it is now.
(as a heads-up, Florian is out on vacation this coming week, and may be less responsive)
On Sat, May 25, 2024 at 5:59 AM Ritchie Vink @.***> wrote:
I think they should be another point added as well. Join reordering.
This has huge impact over performance. Polars doesn't have join re-ordering optimizer yet, but still we choose to keep true to the SQL translation. However I see Dask queries having a manually chosen better ordering. This is not fair to Polars and SQL solutions as those need to figure out the join ordering themselves and if they don't the memory/runtime can explode.
I think that the current benchmarks are completely apples to peaches as it is now.
image.png (view on web) https://github.com/coiled/benchmarks/assets/3023000/3335f120-d497-49b0-880d-051c2ede6613
— Reply to this email directly, view it on GitHub https://github.com/coiled/benchmarks/issues/1515#issuecomment-2131212969, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTAF45GE2JPPFUI3CSTZEBVKLAVCNFSM6AAAAABIBQVFS6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZRGIYTEOJWHE . You are receiving this because you are subscribed to this thread.Message ID: @.***>
--
Matthew Rocklin CEO, Dask Maintainer
Yes, join ordering is something that is done manually to some extend right now. We've been transparent about this when publishing the results, see https://docs.coiled.io/blog/tpch.html#dask There is an open issue to support automatic join reordering in dask-expr here https://github.com/dask/dask-expr/issues/1065
The query you are pointing to is Q9. This is actually an interesting example. The functional SQL definition is
select
nation,
o_year,
round(sum(amount), 2) as sum_profit
from
(
select
n_name as nation,
year(o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
) as profit
group by
nation,
o_year
order by
nation,
o_year desc
Following this very strictly would mean you'd have to join part
to supplier
first. However, there is no relation defined between those two tables. The relation is introduced via partsupp
which only comes at fourth position so the ordering has to be amended to even make sense for an imperative API.
The relations between the tables are defined in the where-clause so we could instead also follow the where-clause ordering which would again look differently. I think the polars implementation is a subtle variation of the where-clause ordering but that may be accidental.
In this example, the reordering was required to make sense of the query. There are other examples where we did reorder manually to make the query faster. One example is Q18 where we started off with the same ordering as the tables listed in the from clause of the functional SQL. In this example, polars uses an ordering that is also different to how the tables are listed in SQL and we adopted this here instead for improved performance. There are a couple of other examples around where we looked at query ordering ourselves. For example, Q21 which is already pretty tricky to express and requires some interpretation due to using multiple correlated subqueries with self joins and I honestly don't know what "true" SQL ordering would look like.
So, you are very right. We are comparing apples to peaches again because we are comparing a declarative language to an imperative API. There is no universal, unambiguous set of rules (that I am aware of) that translates SQL to a DataFrame API. This example shows that even with a join order optimizer as part of the DataFrame API, this ambiguity still exists to some extend.
I'm sorry if this feels unfair to polars as the only other DataFrame API library. We just used the polars code that was in your repo at the time but haven't gone through optimizing it. I'm not sure how to do this better.
Yes, join ordering is something that is done manually to some extend right now. We've been transparent about this when publishing the results, see https://docs.coiled.io/blog/tpch.html#dask
A comment doesn't reach as far as a graph. ;) I just modified the q9 according to the dask query ordering and saw a 2.5x improvement on the Polars side.
I agree that doing fair benchmark between a declarative vs imperative API is hard. However, what I do think is important is to do the same wherever it is possible.
I'm sorry if this feels unfair to polars as the only other DataFrame API library. We just used the polars code that was in your repo at the time but haven't gone through optimizing it. I'm not sure how to do this better.
We can start by applying the same join orderings for the DataFrame API's, if you do a manual join reordering for Dask you should do it for Polars as well, and I don't think a comment is enough here as that means the chart will be misleading and those will be often shared without context.
I can do a pass in the Polars queries upstream to make them more similar to the Dask ones. And I think any hand written optimization should be tried to applied to other tools whenever possible. Upstream we will make an effort to do the same.
Similar #1498. I think that as the queries are currently written it isn't a fair comparison between DataFrame API's.
For SQL it is fair as the TPCH benchmark states that all engines should parse the same SQL. However for DataFrames with different API's this is ofcourse harder.
TPCH is not only a operation benchmark, it is also an optimization benchmark. If we compare against SQL engines, the DataFrame API's shouldn't enjoy more optimizations than the SQL queries.
For this reason we removed hand written optimizations on the polars tpch benchmarks in Polars, pandas, Dask and modin:
To make the benchmarks apples to apples I think this should be done in this repo as well. The Dask queries for instance show hand written "leftsemi"
pre-filtering:
< actually almost all queries in some form >
And pre-aggregation computation where this defined in the aggregation context.
In many group-by's:
For this one, we are lenient towards pandas API's upstream as there isn't a good way to describe this. But since
dask-expr
is developed I believe Dask should just inline this and leave this to the optimizer.If we compare DataFrame API's and SQL I truly believe it is important that we give all implementations a fair starting position. Optimization engines could come to the same conclusions, but have to spend compute to do so, and often will not come to the most optimal conclusions.