coiled / benchmarks

Fair DataFrame API vs DataFrame API vs SQL benchmarking. #1515

Open ritchie46 opened 1 month ago

ritchie46 commented 1 month ago

Similar to #1498. I think that, as the queries are currently written, this isn't a fair comparison between DataFrame APIs.

For SQL it is fair, as the TPC-H benchmark states that all engines should parse the same SQL. However, for DataFrames with different APIs this is of course harder.

TPC-H is not only an operation benchmark, it is also an optimization benchmark. If we compare against SQL engines, the DataFrame APIs shouldn't enjoy more optimizations than the SQL queries.

For this reason we removed hand-written optimizations from the Polars TPC-H benchmarks for Polars, pandas, Dask and Modin.

To make the benchmarks apples to apples, I think this should be done in this repo as well. The Dask queries, for instance, show:

hand-written "leftsemi" joins,

pre-filtering (actually in almost all queries in some form),

and pre-aggregation computation where this should be defined in the aggregation context, e.g. in many group-bys:

https://github.com/coiled/benchmarks/blob/13ebb9c72b1941c90b602e3aaea82ac18fafcddc/tests/tpch/dask_queries.py#L18

For this one, we are lenient towards the pandas APIs upstream, as there isn't a good way to describe this. But now that dask-expr is being developed, I believe Dask should just inline this and leave it to the optimizer.

If we compare DataFrame APIs and SQL, I truly believe it is important that we give all implementations a fair starting position. Optimization engines could come to the same conclusions, but they have to spend compute to do so, and often will not come to the optimal conclusions.

fjetter commented 1 month ago

Thanks for opening this issue and apologies for the lack of engagement on https://github.com/coiled/benchmarks/issues/1498 so far.

Our intention is not to publish any unfair results and we are interested in comparing engines on equal footing to the best of our ability.

You are mentioning a couple of different topics, and some of them require further discussion. From what I can tell, you are raising these problems:

  1. Premature filtering and projections
  2. leftsemi joins
  3. Definition of groupby aggregations

I will grant you that 1.) is indeed not correct in some situations. This wasn't done intentionally and we still have to address it. I think https://github.com/coiled/benchmarks/issues/1498 describes this problem already, so I will not open a separate issue.

I think the other two issues require a bit more debate.

Leftsemi

Dask is using a leftsemi statement in three queries, namely queries 4, 18 and 20.

Let's talk about Q4 in detail since it's the simplest example. The functional definition of the query according to the TPC-H specification is

select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '[DATE]'
    and o_orderdate < date '[DATE]' + interval '3' month
    and exists (
        select
            *
        from
            lineitem
        where
            l_orderkey = o_orderkey
            and l_commitdate < l_receiptdate
    ) 
group by
    o_orderpriority
order by
    o_orderpriority;

The leftsemi join method is used to implement the exists statement over a correlated subquery (see the where-clause condition l_orderkey = o_orderkey) in the above SQL expression. Translating this to a DataFrame is a little difficult since SQL is a declarative language and DataFrames are typically imperative.

My attempt to put into words what the above query is describing...

"Take every row of left (i.e. order) for which there exists a matching (l_orderkey = o_orderkey) row in right (i.e. lineitem) which also satisfies the condition l_commitdate < l_receiptdate".

This is pretty much the definition of a leftsemi join with a filter applied, isn't it?
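
As a rough sketch (not the benchmark code itself), the exists subquery of Q4 can be written as a filtered leftsemi join in a Dask/pandas-like API; date_from and date_to are hypothetical placeholders for the spec's [DATE] parameters:

# Sketch only: TPC-H column names; date_from/date_to are placeholders.
# Filter the subquery side: lineitem rows with l_commitdate < l_receiptdate.
flineitem = lineitem[lineitem.l_commitdate < lineitem.l_receiptdate]

# Filter orders on the date range from the outer WHERE clause.
forders = orders[(orders.o_orderdate >= date_from) & (orders.o_orderdate < date_to)]

# EXISTS over the correlated subquery: keep each orders row that has at least
# one matching lineitem row; a leftsemi join never duplicates left-side rows.
exists = forders.merge(
    flineitem, left_on="o_orderkey", right_on="l_orderkey", how="leftsemi"
)

# The outer group-by on o_orderpriority, count(*) and order-by then follow.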

Compare this to how Polars implements this query, namely joining the tables with a simple left join and subsequently dropping duplicates on ["o_orderpriority", "l_orderkey"]: that is semantically a different operation. If the left table already included duplicates, it would yield a different result. The left table does not include duplicates, so it happens to yield the same result here. However, that approach bakes in knowledge about the data, while the leftsemi join does not.

The cases for queries 18 and 20 are more complex; instead of exists they describe subqueries with in and > operators, but the argument is pretty much the same.

I would actually encourage polars to use a semi join as well. Thoughts?

Groupby aggregations

As an example, you are pointing here to Q1 and especially how the following line is being translated.

select
    ...
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, 
    ...
from
    lineitem
...

In dask and pandas we currently don't have a way to express more complex columnar expressions like this. Instead, we do...

lineitem_filtered["sum_charge"] = (
    lineitem_filtered.l_extendedprice
    * (1 - lineitem_filtered.l_discount)
    * (1 + lineitem_filtered.l_tax)
)
...
gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"])

total = gb.agg(
    {
        ...
        "sum_charge": "sum",
        ...
    }
)
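
For comparison, here is a minimal sketch of how the same sum_charge aggregation can be written with Polars column expressions inside the aggregation context (illustrative only, assuming a recent Polars API; this is not taken from the benchmark repo):

import polars as pl

# The arithmetic lives inside the aggregation itself, so any rewriting
# (e.g. pre-computing the product before the group-by) is left to the optimizer.
total = lineitem_filtered.group_by(["l_returnflag", "l_linestatus"]).agg(
    (
        pl.col("l_extendedprice")
        * (1 - pl.col("l_discount"))
        * (1 + pl.col("l_tax"))
    )
    .sum()
    .alias("sum_charge")
)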

I fully agree that the way Polars can describe this operation using expressions is more elegant. We currently don't support this but have it on our roadmap (see https://github.com/dask/dask-expr/issues/386).

I see how optimizers could produce suboptimal results or how time spent on optimization could skew the results, but I would argue that this is negligible. On the other hand, this does unlock a whole range of optimizations that we currently don't have access to, so I would argue that engines like Polars or Spark have an advantage over Dask and pandas because of this. I think we have to acknowledge this bias, but I wouldn't want to stop comparing engines because of it. There is no DataFrame standard similar to the SQL ISO definition, so we have to make the best out of what we have.

ritchie46 commented 1 month ago

Thanks for engaging on this, @fjetter. I think what we should try to achieve is a way of writing the DataFrame queries that fits the SQL best, for both Polars and Dask for that matter.

left semi

Given your explanation of left_semi as a translation of exists, I understand and agree that it is a fair translation in that sense. I think the Polars query should then be updated to do the same. We hadn't come to this translation ourselves, and our version might be wrong (as you mentioned regarding the duplicates).
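
For reference, a semi join in Polars would look roughly like this (a sketch with hypothetical frame names such as filtered_orders, not the final upstream query):

import polars as pl

# Rows from the subquery side that satisfy the extra condition.
flineitem = lineitem.filter(pl.col("l_commitdate") < pl.col("l_receiptdate"))

# Keep only orders rows that have at least one matching lineitem row.
result = filtered_orders.join(
    flineitem, left_on="o_orderkey", right_on="l_orderkey", how="semi"
)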

Optimization

I see how optimizers could produce suboptimal results or how time spent on optimization could skew the results but I would argue that this is negligible

Sorry, I don't think I follow. What do you think is negligible? If an optimizer doesn't come to a certain conclusion, a query can explode in runtime. For instance, Q5 doesn't even finish if we turn off optimizations.

On the other hand, we also do optimizations that are fairly expensive, for instance CSE (common subexpression elimination). And automatically arriving at the rewrite that Dask applied by hand to the group-by pre-aggregations is not per se trivial. I understand that might be a current limitation in the API, but I do think these can have significant runtime effects if an optimizer cannot prove that CSE can be applied earlier.

Aligning queries.

I think we agree on points 1. and 2. and can align the queries on those fronts, as I think it is important that we start from the same starting blocks (as much as possible).

For the filtering locations: these queries are all checked with narwhals to ensure they all compile down to the same eager queries with regard to filters:

https://github.com/pola-rs/tpch/tree/main/queries/dask

fjetter commented 1 month ago

Sorry, I don't think I follow. What do you think is negligible? If an optimizer doesn't come to a certain conclusion, a query can explode in runtime. For instance, Q5 doesn't even finish if we turn off optimizations.

Sorry if I haven't made myself clear. I was specifically and exclusively referring to the introduction of column expressions. Column expressions would have to be interpreted by the optimizer, which could yield suboptimal results and additional computation overhead compared to what we are doing right now, since we already write down the aggregation in a different, more explicit form. However, I believe that the optimization overhead introduced by using columnar expressions, as opposed to a version without them, is negligible, since this should not be the cost-driving part of the optimization. I might be wrong, of course. I'm not familiar with how this is implemented in Polars, and we haven't invested much time in thinking about how it will be done in dask-expr, so this may not be true.

When I'm talking about whether things are negligible, I'm also thinking about the very large runs on scale 1k or 10k which all run for minutes so if the optimizer is done in less than a second, I don't care much about how fast it is.

FWIW, if we disabled query optimizations, Dask wouldn't be able to run most of the queries at all. (We've been writing a bit about this over here.)

ritchie46 commented 1 month ago

I think there is another point that should be added as well: join reordering.

This has a huge impact on performance. Polars doesn't have a join-reordering optimizer yet, but we still chose to stay true to the SQL translation. However, I see the Dask queries using a manually chosen, better ordering. This is not fair to Polars and the SQL solutions, as those need to figure out the join ordering themselves, and if they don't, the memory use/runtime can explode.

I think the current benchmarks are completely apples to peaches as they stand.

mrocklin commented 1 month ago

(as a heads-up, Florian is out on vacation this coming week, and may be less responsive)

fjetter commented 1 month ago

Yes, join ordering is something that is done manually to some extent right now. We've been transparent about this when publishing the results (see https://docs.coiled.io/blog/tpch.html#dask). There is an open issue to support automatic join reordering in dask-expr: https://github.com/dask/dask-expr/issues/1065

The query you are pointing to is Q9. This is actually an interesting example. The functional SQL definition is

select
        nation,
        o_year,
        round(sum(amount), 2) as sum_profit
    from
        (
            select
                n_name as nation,
                year(o_orderdate) as o_year,
                l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
            from
                part,
                supplier,
                lineitem,
                partsupp,
                orders,
                nation
            where
                s_suppkey = l_suppkey
                and ps_suppkey = l_suppkey
                and ps_partkey = l_partkey
                and p_partkey = l_partkey
                and o_orderkey = l_orderkey
                and s_nationkey = n_nationkey
                and p_name like '%green%'
        ) as profit
    group by
        nation,
        o_year
    order by
        nation,
        o_year desc

Following this very strictly would mean you'd have to join part to supplier first. However, there is no relation defined between those two tables. The relation is introduced via partsupp, which only comes in fourth position, so the ordering has to be amended to even make sense for an imperative API. The relations between the tables are defined in the where-clause, so we could instead also follow the where-clause ordering, which would again look different. I think the Polars implementation is a subtle variation of the where-clause ordering, but that may be accidental.
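
To make the ambiguity concrete, here is a rough pandas-style sketch of a join chain that follows the where-clause ordering (illustrative only; TPC-H column names, with the p_name filter and the amount/group-by logic omitted):

# Follow the join predicates in the order they appear in the where-clause.
joined = (
    lineitem
    .merge(supplier, left_on="l_suppkey", right_on="s_suppkey")      # s_suppkey = l_suppkey
    .merge(
        partsupp,
        left_on=["l_suppkey", "l_partkey"],
        right_on=["ps_suppkey", "ps_partkey"],
    )                                                                # ps_* = l_*
    .merge(part, left_on="l_partkey", right_on="p_partkey")          # p_partkey = l_partkey
    .merge(orders, left_on="l_orderkey", right_on="o_orderkey")      # o_orderkey = l_orderkey
    .merge(nation, left_on="s_nationkey", right_on="n_nationkey")    # s_nationkey = n_nationkey
)

# A from-clause ordering would instead start from part and supplier, which share
# no direct join key, so some amount of reordering is unavoidable either way.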

In this example, the reordering was required to make sense of the query. There are other examples where we did reorder manually to make the query faster. One example is Q18, where we started off with the same ordering as the tables listed in the from clause of the functional SQL. In that case, Polars uses an ordering that is also different from how the tables are listed in the SQL, and we adopted it here instead for improved performance. There are a couple of other examples where we looked at query ordering ourselves, for example Q21, which is already pretty tricky to express and requires some interpretation due to its use of multiple correlated subqueries with self joins; I honestly don't know what the "true" SQL ordering would look like there.

So, you are very right. We are comparing apples to peaches again because we are comparing a declarative language to an imperative API. There is no universal, unambiguous set of rules (that I am aware of) that translates SQL to a DataFrame API. This example shows that even with a join-order optimizer as part of the DataFrame API, this ambiguity still exists to some extent.

I'm sorry if this feels unfair to polars as the only other DataFrame API library. We just used the polars code that was in your repo at the time but haven't gone through optimizing it. I'm not sure how to do this better.

ritchie46 commented 3 weeks ago

Yes, join ordering is something that is done manually to some extent right now. We've been transparent about this when publishing the results, see https://docs.coiled.io/blog/tpch.html#dask

A comment doesn't reach as far as a graph. ;) I just modified Q9 according to the Dask query ordering and saw a 2.5x improvement on the Polars side.

I agree that doing a fair benchmark between a declarative and an imperative API is hard. However, what I do think is important is to do the same wherever possible.

I'm sorry if this feels unfair to polars as the only other DataFrame API library. We just used the polars code that was in your repo at the time but haven't gone through optimizing it. I'm not sure how to do this better.

We can start by applying the same join orderings for the DataFrame APIs: if you do a manual join reordering for Dask, you should do it for Polars as well. And I don't think a comment is enough here, as that means the chart will be misleading, and charts are often shared without context.

I can do a pass over the Polars queries upstream to make them more similar to the Dask ones. And I think any hand-written optimization should be applied to the other tools whenever possible. Upstream we will make an effort to do the same.