apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

TPCH, Query 18 and 17 very slow #5646

Closed: djouallah closed this issue 1 year ago

djouallah commented 1 year ago

I was running TPCH_SF5 just for fun, and I noticed that queries 17 and 18 are very slow.

Full reproducible example:

Another issue: when I increase the scale factor to 10, I start getting OOM errors.

https://colab.research.google.com/drive/1WJ2ICxJyAYClkDx8guGX-TcOMnf8SBxr#scrollTo=z494Cl6XKUVX

DataFusion 26 gives the following result against DuckDB:

[screenshot: DataFusion vs DuckDB benchmark results]

mingmwang commented 1 year ago

I will take a look.

mingmwang commented 1 year ago

Working on it now. I am not sure whether it is a regression or whether those two queries have always been slow.

mingmwang commented 1 year ago

The bottleneck of q17 should be the aggregation.

Dandandan commented 1 year ago

A profile run using flamegraph on my machine shows:

`arrow_cast::cast::cast_decimal_to_decimal` is consuming about 1/4 of the time of q17.

Dandandan commented 1 year ago

[flamegraph image]

Dandandan commented 1 year ago

FYI @viirya

mingmwang commented 1 year ago

One reason for so many `downcast_value` calls is that the grouping column `l_partkey` has high cardinality, which makes the vectorization almost useless.

Dandandan commented 1 year ago

The most expensive part is the line `let values = &cast(values, sum_type)?` in `sum_batch`, which performs this casting.

I guess we should move that evaluation (or other parts of `sum_batch` as well) up, so it is done before the grouping rather than after, so it stays vectorized.
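
To illustrate the idea, here is a standalone sketch with plain vectors standing in for Arrow arrays; the function names and group layout are simplified stand-ins, not the actual `sum_batch` code:

```rust
/// Stand-in for an Arrow cast kernel: widen i32 values to i64
/// (the real code casts Decimal128 arrays via arrow's cast).
fn cast_to_i64(values: &[i32]) -> Vec<i64> {
    values.iter().map(|&v| v as i64).collect()
}

/// Current shape: each group's (often tiny) slice is cast separately inside the
/// per-group sum, so with high-cardinality grouping the cast runs once per group
/// and is barely vectorized.
fn sum_per_group_cast_inside(groups: &[Vec<i32>]) -> Vec<i64> {
    groups
        .iter()
        .map(|g| cast_to_i64(g).iter().sum::<i64>())
        .collect()
}

/// Proposed shape: cast the whole input column once, before grouping, so the cast
/// stays a single large vectorized pass; grouping then sums already-cast values.
fn sum_per_group_cast_first(values: &[i32], group_ids: &[usize], num_groups: usize) -> Vec<i64> {
    let cast = cast_to_i64(values); // one cast over the full batch
    let mut sums = vec![0i64; num_groups];
    for (v, &g) in cast.iter().zip(group_ids) {
        sums[g] += v;
    }
    sums
}

fn main() {
    let values = vec![1, 2, 3, 4, 5];
    let group_ids = vec![0, 1, 0, 2, 1];
    let grouped = vec![vec![1, 3], vec![2, 5], vec![4]];
    // Both shapes compute the same per-group sums; only where the cast runs differs.
    assert_eq!(
        sum_per_group_cast_inside(&grouped),
        sum_per_group_cast_first(&values, &group_ids, 3)
    );
}
```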

Dandandan commented 1 year ago

Another observation I have is that the plan does some unnecessary casting:

```
Projection: CAST(SUM(lineitem.l_extendedprice) AS Float64) / Float64(7) AS avg_yearly
  Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]
    Projection: lineitem.l_extendedprice
      Inner Join: part.p_partkey = __scalar_sq_3.l_partkey Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_3.__value AS Decimal128(30, 15))
        Projection: lineitem.l_quantity, lineitem.l_extendedprice, part.p_partkey
          Inner Join: lineitem.l_partkey = part.p_partkey
            TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice]
            Projection: part.p_partkey
              Filter: part.p_brand = Utf8("Brand#23") AND part.p_container = Utf8("MED BOX")
                TableScan: part projection=[p_partkey, p_brand, p_container], partial_filters=[part.p_brand = Utf8("Brand#23"), part.p_container = Utf8("MED BOX")]
        SubqueryAlias: __scalar_sq_3
          Projection: lineitem.l_partkey, Float64(0.2) * CAST(AVG(lineitem.l_quantity) AS Float64) AS __value
            Aggregate: groupBy=[[lineitem.l_partkey]], aggr=[[AVG(lineitem.l_quantity)]]
              TableScan: lineitem projection=[l_partkey, l_quantity]
```

1. inside the aggregation (in `sum_batch`)
2. cast to float for `__scalar_sq_3.__value`. I guess because 0.2 is assumed to be a float.
3. cast back to decimal for the join filter
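
For reference, the plan above corresponds to the standard TPC-H Q17 text, where the `0.2` and `7.0` literals are what introduce the float coercions. A minimal sketch of running it through the DataFusion Rust API (the parquet paths and table registration below are assumptions, not taken from the notebook):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Assumed file locations; point these at the generated TPC-H data.
    ctx.register_parquet("lineitem", "data/lineitem.parquet", ParquetReadOptions::default())
        .await?;
    ctx.register_parquet("part", "data/part.parquet", ParquetReadOptions::default())
        .await?;

    // Standard TPC-H Q17; the float literals 0.2 and 7.0 are the source of the
    // decimal <-> float casts seen in the plan above.
    let df = ctx
        .sql(
            "SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly
             FROM lineitem, part
             WHERE p_partkey = l_partkey
               AND p_brand = 'Brand#23'
               AND p_container = 'MED BOX'
               AND l_quantity < (
                   SELECT 0.2 * AVG(l_quantity)
                   FROM lineitem
                   WHERE l_partkey = p_partkey
               )",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```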
mingmwang commented 1 year ago

@Dandandan https://github.com/apache/arrow-datafusion/pull/5866

jackwener commented 1 year ago

> Another observation I have is that the plan does some unnecessary casting:

@Dandandan Yes, I also noticed this problem; it's related to type coercion.

It isn't an easy problem. Type coercion in PostgreSQL is a top-down process, so a parent can request a type from its children; it's similar to interesting order / enforcement. But current DataFusion does type coercion bottom-up, like MySQL, which isn't good enough. I plan to improve type coercion in the future, following PG and Spark.

BTW, #5831 is also related to this q17; it moves the cast from expression evaluation into the subplan.

Dandandan commented 1 year ago

@jackwener Nice, thank you

Dandandan commented 1 year ago

@jackwener this is all resolved now, right?

djouallah commented 1 year ago

DataFusion is making great progress, but it is still not solved. Version: 26.0.0

[screenshot: benchmark results with DataFusion 26]

mingmwang commented 1 year ago

I am still working on it.

Dandandan commented 1 year ago

Query 18 should also be considerably faster in the next DataFusion version, because of join-related improvements (a data structure improvement and vectorized collision checks). I'll be looking into what we can do further.

Dandandan commented 1 year ago

One suggestion that will yield some (smaller) performance improvement for query 18 (and most other queries): https://github.com/apache/arrow-datafusion/issues/6768

alamb commented 1 year ago

There is quite a lot of recent work / proposals to make the grouping significantly faster for these queries. See https://github.com/apache/arrow-datafusion/issues/4973

djouallah commented 1 year ago

Using Python datafusion 27, there are unfortunately still issues with Query 17. My VM has 8 cores and 64 GB of RAM, and Query 17 got an OOM.

[screenshot: Query 17 OOM]

alamb commented 1 year ago

I expect Q17 to go about 2x faster and use much less memory when we merge our most recent work -- see https://github.com/apache/arrow-datafusion/pull/6800#issuecomment-1627412684 for details

djouallah commented 1 year ago

I am doing this experimentation using a Fabric notebook, and DataFusion is doing alright. I would really love to start seeing numbers for 8 cores, as currently the 8-core VM has 64 GB yet DF gets an OOM :(

[screenshot: benchmark results]

Dandandan commented 1 year ago

Thanks @djouallah - the new GroupHashAggregate approach will also (drastically) reduce memory usage.

alamb commented 1 year ago

BTW I think the reason DF's memory usage increases with the number of cores is that the first partial aggregate phase uses RoundRobin repartitioning (and thus each hash table has an entry for all the groups).

To avoid this, we would need to hash-repartition the input based on the group keys so that different partitions see different subsets of the group keys.
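
To put rough numbers on that (illustrative only; at TPC-H SF100 the part table has 100 × 200,000 = 20 million rows, so l_partkey has about 20 million distinct values): with round-robin repartitioning across 8 partitions, each partial hash table can end up holding all ~20M groups, roughly 160M entries in total, while hash-repartitioning on l_partkey would bound the total at ~20M entries split across the 8 partitions.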

djouallah commented 1 year ago

@alamb the graph shows the overall duration to finish tpch_sf100 based on the number of cores; DataFusion is faster than Spark even when using only 1 VM ;)

alamb commented 1 year ago

If you get a chance to test with the latest DataFusion (it will be in 28.0.0, ETA probably next week), I expect performance for high-cardinality grouping to be much better due to https://github.com/apache/arrow-datafusion/pull/6904

alamb commented 1 year ago

> BTW I think the reason DF's memory usage increases with the number of cores is that the first partial aggregate phase uses RoundRobin repartitioning (and thus each hash table has an entry for all the groups).

I wrote up an issue describing this here: https://github.com/apache/arrow-datafusion/issues/6937

djouallah commented 1 year ago

Using version 28, query 8 starts getting errors:

https://colab.research.google.com/drive/1KzofqAWJxVTboNcywGxSbIgLNatkIsM2

```
Query8
Arrow error: Compute error: Overflow happened on: 136581719431 * 100000000000000000000000000000000000000
```

Dandandan commented 1 year ago

Sounds like https://github.com/apache/arrow-datafusion/issues/6794 Cc @viirya

djouallah commented 1 year ago

Good job btw on Query 17 and 18; unfortunately, when RAM is limited, DataFusion still gets an OOM for Query 18.

[screenshot: Query 18 OOM under limited RAM]

viirya commented 1 year ago

> Sounds like #6794 Cc @viirya

I suppose the decimal multiplication and division precision change at #6832 will fix that.
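
For context, a rough reading of the error (my interpretation, not stated in the thread): Decimal128 stores its value in a signed 128-bit integer, whose maximum is about 1.7 × 10^38. The second operand above is a power of ten with roughly 38 zeros, which looks like a scale-adjustment factor, so the product is on the order of 10^49 and cannot fit in an i128, which is presumably why changing how precision and scale are chosen for decimal multiplication and division can avoid the overflow.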

Dandandan commented 1 year ago

> Good job btw on Query 17 and 18; unfortunately, when RAM is limited, DataFusion still gets an OOM for Query 18.

Nice, getting close to DuckDB. Hyper is incredible for some queries!

djouallah commented 1 year ago

@Dandandan Especially when using their native format, Hyper just literally doesn't care about RAM. I use it with the free Colab, and it did finish tpch_sf110 !!! just fine. What do you think they are doing differently?

Dandandan commented 1 year ago

> @Dandandan Especially when using their native format, Hyper just literally doesn't care about RAM. I use it with the free Colab, and it did finish tpch_sf110 !!! just fine. What do you think they are doing differently?

Hard to say in general, but they do some optimizations we don't do, or do better planning (e.g. for join selection). Using a different format instead of Parquet might help as well, as Parquet can be slower to decode / decompress than formats optimized for query performance.

alamb commented 1 year ago

> !! just fine, what do you think they are doing different ?

It is also probably good to point out that Hyper is largely a research system, and TPC-H is one of the standard benchmark sets used for research systems.

Thus I suspect a lot of effort has gone into making the patterns that appear in TPC-H very fast (e.g. left-deep join trees with very selective predicates). That is not to say that the optimizations are entirely TPC-H specific, but it wouldn't surprise me if, in general-purpose use, DataFusion's performance is much closer (or better).

Dandandan commented 1 year ago

> !! just fine, what do you think they are doing different ?
>
> It is also probably good to point out that Hyper is largely a research system, and TPC-H is one of the standard benchmark sets used for research systems.
>
> Thus I suspect a lot of effort has gone into making the patterns that appear in TPC-H very fast (e.g. left-deep join trees with very selective predicates). That is not to say that the optimizations are entirely TPC-H specific, but it wouldn't surprise me if, in general-purpose use, DataFusion's performance is much closer (or better).

Hyper-db is developed by Tableau now, so it has probably had some improvements over the last few years compared to the "research system". Some release notes from the last couple of years: https://tableau.github.io/hyper-db/docs/releases

djouallah commented 1 year ago

It seems there is a regression with query 18: it used to work fine with tpch100 using 124 GB of RAM, but now the notebook crashes when using DF31 !!!

Edit: never mind, it was a temporary glitch.

alamb commented 1 year ago

Given the long history of this issue, I think it is hard to understand what, if anything, it is tracking. I suggest we close it and file another issue to continue discussing additional performance improvements

djouallah commented 1 year ago

The issue is very clear: performance of Query 17 and 18 is still very slow compared to other in-process engines !!

[screenshot: benchmark comparison]

djouallah commented 1 year ago

Btw, DF31 is making great progress. My impression is that as long as the data fits in memory, the performance is very similar to DuckDB. Here I am reading Parquet files from Azure storage; the main issue starts with the 8-core VM, which has 64 GB: Query 18 crashes the notebook.

[screenshot: benchmark results]

alamb commented 1 year ago

BTW I think the core problem here is that DataFusion's parallel hash grouping builds the entire hash table for each input partition -- thus it requires memory proportional to the number of cores. This is tracked more in https://github.com/apache/arrow-datafusion/issues/6937

alamb commented 1 year ago

I spent some time this morning analyzing in detail why Q17 and Q18 are slow (in the context of #6782). My analysis shows we could close most of the gap with a better join order:

korowa commented 1 year ago

The issue seems to be (at least partially) related to FilterExec returning unknown statistics in case of unsupported filter predicates -- I wonder if it would be better / more correct to rely on the worst-case scenario for such filters and simply propagate the input statistics -- it seems to be enough to fix the Q17 plan.

alamb commented 1 year ago

> I wonder if it would be better / more correct to rely on the worst-case scenario for such filters and simply propagate the input statistics

Perhaps the filter can simply switch to `Precision::Inexact` when it can't analyze the selectivity of the expression.

Another thing I have seen in the past is to heuristically pick a constant selectivity (assume the filter removes 1/2 of the rows). However, I think this leads to non-robust plans (sometimes the plans are good, sometimes they are bad, and it is hard to predict when each case is hit).
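
To sketch that first option (a standalone illustration; the simplified `Precision` enum below only mimics DataFusion's statistics types and is not the actual `FilterExec` implementation):

```rust
/// Simplified stand-in for DataFusion's Precision<T> (illustration only).
#[derive(Clone, Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// Downgrade an exact value to an inexact one, keeping the estimate.
    fn to_inexact(self) -> Self {
        match self {
            Precision::Exact(v) => Precision::Inexact(v),
            other => other,
        }
    }
}

/// Worst-case statistics for a filter whose predicate selectivity cannot be
/// analyzed: keep the input's row-count estimate, but mark it inexact instead
/// of discarding the information entirely.
fn filter_output_rows(input_rows: Precision<usize>) -> Precision<usize> {
    input_rows.to_inexact()
}

fn main() {
    // The scan knows it produces exactly 6_000_000 rows; after an
    // un-analyzable filter we still report ~6_000_000, just inexact.
    let out = filter_output_rows(Precision::Exact(6_000_000usize));
    assert_eq!(out, Precision::Inexact(6_000_000usize));
    println!("{out:?}");
}
```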

alamb commented 1 year ago

I filed https://github.com/apache/arrow-datafusion/issues/8078 with a proposal of a more precise way to represent inexact statistics

djouallah commented 1 year ago

Thank you very much for your work. I am happy with DataFusion 33's performance; it now finishes TPCH_SF100 using 64 GB of RAM in Fabric.