I will take a look.
Working on it now. I am not sure whether it is a regression or whether those two queries have always been slow.
The bottleneck of q17 should be the Aggregation.
A profile run using flamegraph on my machine shows that arrow_cast::cast::cast_decimal_to_decimal consumes about 1/4 of the time of q17.
FYI @viirya
One reason for so many downcast_value calls is that the grouping column l_partkey has high cardinality, which makes the vectorization almost useless.
The most expensive part is the line let values = &cast(values, sum_type)? in sum_batch, which performs this casting.
I guess we should move that evaluation (or other parts of sum_batch as well) up, so it is done before the grouping rather than after, and thus stays vectorized.
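A minimal sketch of that idea, using arrow-rs's cast kernel directly (the input column, precisions, and sum type here are illustrative assumptions, not DataFusion's actual accumulator code):

```rust
// Illustrative only: cast the aggregate's input column to the wider sum type
// once per RecordBatch, before rows are split into per-group states, so the
// cast stays a single vectorized kernel call instead of running per group.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Decimal128Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn precast_sum_input(values: &ArrayRef, sum_type: &DataType) -> Result<ArrayRef, ArrowError> {
    // One vectorized cast per batch, rather than `cast(values, sum_type)`
    // inside sum_batch after the batch has been partitioned by group key.
    cast(values.as_ref(), sum_type)
}

fn main() -> Result<(), ArrowError> {
    // Hypothetical input: l_extendedprice as Decimal128(15, 2).
    let input: ArrayRef = Arc::new(
        Decimal128Array::from(vec![Some(1234_i128), Some(5678_i128)])
            .with_precision_and_scale(15, 2)?,
    );
    // Assumed sum type: SUM over a decimal column widens the precision.
    let widened = precast_sum_input(&input, &DataType::Decimal128(38, 2))?;
    println!("{:?}", widened.data_type());
    Ok(())
}
```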
Another observation I have is that the plan does some unnecessary casting:
Projection: CAST(SUM(lineitem.l_extendedprice) AS Float64) / Float64(7) AS avg_yearly
  Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]
    Projection: lineitem.l_extendedprice
      Inner Join: part.p_partkey = __scalar_sq_3.l_partkey Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_3.__value AS Decimal128(30, 15))
        Projection: lineitem.l_quantity, lineitem.l_extendedprice, part.p_partkey
          Inner Join: lineitem.l_partkey = part.p_partkey
            TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice]
            Projection: part.p_partkey
              Filter: part.p_brand = Utf8("Brand#23") AND part.p_container = Utf8("MED BOX")
                TableScan: part projection=[p_partkey, p_brand, p_container], partial_filters=[part.p_brand = Utf8("Brand#23"), part.p_container = Utf8("MED BOX")]
        SubqueryAlias: __scalar_sq_3
          Projection: lineitem.l_partkey, Float64(0.2) * CAST(AVG(lineitem.l_quantity) AS Float64) AS __value
            Aggregate: groupBy=[[lineitem.l_partkey]], aggr=[[AVG(lineitem.l_quantity)]]
              TableScan: lineitem projection=[l_partkey, l_quantity]
Note the cast of lineitem.l_quantity and __scalar_sq_3.__value to Decimal128(30, 15) in the join filter. I guess because 0.2 is assumed to be a float.
@Dandandan, yes, I also noticed this problem; it is related to type coercion.
It isn't an easy problem: type coercion in PG is a top-down process, so the parent can request a type from its children. It is similar to interesting order / enforcement. But currently DataFusion does type coercion bottom-up, like MySQL, which isn't good enough.
I plan to improve type coercion in the future, following PG and Spark.
BTW, #5831 is also related to q17; it moves the cast from expression evaluation into a subplan.
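A toy sketch of the difference (plain Rust; the Type and Expr enums are made up for illustration and are not DataFusion's analyzer types): with top-down coercion the parent can ask the literal 0.2 to match the decimal column's type, instead of widening the column to Float64 the way a bottom-up pass tends to.

```rust
// Toy model of top-down type coercion: the parent requests a type and a child
// rewrites itself to satisfy it, so only the literal gets cast, rather than
// casting the decimal column up to Float64.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Decimal { precision: u8, scale: i8 },
    Float64,
}

#[derive(Debug)]
enum Expr {
    Column { name: String, ty: Type },
    Literal { value: f64, ty: Type },
    Cast { expr: Box<Expr>, to: Type },
}

impl Expr {
    fn ty(&self) -> Type {
        match self {
            Expr::Column { ty, .. } | Expr::Literal { ty, .. } => ty.clone(),
            Expr::Cast { to, .. } => to.clone(),
        }
    }

    // Top-down: wrap this node in a cast only if the requested type differs.
    fn coerce_to(self, requested: &Type) -> Expr {
        if &self.ty() == requested {
            self
        } else {
            Expr::Cast { expr: Box::new(self), to: requested.clone() }
        }
    }
}

fn main() {
    let quantity = Expr::Column {
        name: "l_quantity".into(),
        ty: Type::Decimal { precision: 15, scale: 2 },
    };
    let point_two = Expr::Literal { value: 0.2, ty: Type::Float64 };

    // The comparison's parent requests the column's type for both sides,
    // so only the literal is rewritten.
    let target = quantity.ty();
    let coerced_literal = point_two.coerce_to(&target);
    println!("{:?} < {:?}", quantity, coerced_literal);
}
```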
@jackwener Nice, thank you
@jackwener this is all resolved now, right?
DataFusion is making great progress, but this is still not solved. Version: 26.0.0
I am still working on it.
Query 18 should also be considerably faster in the next DataFusion version, because of join-related improvements (a data structure improvement and vectorized collision checks). I'll be looking into what we can do further.
One suggestion that will yield some (smaller) performance improvement for query 18 (and most other queries): https://github.com/apache/arrow-datafusion/issues/6768
There is quite a lot of recent work / proposals to make the grouping significantly faster for these queries. See https://github.com/apache/arrow-datafusion/issues/4973
Using Python datafusion 27, there are unfortunately still issues with Query 17: my VM has 8 cores and 64 GB of RAM, and Query 17 got an OOM.
I expect Q17 to go about 2x faster and use much less memory when we merge our most recent work -- see https://github.com/apache/arrow-datafusion/pull/6800#issuecomment-1627412684 for details
I am doing this experimentation using a Fabric notebook, and DataFusion is doing alright. I would really love to start seeing numbers for 8 cores, as the 8-core machine currently has 64 GB yet DF hits an OOM :(
Thanks @djouallah - the new GroupHashAggregate approach will also (drastically) reduce memory usage.
BTW I think the reason DF's memory usage increases with the number of cores is that the first partial aggregate phase uses RoundRobin repartitioning (and thus each hash table has an entry for all the groups).
To avoid this, we would need to hash-repartition the input based on the group keys, so that different partitions see different subsets of the group keys.
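A minimal sketch of that idea (plain Rust, not DataFusion's RepartitionExec; the row data is made up):

```rust
// Routing rows by a hash of the group key means each partition only ever sees
// its own subset of keys, so per-partition aggregation hash tables do not all
// grow to the full group cardinality the way they do under round-robin
// repartitioning.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn partition_for(key: &str, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_partitions
}

fn main() {
    // Hypothetical (l_partkey, l_quantity) rows.
    let rows = [("part#1", 10_i64), ("part#2", 7), ("part#1", 3), ("part#3", 1)];
    let num_partitions = 4;
    let mut partitions: Vec<Vec<(&str, i64)>> = vec![Vec::new(); num_partitions];
    for (key, qty) in rows {
        // Unlike round-robin, the same key always lands in the same partition.
        partitions[partition_for(key, num_partitions)].push((key, qty));
    }
    for (i, part) in partitions.iter().enumerate() {
        println!("partition {i}: {part:?}");
    }
}
```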
@alamb the graph shows the overall duration to finish tpch_sf100 based on the number of cores; DataFusion is faster than Spark even when using only 1 VM ;)
If you get a chance to test with the latest datafusion (will be in 28.0.0, eta probably next week) I expect performance for high cardinality grouping to be much better due to https://github.com/apache/arrow-datafusion/pull/6904
BTW I think the reason DF's memory usage increases with the number of cores is that the first partial aggregate phase uses RoundRobin repartitioning (and thus each hash table has an entry for all the groups).
I wrote up an issue describing this here: https://github.com/apache/arrow-datafusion/issues/6937
Using version 28, query 8 starts getting errors:
https://colab.research.google.com/drive/1KzofqAWJxVTboNcywGxSbIgLNatkIsM2
Query 8:
Arrow error: Compute error: Overflow happened on: 136581719431 * 100000000000000000000000000000000000000
Sounds like https://github.com/apache/arrow-datafusion/issues/6794 Cc @viirya
Good job btw on Query 17 and 18; unfortunately, when the RAM is limited, DataFusion still gets an OOM for Query 18.
Sounds like #6794 Cc @viirya
I suppose that the decimal multiplication and division precision change at #6832 will fix that.
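For intuition only (plain i128 arithmetic, not DataFusion's decimal kernels): Decimal128 values are backed by a 128-bit integer, and multiplying the raw value from the error message by a rescaling power of ten of roughly the size shown simply does not fit in that range, which is the kind of overflow the precision changes are meant to avoid.

```rust
// Rough intuition for the Q8 error: the product overflows the i128 that backs
// Decimal128.
fn main() {
    let lhs: i128 = 136_581_719_431; // raw value from the error message
    let rhs: i128 = 10_i128.pow(38); // a rescaling factor of roughly the size shown
    match lhs.checked_mul(rhs) {
        Some(product) => println!("fits in i128: {product}"),
        None => println!("overflow: {lhs} * {rhs} does not fit in i128"),
    }
}
```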
Nice, getting close to DuckDB. hyper is incredible for some queries!
@Dandandan especially when using their native format, Hyper just literally doesn't care about RAM. I use it with the free Colab, and it finished tpch_sf110 just fine!!! What do you think they are doing differently?
Hard to say in general, but they do some optimizations we don't do, or do better planning (e.g. for join selection). Using a different format instead of parquet might help as well, as parquet can be slower to decode / decompress than formats optimized for query performance.
It is also probably good to point out that hyper is largely a research system, and one of the standard benchmark sets used for research systems is TPCH.
Thus I suspect a lot of effort has gone into making the patterns that appear in TPCH very fast (e.g. left-deep join trees with very selective predicates). That is not to say that the optimizations are entirely TPCH specific, but it wouldn't surprise me if in general-purpose use DataFusion performs much closer (or better).
Hyper-db is developed by Tableau now, so it probably has some improvements over the last years compared to the "research system". Some release notes from the last couple of years: https://tableau.github.io/hyper-db/docs/releases
It seems there is a regression with query 18: it used to work fine with tpch100 using 124 GB of RAM, but now the notebook crashes when using DF31 !!!
Edit: never mind, it was a temporary glitch.
Given the long history of this issue, I think it is hard to understand what, if anything, it is tracking. I suggest we close it and file another issue to continue discussing additional performance improvements
The issue is very clear: the performance of Query 17 and 18 is still very slow compared to other in-process engines !!
Btw, DF31 is making great progress. My impression is that as long as the data fits in memory, the performance is very similar to DuckDB. Here I am reading Parquet files from Azure storage; the main issue starts with 8 cores, which has 64 GB of RAM: Query 18 crashes the notebook.
BTW I think the core problem here is that DataFusion's parallel hash grouping builds the entire hash table for each input partition -- thus it requires memory proportional to the number of cores. This is tracked more in https://github.com/apache/arrow-datafusion/issues/6937
I spent some time this morning analyzing in detail why Q17 and Q18 are slow (in the context of #6782). My analysis shows we could close most of the gap with a better join order:
The issue seems to be (at least partially) related to FilterExec returning unknown statistics in case of unsupported filter predicates -- I wonder if it would be better / more correct to rely on a worst-case scenario for such filters and simply propagate the input statistics -- that seems to be enough to fix the Q17 plan.
I wonder if it would be better / more correct to rely on a worst-case scenario for such filters and simply propagate the input statistics
Perhaps the filter can simply switch to Precision::Inexact when it can't analyze the selectivity for the expression.
Another thing I have seen in the past is heuristically picking a constant selectivity (e.g. assume the filter keeps 1/2 the rows). However, I think this leads to non-robust plans (sometimes the plans are good, sometimes they are bad, and it is hard to predict when each is hit).
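A minimal sketch of that Precision::Inexact idea, with simplified stand-in types (this is not DataFusion's real Statistics API, just an illustration of propagating the input row count as an inexact worst-case bound rather than dropping it):

```rust
// When a filter cannot estimate selectivity, keep the input row count but mark
// it inexact (a worst-case upper bound) instead of reporting it as unknown.
#[derive(Debug, Clone)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    // Downgrade an exact value to an inexact one; Absent stays Absent.
    fn to_inexact(&self) -> Precision {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Precision::Inexact(*v),
            Precision::Absent => Precision::Absent,
        }
    }
}

fn filter_output_rows(input_rows: &Precision, selectivity: Option<f64>) -> Precision {
    match (input_rows, selectivity) {
        // Known selectivity: scale the estimate, but it is still only inexact.
        (Precision::Exact(n) | Precision::Inexact(n), Some(s)) => {
            Precision::Inexact((*n as f64 * s) as usize)
        }
        // Unknown selectivity: worst case is "nothing filtered", so propagate
        // the input count as an inexact upper bound rather than Absent.
        (rows, None) => rows.to_inexact(),
        (Precision::Absent, Some(_)) => Precision::Absent,
    }
}

fn main() {
    let input = Precision::Exact(6_001_215); // e.g. a lineitem scan's row count
    println!("{:?}", filter_output_rows(&input, None)); // Inexact(6001215)
    println!("{:?}", filter_output_rows(&input, Some(0.01))); // Inexact(60012)
}
```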
I filed https://github.com/apache/arrow-datafusion/issues/8078 with a proposal of a more precise way to represent inexact statistics
Thank you very much for your work. I am happy with DataFusion 33 performance; it now finishes TPCH_SF100 using 64 GB of RAM in Fabric.
I was running TPCH_SF5 just for fun, and I noticed queries 17 and 18 are very slow.
Full reproducible example:
Another issue: when I increase sf to 10, I start getting OOM errors?
https://colab.research.google.com/drive/1WJ2ICxJyAYClkDx8guGX-TcOMnf8SBxr#scrollTo=z494Cl6XKUVX
DataFusion 26 gives the following result against DuckDB: