phofl opened this issue 1 year ago
> It looks like we are consistently a little bit slower on shuffle-based workloads
I would expect things to be a bit slower when large graphs are involved, because we're shipping the concrete graph while dask.dataframe is shipping an HLG. I'd expect that problem to solve itself over time as we ship exprs instead (probably waiting for Florian to come back for that?).
> but this could be because of some time-series problem
I'd also expect this to solve itself when we ship exprs.
> The h2o results look mostly good; filter is a bit weird
I'd say they seem good but not great. At this point I'd probably run them both live and look at a dask dashboard to get a better understanding of what's taking up most of the time.
I can do that but I don't expect much. The code already subselects the columns that we need, so I don't expect a big performance improvement there.
Ah, right. I was looking at the tests and not seeing anything in the `columns=` keyword, but I forgot that dask.dataframe applies an optimization when a column selection is the very first operation:
```python
def test_q3(ddf):
    ddf = ddf[["id3", "v1", "v3"]]  # <---------- this line is atypical
    (
        ddf.groupby("id3", dropna=False, observed=True)
        .agg({"v1": "sum", "v3": "mean"})
        .compute()
    )
```
Maybe we remove those lines from benchmarks and see how things compare?
I am currently generating data for tpch at 100GB and 200GB; I assume this will give us a clearer picture, since the h2o benchmarks are relatively short.
👍
> Maybe we remove those lines from benchmarks and see how things compare?
Dask shouldn't need the immediate column projection anymore. The groupby-aggregation code will add it anyway.
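As a quick sanity check of that claim, here is a pandas-level sketch with made-up toy data (column and key names follow the test above; `observed=True` is omitted since these toy columns aren't categorical) showing that a dict-based groupby-agg only touches the columns it names, so the up-front projection doesn't change the result:

```python
import pandas as pd

# Toy stand-in for the h2o data: note the extra "v2" column that Q3 never uses.
df = pd.DataFrame({
    "id3": ["a", "b", "a", "b"],
    "v1": [1, 2, 3, 4],
    "v2": [9, 9, 9, 9],
    "v3": [1.0, 2.0, 3.0, 4.0],
})

agg = {"v1": "sum", "v3": "mean"}

# With the manual column projection (as in test_q3)...
with_projection = df[["id3", "v1", "v3"]].groupby("id3", dropna=False).agg(agg)

# ...and without it: the dict-based agg already restricts the output columns.
without_projection = df.groupby("id3", dropna=False).agg(agg)

assert with_projection.equals(without_projection)
```

The question on the dask side is only whether the projection is pushed down into the read, which is what the optimizer is supposed to handle.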
Here is a successful TPCH benchmark run:
https://github.com/coiled/benchmarks/actions/runs/6239847808
Results are very promising
Are these the right charts to look at?
If so, that's a lot of blue.
I'll be curious to see how it compares to other projects, though. We know that dask dataframe is bad at many of these benchmarks. Comparing against Spark or Polars will, I suspect, provide a more critical viewpoint (not to stop us from celebrating a big win here, though).
Yep those are correct :)
The peak memory charts show similar results btw.
Polars will be terrible at this; their cloud connectors aren't very good yet, which makes them very slow at reading our parquet files. But benchmarking against Spark makes sense.
> Polars will be terrible at this, their cloud connectors aren't very good yet, which makes them very slow at reading our parquet files
It would be interesting to find something comparable; maybe that means running dask-expr against the datasets they use locally on single machines for their benchmarks.
Would you run this on a huge machine? That would remove the shuffling component completely, which makes interpreting the results hard.
I guess I'm curious how dask-expr would compare against Polars on Polars' home turf, which I'm guessing is well-represented by the context at https://www.pola.rs/benchmarks.html. I'm guessing that this is on a much smaller dataset. I'm also guessing that Polars will do quite a bit better. However, I'm hopeful that we can get to a point where we're a constant factor worse (2-5x?). I suspect that cases where we're much, much worse will highlight optimizations that we have not yet considered.
Not a huge deal though.
> Would you run this on a huge machine?
This also isn't unreasonable. Crafting a "Polars Marketing" hat for myself, I'll say: "why go through the bother of a distributed-memory cluster when you can solve really big problems with a combination of streaming algorithms and big machines?"
I'll give it a shot (probably not before my vacation though). It probably depends mostly on how well Polars scales when it has a lot of cores available.
I think we should also benchmark with a local Dask cluster on a big machine then.
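For reference, the single-machine setup could look something like this (a sketch; worker and thread counts are illustrative, not tuned for any particular box):

```python
from dask.distributed import Client, LocalCluster

# All workers live on one machine, so shuffles move data over localhost
# (or entirely in-process here) rather than the network.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, processes=False)
client = Client(cluster)
```

That keeps the scheduler and P2P code paths in play while removing the network component, which should make the comparison with the big-machine Polars runs more apples-to-apples.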
> (probably not before my vacation though)
Not a big rush. I also don't have enough conviction here to prioritize this above other things.
Here is a run with 1TB:
https://github.com/coiled/benchmarks/actions/runs/6275077913
Couple of observations:
Summarising: we are not only faster, we can also deal with bigger datasets much more easily.
That's awesome. Two questions come to mind:
10TB for dask-expr should be relatively easy; dask/dask would require a pretty big cluster (I am guessing here, I didn't try that yet) and more disk space than I was using.
I'd recommend these two:
That's from the run that created the benchmarks
Same question, but s/10TB/100TB/
I think we run into problems with query 4 at some point for dask-expr as well. There is a drop_duplicates in there that's inefficient at the moment. Apart from that, it shouldn't be too hard.
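For context, the pattern in query 4 is roughly a semi-join expressed as a drop_duplicates on the join keys followed by a merge. A pandas-level sketch with invented toy frames (table and column names follow the TPC-H schema, but the data is made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "o_orderkey": [1, 2, 3],
    "o_orderpriority": ["1-URGENT", "2-HIGH", "3-MEDIUM"],
})
lineitem = pd.DataFrame({"l_orderkey": [1, 1, 3, 3, 3]})

# Semi-join: keep orders with at least one matching lineitem row.
# Deduplicating the keys first keeps the merge from multiplying rows;
# that drop_duplicates is the currently-inefficient step at scale.
keys = lineitem[["l_orderkey"]].drop_duplicates()
result = orders.merge(keys, left_on="o_orderkey", right_on="l_orderkey")
```

At TPC-H scale the dedup runs over the largest table, so an inefficient drop_duplicates dominates the query.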
The tpch data generator failed when I tried creating a set for 100TB though, so I don't know if that's feasible or if we have to look for something else.
I would also welcome thoughts from folks on how to optimize this further. HashJoins take up more than half the time. Some thoughts:
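For readers less steeped in join internals, the per-row work of a hash join is cheap; the cost at cluster scale is getting matching keys onto the same worker. A minimal single-threaded sketch (a hypothetical helper, with plain Python dicts standing in for rows):

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Toy hash join: build a hash table on one side, probe with the other.

    In a distributed setting both sides must first be shuffled (or the small
    side broadcast) so that matching keys land on the same worker; that data
    movement, not this inner loop, is where most of the time goes.
    """
    table = defaultdict(list)
    for row in right:  # build phase
        table[row[key]].append(row)
    out = []
    for row in left:  # probe phase
        for match in table.get(row[key], []):
            out.append({**row, **match})
    return out
```

This is why broadcast joins and join-order choices matter so much here: they change how much data has to move before this cheap build/probe step can run.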
Maybe some of this waits for @hendrikmakait to come back from vacation, but pinging @fjetter in case he has ideas that could be explored by folks in the meantime.
I think the joins are already ordered as efficiently as possible in that example.
I'll try this with pandas 2.1.1 again. I had a couple of PRs that should reduce GIL contention generally speaking; I'm not sure if this is helpful in our case.
> I had a couple of PRs that should reduce GIL contention generally speaking
My guess was that the GIL contention was due to P2P. I don't know enough to have confidence though.
Could very well be the case. But it could also be partially related to merge in pandas, which didn't release the GIL before 2.1 if you were merging on string columns. But that's mostly guessing.
OK, cool.
Just FYI: we are now running the tpch queries over in coiled-benchmarks with dask-expr instead of dask/dask. I want to have some guarantee that we don't introduce performance regressions.
@phofl I'm curious, when we last spoke you mentioned that you were going to look at why some merge queries were slow. You suggested that using broadcast joins is probably the reason. Any progress there?
A new run is here: https://github.com/coiled/benchmarks/actions/runs/5957150408
It looks like we are consistently a little bit slower on shuffle-based workloads; I wouldn't look into this right now. The h2o results look mostly good. Filter is a bit weird, but this could be because of some time-series problem. I will look into that one.
Thoughts here? @mrocklin @rjzamora