phofl opened this issue 1 year ago
> It looks like we are consistently a little bit slower on shuffle-based workloads
I would expect things to be a bit slower when large graphs are involved, because we're shipping the concrete graph while dask.dataframe is shipping an HLG. I'd expect that problem to solve itself over time as we ship exprs instead (probably waiting for Florian to come back for that?).
> but this could be because of some time-series problem
I'd also expect this to solve itself when we ship exprs.
> The h2o results look mostly good; filter is a bit weird
I'd say they seem good but not great. At this point I'd probably run them both live and look at a dask dashboard to get a better understanding of what's taking up most of the time.
I can do that but I don't expect much. The code already subselects the columns that we need, so I don't expect a big performance improvement there.
Ah, right. I was looking at the tests and not seeing anything in the `columns=` keyword, but I forgot that dask.dataframe applies an optimization when a column selection is the very first operation:
```python
def test_q3(ddf):
    ddf = ddf[["id3", "v1", "v3"]]  # <---------- this line is atypical
    (
        ddf.groupby("id3", dropna=False, observed=True)
        .agg({"v1": "sum", "v3": "mean"})
        .compute()
    )
```
Maybe we remove those lines from benchmarks and see how things compare?
I am currently generating data for tpch at 100GB and 200GB; I assume this will give us a clearer picture, since the h2o benchmarks are relatively short.
👍
> Maybe we remove those lines from benchmarks and see how things compare?
Dask shouldn't need the immediate column projection anymore. The groupby-aggregation code will add it anyway.
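As a quick sanity check of that claim, here is a pandas-level sketch with made-up toy data (column and key names follow the test above; `observed=True` is omitted since these toy columns aren't categorical) showing that a dict-based groupby-agg only touches the columns it names, so the up-front projection doesn't change the result:

```python
import pandas as pd

# Toy stand-in for the h2o data: note the extra "v2" column that Q3 never uses.
df = pd.DataFrame({
    "id3": ["a", "b", "a", "b"],
    "v1": [1, 2, 3, 4],
    "v2": [9, 9, 9, 9],
    "v3": [1.0, 2.0, 3.0, 4.0],
})

agg = {"v1": "sum", "v3": "mean"}

# With the manual column projection (as in test_q3)...
with_projection = df[["id3", "v1", "v3"]].groupby("id3", dropna=False).agg(agg)

# ...and without it: the dict-based agg already restricts the output columns.
without_projection = df.groupby("id3", dropna=False).agg(agg)

assert with_projection.equals(without_projection)
```

The question on the dask side is only whether the projection is pushed down into the read, which is what the optimizer is supposed to handle.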
Here is a successful TPCH benchmark run:
https://github.com/coiled/benchmarks/actions/runs/6239847808
Results are very promising
Are these the right charts to look at?
If so, that's a lot of blue.
I'll be curious to see how it compares to other projects, though. We know that dask dataframe is bad at many of these benchmarks. Comparing against Spark or Polars will, I suspect, provide a more critical viewpoint (not to stop us from celebrating a big win here, though).
Yep those are correct :)
The peak memory charts show similar results btw.
Polars will be terrible at this; their cloud connectors aren't very good yet, which makes them very slow at reading our parquet files. But benchmarking against Spark makes sense.
> Polars will be terrible at this, their cloud connectors aren't very good yet, which makes them very slow at reading our parquet files
It would be interesting to find something comparable; maybe that means running dask-expr against the datasets they use locally on single machines for their benchmarks.
Would you run this on a huge machine? That would remove the shuffling component completely, which makes interpreting the results hard.
I guess I'm curious how dask-expr would compare against Polars on Polars' home turf, which I'm guessing is well-represented by the context at https://www.pola.rs/benchmarks.html. I'm guessing that this is on a much smaller dataset. I'm also guessing that Polars will do quite a bit better. However, I'm hopeful that we can get to a point where we're a constant factor worse (2-5x?). I suspect that cases where we're much, much worse will highlight optimizations that we have not yet considered.
Not a huge deal though.
> Would you run this on a huge machine?
This also isn't unreasonable. Crafting a "Polars Marketing" hat for myself, I'll say: "why go through the bother of a distributed-memory cluster when you can solve really big problems with a combination of streaming algorithms and big machines?"
I'll give it a shot (probably not before my vacation though). It probably depends mostly on how well Polars scales when it has a lot of cores available.
I think we should also benchmark with a local Dask cluster on a big machine then.
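For reference, the single-machine setup could look something like this (a sketch; worker and thread counts are illustrative, not tuned for any particular box):

```python
from dask.distributed import Client, LocalCluster

# All workers live on one machine, so shuffles move data over localhost
# (or entirely in-process here) rather than the network.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, processes=False)
client = Client(cluster)
```

That keeps the scheduler and P2P code paths in play while removing the network component, which should make the comparison with the big-machine Polars runs more apples-to-apples.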
> (probably not before my vacation though)
Not a big rush. I also don't have enough conviction here to prioritize this above other things.
Here is a run with 1TB:
https://github.com/coiled/benchmarks/actions/runs/6275077913
Couple of observations:
Summarising: we are not only faster, we can also deal with bigger datasets much more easily.
That's awesome. Two questions come to mind:
10TB for dask-expr should be relatively easy; dask/dask would require a pretty big cluster (I am guessing here, I didn't try that yet) and more disk space than I was using.
I'd recommend these two:
That's from the run that created the benchmarks
Same question, but s/10TB/100TB/
I think we run into problems with query 4 at some point for dask-expr as well. There is a drop_duplicates in there that's inefficient at the moment. Apart from that, it shouldn't be too hard.
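For context, the pattern in query 4 is roughly a semi-join expressed as a drop_duplicates on the join keys followed by a merge. A pandas-level sketch with invented toy frames (table and column names follow the TPC-H schema, but the data is made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "o_orderkey": [1, 2, 3],
    "o_orderpriority": ["1-URGENT", "2-HIGH", "3-MEDIUM"],
})
lineitem = pd.DataFrame({"l_orderkey": [1, 1, 3, 3, 3]})

# Semi-join: keep orders with at least one matching lineitem row.
# Deduplicating the keys first keeps the merge from multiplying rows;
# that drop_duplicates is the currently-inefficient step at scale.
keys = lineitem[["l_orderkey"]].drop_duplicates()
result = orders.merge(keys, left_on="o_orderkey", right_on="l_orderkey")
```

At TPC-H scale the dedup runs over the largest table, so an inefficient drop_duplicates dominates the query.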
The tpch data generator failed when I tried creating a set for 100TB though, so I don't know if that's feasible or if we have to look for something else.
I would also welcome thoughts from folks on how to optimize this further. HashJoins take up more than half the time. Some thoughts:
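For readers less steeped in join internals, the per-row work of a hash join is cheap; the cost at cluster scale is getting matching keys onto the same worker. A minimal single-threaded sketch (a hypothetical helper, with plain Python dicts standing in for rows):

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Toy hash join: build a hash table on one side, probe with the other.

    In a distributed setting both sides must first be shuffled (or the small
    side broadcast) so that matching keys land on the same worker; that data
    movement, not this inner loop, is where most of the time goes.
    """
    table = defaultdict(list)
    for row in right:  # build phase
        table[row[key]].append(row)
    out = []
    for row in left:  # probe phase
        for match in table.get(row[key], []):
            out.append({**row, **match})
    return out
```

This is why broadcast joins and join-order choices matter so much here: they change how much data has to move before this cheap build/probe step can run.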
Maybe some of this waits for @hendrikmakait to come back from vacation, but pinging @fjetter in case he has ideas that could be explored by folks in the meantime.
I think the joins are already ordered as efficiently as possible in that example.
I'll try this with pandas 2.1.1 again. I had a couple of PRs that should reduce GIL contention generally speaking; I'm not sure if this is helpful in our case.
> I had a couple of PRs that should reduce GIL contention generally speaking
My guess was that the GIL contention was due to P2P. I don't know enough to have confidence though.
Could very well be the case. But it could also be partially related to merge in pandas, which didn't release the GIL before 2.1 if you were merging on string columns. But that's mostly guessing.
OK, cool.
Just FYI: we are now running the tpch queries over in coiled-benchmarks with dask-expr instead of dask/dask. I want to have some guarantee that we don't introduce performance regressions.
@phofl I'm curious, when we last spoke you mentioned that you were going to look at why some merge queries were slow. You suggested that using broadcast joins is probably the reason. Any progress there?
A new run is here: https://github.com/coiled/benchmarks/actions/runs/5957150408
It looks like we are consistently a little bit slower on shuffle-based workloads; I wouldn't look into this right now. The h2o results look mostly good. Filter is a bit weird, but this could be because of some time-series problem. I will look into that one.
Thoughts here? @mrocklin @rjzamora