coiled / benchmarks


Imbalance join workflow #882

Open fjetter opened 1 year ago

fjetter commented 1 year ago

See also https://github.com/coiled/benchmarks/issues/652

In collaboration with a prospective Dask user, we created a script that generates synthetic data similar to what they would use in a production ETL job; see https://github.com/coiled/imbalanced-join

This data represents a rather skewed but realistic dataset of the kind Dask has struggled with significantly in the past. Adding it to our workflow benchmarks would let us assess whether the situation has improved and investigate what still needs to change for this kind of data to be processed successfully and quickly.
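To make the skew problem concrete, here is a minimal, standard-library-only sketch of why a hash-partitioned join struggles when one key dominates. It does not use the real dataset or Dask itself: the key distribution, the `hot_fraction` parameter, and the modulo-based partitioning are all illustrative assumptions standing in for whatever the actual data and shuffle implementation do.

```python
import random
from collections import Counter


def make_skewed_keys(n_rows, n_keys, hot_fraction, seed=0):
    """Return n_rows join keys where roughly `hot_fraction` of rows share key 0.

    Hypothetical generator; the real imbalanced-join data may be skewed differently.
    """
    rng = random.Random(seed)
    return [
        0 if rng.random() < hot_fraction else rng.randrange(1, n_keys)
        for _ in range(n_rows)
    ]


def partition_sizes(keys, n_partitions):
    """Count rows per output partition when rows are routed by key modulo n_partitions.

    Simplified stand-in for hash partitioning: all rows with the same key
    always land in the same partition.
    """
    counts = Counter(key % n_partitions for key in keys)
    return [counts.get(p, 0) for p in range(n_partitions)]


keys = make_skewed_keys(n_rows=100_000, n_keys=10_000, hot_fraction=0.5)
sizes = partition_sizes(keys, n_partitions=16)

# Ratio of the largest partition to the mean partition size.
imbalance = max(sizes) / (sum(sizes) / len(sizes))
print(f"largest partition is {imbalance:.1f}x the mean size")
```

Because every row with the hot key is routed to the same partition, one worker ends up holding a disproportionate share of the data no matter how many partitions are used, which is the kind of memory pressure a skewed join can produce.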

As a first increment, we should add it to the benchmarks CI and quickly reassess performance. With the introduction of P2P shuffling I expect very different outcomes (though chances are still high that it could fail).

Note: The datasets are likely already present in s3://test-imbalanced-join/df1 and s3://test-imbalanced-join/df2

ncclementi commented 1 year ago

> Note: The datasets are likely already present in s3://test-imbalanced-join/df1 and s3://test-imbalanced-join/df2

Yes, they are. Ping me if you need any extra context on this.