In collaboration with a prospective dask user, we created a script that generates synthetic data similar to what they would use in a production ETL job; see https://github.com/coiled/imbalanced-join
This data represents a heavily skewed but realistic dataset that dask has struggled with significantly in the past. Adding it to our workflow benchmarks would let us assess whether the situation has improved by now and investigate what still needs to change for this kind of data to be processed successfully and quickly.
As a first increment, we should add it to the benchmarks CI and quickly reassess performance. With the introduction of P2P shuffling I expect very different outcomes (though chances are still high that it could fail).
Note: The datasets are likely already present in s3://test-imbalanced-join/df1 and s3://test-imbalanced-join/df2
See also https://github.com/coiled/benchmarks/issues/652
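For context, the core difficulty is a heavily skewed join key: a few hot keys account for a large share of all rows, so the shuffle partitions holding those keys become disproportionately large. The sketch below illustrates this with a small, self-contained stand-in (it is not the actual generator from coiled/imbalanced-join; the Zipf distribution, sizes, and column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the synthetic generator: a Zipf-distributed
# join key yields a handful of extremely hot keys, which is what makes
# the shuffle partitions imbalanced in a distributed merge.
n = 100_000
keys = rng.zipf(1.5, size=n) % 1_000  # skewed key space (assumed shape)

df1 = pd.DataFrame({"key": keys, "left_val": rng.random(n)})
df2 = pd.DataFrame({"key": np.arange(1_000), "right_val": rng.random(1_000)})

# Every key in df1 has exactly one match in df2, so the inner join
# preserves the row count while inheriting the key skew.
joined = df1.merge(df2, on="key", how="inner")

# The hottest key dominates; this ratio is what stresses a shuffle,
# since all rows for one key must land on a single worker.
counts = df1["key"].value_counts()
print("rows:", len(joined), "hottest key share:", counts.iloc[0] / n)
```

In the real benchmark the same merge would run via dask on the parquet data in the S3 buckets above, where P2P shuffling should change how well those hot partitions are handled.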