In collaboration with a prospective dask user, we created a script that generates synthetic data similar to what they would use in a production ETL job; see https://github.com/coiled/imbalanced-join
This data represents a heavily skewed but realistic dataset that dask has struggled with significantly in the past. Adding it to our workflow benchmarks would let us assess whether the situation has improved by now and investigate what still needs to change for this kind of data to be processed successfully and quickly.
As a first increment, we should add it to the benchmarks CI and quickly reassess performance. With the introduction of P2P shuffling I expect very different outcomes (though chances are still high that it could fail).
Note: The datasets are likely already present in s3://test-imbalanced-join/df1 and s3://test-imbalanced-join/df2
See also https://github.com/coiled/benchmarks/issues/652
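For context, the core difficulty is a heavily skewed join key: a few hot keys account for a large share of all rows, so the shuffle partitions holding those keys become disproportionately large. The sketch below illustrates this with a small, self-contained stand-in (it is not the actual generator from coiled/imbalanced-join; the Zipf distribution, sizes, and column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the synthetic generator: a Zipf-distributed
# join key yields a handful of extremely hot keys, which is what makes
# the shuffle partitions imbalanced in a distributed merge.
n = 100_000
keys = rng.zipf(1.5, size=n) % 1_000  # skewed key space (assumed shape)

df1 = pd.DataFrame({"key": keys, "left_val": rng.random(n)})
df2 = pd.DataFrame({"key": np.arange(1_000), "right_val": rng.random(1_000)})

# Every key in df1 has exactly one match in df2, so the inner join
# preserves the row count while inheriting the key skew.
joined = df1.merge(df2, on="key", how="inner")

# The hottest key dominates; this ratio is what stresses a shuffle,
# since all rows for one key must land on a single worker.
counts = df1["key"].value_counts()
print("rows:", len(joined), "hottest key share:", counts.iloc[0] / n)
```

In the real benchmark the same merge would run via dask on the parquet data in the S3 buckets above, where P2P shuffling should change how well those hot partitions are handled.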