coiled / imbalanced-join


Post join scale #1

Open CerebralMastication opened 2 years ago

CerebralMastication commented 2 years ago

Hey all, I was reviewing some notes from our real-life exercise, and my heuristic of "5x size after the join" was wrong.

Here's the original stats:

3.8B records in df1 before the join
52.2B records after joining df1 to df2
21.1B records after the post-join group by

The step that makes this difficult is the many-to-many join that goes from 3.8B to 52.2B records.
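For concreteness, here's a toy sketch of the kind of many-to-many merge that causes the blow-up (the column names and values are made up, not from the real dataset):

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-ins for df1 and df2. The join key repeats on both sides,
# which is what makes the merge many-to-many.
df1 = dd.from_pandas(pd.DataFrame({"key": [1, 1, 2, 2], "a": range(4)}), npartitions=2)
df2 = dd.from_pandas(pd.DataFrame({"key": [1, 1, 1, 2], "b": range(4)}), npartitions=2)

joined = dd.merge(df1, df2, on="key", how="inner")

# Every matching pair of rows produces an output row, so the result can be
# much larger than either input: 2*3 + 2*1 = 8 rows from two 4-row inputs.
print(len(joined))  # 8
```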

rjzamora commented 2 years ago

The step that makes this difficult is the many-to-many join that goes from 3.8B to 52.2B records.

So the challenge in Dask is that the total number of records increases, and that the distribution of output partition sizes is extremely imbalanced. Is that correct? (Sorry if this is stating the obvious.)

CerebralMastication commented 2 years ago

Yes, depending on how the data is partitioned.

The join step works best if the data is partitioned by bucket, because every key has the same number of buckets; that gives an even distribution across the nodes. Even in that situation, the join can get slow/unstable compared to Spark. I don't know enough about the internals of either platform to say why that is.
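I can't speak to the exact bucketing scheme, but the closest standard Dask move is to pay the shuffle once up front by indexing both frames on the join key (or on a bucket derived from it), so the merge itself can align partitions. A rough sketch with toy data:

```python
import pandas as pd
import dask.dataframe as dd

# Toy frames; `key` is the (hypothetical) join column.
df1 = dd.from_pandas(pd.DataFrame({"key": [1, 1, 2, 2], "a": range(4)}), npartitions=2)
df2 = dd.from_pandas(pd.DataFrame({"key": [1, 1, 1, 2], "b": range(4)}), npartitions=2)

# Shuffle each frame once up front so rows with the same key land in the
# same (known) partition range.
df1 = df1.set_index("key")
df2 = df2.set_index("key")

# With sorted indexes and known divisions, the merge can align partitions
# instead of doing another all-to-all shuffle at join time.
joined = dd.merge(df1, df2, left_index=True, right_index=True, how="inner")
print(len(joined))
```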

ncclementi commented 1 year ago

This merged PR includes a notebook that creates the df1 dataset with Coiled, following @CerebralMastication's code.

The datasets are now on S3. They're not public yet, since we might want to keep improving them, but I'd be happy to put them somewhere public if we think this is it.

It also includes a notebook where I run the set_index (step_1) and bigjoin (step_2) operations that were in the original workflow, and I'm including the performance reports too (links can be found in the README). After the bigjoin:

>>> len(ddf_bigjoin.value)
63514149896   # 63.5 billion 

@CerebralMastication Is this on the order of what you expected?
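For anyone following along without the notebook, the two steps look roughly like this (paths, column names, and cluster setup here are placeholders, not the actual notebook code):

```python
import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()  # in the notebook this is a Coiled cluster instead

# Placeholder paths; the real datasets live in a private S3 bucket for now.
df1 = dd.read_parquet("s3://<bucket>/df1/")
df2 = dd.read_parquet("s3://<bucket>/df2/")

# step_1: index both frames on the join key so the big join can align partitions
with performance_report(filename="step_1.html"):
    df1 = df1.set_index("join_key").persist()
    df2 = df2.set_index("join_key").persist()

# step_2: the many-to-many "bigjoin", then count the output rows
with performance_report(filename="step_2.html"):
    ddf_bigjoin = dd.merge(df1, df2, left_index=True, right_index=True)
    print(len(ddf_bigjoin))
```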

CerebralMastication commented 1 year ago

That's certainly in the ballpark of my expectations.

Looking at the output, it also looks like it ripped through it pretty fast. Is the performance on par with our prior work?

Compared to our prior work this is the 1x load. Did we start having issues at 5x? I'm honestly a little fuzzy on the details.

CerebralMastication commented 1 year ago

I'm worried that I didn't get as much skew in the joined results...
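One quick way to check the skew, in case it's useful (just a suggestion, not something from the notebook): count the rows in each output partition and look at the spread.

```python
# Number of rows in every partition of the joined result.
part_sizes = ddf_bigjoin.map_partitions(len).compute()

# A large max/median ratio means a few partitions hold most of the output.
print(part_sizes.min(), part_sizes.median(), part_sizes.max())
```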