Open hendrikmakait opened 1 year ago
What I recall about this test is that we wanted to make sure that joining a Dask DataFrame with a pandas DataFrame was fast. I'm not sure this is the right way to achieve that, but I wanted to give a heads-up about the original intention behind it.
Thanks for the additional context, @ncclementi. I think that's a valid case, but we should probably rename the test or check for more variants then (i.e., broadcasts, tasks, and p2p).
`test_join_big_small` materializes the small dataframe before merging, which circumvents a distributed join.
https://github.com/coiled/coiled-runtime/blob/ef1fe4e983afc29ba80b8adbb056cefc27788e04/tests/benchmarks/test_join.py#L56-L58
While this is a reasonable thing for a user to do if they know the size of their data and want to optimize their code, we should also test the performance without an early materialization.