coiled / benchmarks

Improve integration testing of joining small and large dataframe #669

Open hendrikmakait opened 1 year ago

hendrikmakait commented 1 year ago

test_join_big_small materializes the small dataframe before merging, which circumvents a distributed join.

https://github.com/coiled/coiled-runtime/blob/ef1fe4e983afc29ba80b8adbb056cefc27788e04/tests/benchmarks/test_join.py#L56-L58

While this is a reasonable thing for a user to do if they know the size of their data and want to optimize their code, we should also test the performance without an early materialization.

ncclementi commented 1 year ago

As I recall, the intent of this test was to make sure that joining a Dask DataFrame with a pandas DataFrame is fast. I'm not sure this is the right way to achieve that, but I wanted to give a heads-up about the original intention behind it.

hendrikmakait commented 1 year ago

Thanks for the additional context, @ncclementi. I think that's a valid case, but in that case we should probably either rename the test or cover more variants (i.e., broadcasts, tasks, and p2p).