Open randerzander opened 3 years ago
I believe it would also be useful to add join hints so that users can explicitly route a join through the broadcast_join path in dask-sql.
The syntax could be similar to Spark's join hints: https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-hints.html#join-hints
On the Dask side, the implementation would use the broadcast parameter of the Dask merge API: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.merge.html?highlight=merge#dask-dataframe-dataframe-merge
I can handle this one
In SQL, it's common to work with large data and aggregate or filter it down to few enough rows that the result could be merged into a single partition in memory.
Today you can achieve this with something like:
It would be nice to support something like:
As a motivator, Dask DataFrames can use a broadcast or "map-side" join if one of the DataFrames consists of a single partition. Allowing users to specify partition coalescing hints would give finer control over Dask-SQL join performance.