Open andygrove opened 2 months ago
EDIT: I must have messed something up. Looking at the query plan, my test run did not trigger the BloomMightContain logic. I guess I missed the part in the ticket saying that broadcast hash joins are disabled.
I had a quick look at this in a profiler, and to me it did not look like much time was spent in the BloomFilterMightContain related code. The things that stood out as taking significant time were the copying of data due to unpacking of dictionaries during the scan operation here: https://github.com/eejbyfeldt/datafusion-comet/blob/a99f7428398793507b31188c8919e4cf128d8d38/native/core/src/execution/operators/scan.rs#L353-L370 and the copying done in the FilterExec, which sounds similar to what is discussed in https://github.com/apache/datafusion-comet/issues/808
So maybe this ticket should be replaced with one about removing the unpacking of dictionaries in the scan operator.
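To illustrate why unpacking dictionaries can show up as copy time in a profile, here is a minimal std-only sketch. It is not the Comet or Arrow code: `DictColumn`, `unpack`, and `string_bytes` are hypothetical names made up for this example, and the real scan operator works on Arrow arrays rather than `Vec<String>`. The point is only that a dictionary stores each distinct value once, while the unpacked form holds one copy per row.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column:
/// each row stores a small integer key into a shared `values` table.
struct DictColumn {
    keys: Vec<u32>,
    values: Vec<String>,
}

/// Unpacking materializes one owned value per row,
/// so every row pays for a copy of its value.
fn unpack(col: &DictColumn) -> Vec<String> {
    col.keys
        .iter()
        .map(|&k| col.values[k as usize].clone())
        .collect()
}

/// Total number of string bytes held by a slice of values.
fn string_bytes(values: &[String]) -> usize {
    values.iter().map(|s| s.len()).sum()
}

fn main() {
    // Six rows, but only two distinct values (illustrative data).
    let col = DictColumn {
        keys: vec![0, 1, 0, 0, 1, 0],
        values: vec!["AIR".to_string(), "TRUCK".to_string()],
    };
    let unpacked = unpack(&col);
    println!("dictionary bytes: {}", string_bytes(&col.values));
    println!("unpacked bytes:   {}", string_bytes(&unpacked));
}
```

With low-cardinality columns the unpacked size grows with the row count while the dictionary stays fixed, which is roughly why keeping the data dictionary-encoded through the scan might avoid the copies seen in the profile.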
What is the problem the feature request solves?
Comet currently performs poorly with the following query when broadcast hash joins are disabled and when Comet native shuffle is disabled.
Benchmark results (running against sf=100 dataset):
Sample native plan with metrics:
Describe the potential solution
No response
Additional context
No response