apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add MemoryReservation to batch splitting in joins #13003

Open alamb opened 1 month ago

alamb commented 1 month ago

Is your feature request related to a problem or challenge?

Follow on to https://github.com/apache/datafusion/pull/12969 and https://github.com/apache/datafusion/issues/12633

In https://github.com/apache/datafusion/issues/12633 @mhilton noted that joins sometimes generate giant record batches, which causes issues. @alihan-synnada fixed this in https://github.com/apache/datafusion/pull/12969, but internally the joins sometimes still generate giant output batches.

As @mhilton says in https://github.com/apache/datafusion/pull/12969#issuecomment-2418862655

Unfortunately this doesn't address the actual problem with creating giant batches, which is they require a lot of memory and that memory isn't accounted for in any MemoryPool. Wiring a MemoryReservation into BatchSplitter would probably be enough to address this though.

Describe the solution you'd like

I would like the memory accounting to take the large output batch into account.

Describe alternatives you've considered

Wiring a MemoryReservation into BatchSplitter would probably be enough to address this.
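To make the idea concrete, here is a rough, std-only sketch of what "wiring a reservation into a splitter" could look like. Everything in it (ToyPool, ToyReservation, ToySplitter, and the Vec<u64> standing in for a record batch) is a hypothetical stand-in invented for illustration. It is not DataFusion's MemoryPool, MemoryReservation, or BatchSplitter API, and the real change may look quite different.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Toy stand-in for a memory pool with a fixed byte limit (hypothetical type).
pub struct ToyPool {
    limit: usize,
    used: AtomicUsize,
}

impl ToyPool {
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { limit, used: AtomicUsize::new(0) })
    }

    /// Bytes currently reserved across all reservations.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::SeqCst)
    }
}

/// Toy reservation: tracks bytes claimed from the pool, returns them on drop.
pub struct ToyReservation {
    pool: Arc<ToyPool>,
    size: usize,
}

impl ToyReservation {
    pub fn new(pool: Arc<ToyPool>) -> Self {
        Self { pool, size: 0 }
    }

    /// Claim `bytes` more, failing if the pool would exceed its limit.
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let limit = self.pool.limit;
        let result = self
            .pool
            .used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|total| *total <= limit)
            });
        match result {
            Ok(_) => {
                self.size += bytes;
                Ok(())
            }
            Err(used) => Err(format!(
                "cannot reserve {} bytes: {} of {} already used",
                bytes, used, limit
            )),
        }
    }

    /// Hand every byte held by this reservation back to the pool.
    pub fn free(&mut self) {
        self.pool.used.fetch_sub(self.size, Ordering::SeqCst);
        self.size = 0;
    }
}

impl Drop for ToyReservation {
    fn drop(&mut self) {
        self.free();
    }
}

/// Toy splitter: owns one large "batch" and yields chunks of at most
/// `batch_size` rows. The large batch is accounted for while it is alive;
/// the small output chunks are deliberately not tracked in this sketch.
pub struct ToySplitter {
    rows: Vec<u64>,
    offset: usize,
    batch_size: usize,
    reservation: ToyReservation,
}

impl ToySplitter {
    pub fn try_new(
        rows: Vec<u64>,
        batch_size: usize,
        mut reservation: ToyReservation,
    ) -> Result<Self, String> {
        assert!(batch_size > 0, "batch_size must be non-zero");
        // Account for the large batch before holding on to it.
        reservation.try_grow(rows.len() * std::mem::size_of::<u64>())?;
        Ok(Self { rows, offset: 0, batch_size, reservation })
    }
}

impl Iterator for ToySplitter {
    type Item = Vec<u64>;

    fn next(&mut self) -> Option<Vec<u64>> {
        if self.offset >= self.rows.len() {
            // Fully consumed: drop the large buffer and release its bytes.
            self.rows = Vec::new();
            self.offset = 0;
            self.reservation.free();
            return None;
        }
        let end = (self.offset + self.batch_size).min(self.rows.len());
        let chunk = self.rows[self.offset..end].to_vec();
        self.offset = end;
        Some(chunk)
    }
}
```

With something shaped like this, a join that builds an oversized batch would either get an error from the pool up front or have the bytes counted until the splitter has been drained, rather than the memory being invisible to the pool.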

Additional context

No response

jatin510 commented 1 month ago

Can I work on this task, @alamb?

alamb commented 1 month ago

@jatin510 of course -- see the guide here https://datafusion.apache.org/contributor-guide/index.html#open-contribution-and-assigning-tickets !

jatin510 commented 1 month ago

take