apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

`NestedLoopsJoin` memory tracking may be insufficient #8952

Open alamb opened 6 months ago

alamb commented 6 months ago

Is your feature request related to a problem or challenge?

Similarly to https://github.com/apache/arrow-datafusion/issues/7848, @metesynnada noted in https://github.com/apache/arrow-datafusion/pull/8020#issuecomment-1903359773 that it is possible for NestedLoopsJoin to generate a single (very) large RecordBatch. For certain pathological queries this may lead to DataFusion far exceeding its memory limits and erroring out.

Describe the solution you'd like

Implement / adapt the same approach as @korowa did in https://github.com/apache/arrow-datafusion/pull/8020 (❤️ ) to incrementally create join output for joins that match many keys rather than doing it all at once.
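
To illustrate the general idea of producing join output incrementally, here is a minimal, standalone sketch. It is not DataFusion code and not the approach taken in the linked PR: `ChunkedNestedLoopJoin`, its fields, and the `batch_size` parameter are all hypothetical names, and plain `i64` slices stand in for the real input batches. It only shows how a nested loop join can remember its position and emit at most `batch_size` matched index pairs per call instead of materializing every match at once.

```rust
/// Hypothetical stand-in for a nested loop join that yields matched
/// (left_index, right_index) pairs in chunks of at most `batch_size`.
/// Assumption: inputs are plain slices, not Arrow RecordBatches.
struct ChunkedNestedLoopJoin<'a, F>
where
    F: Fn(i64, i64) -> bool,
{
    left: &'a [i64],
    right: &'a [i64],
    predicate: F,
    batch_size: usize,
    // Resume position: the next (left, right) pair to evaluate.
    left_pos: usize,
    right_pos: usize,
}

impl<'a, F> ChunkedNestedLoopJoin<'a, F>
where
    F: Fn(i64, i64) -> bool,
{
    fn new(left: &'a [i64], right: &'a [i64], predicate: F, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { left, right, predicate, batch_size, left_pos: 0, right_pos: 0 }
    }
}

impl<'a, F> Iterator for ChunkedNestedLoopJoin<'a, F>
where
    F: Fn(i64, i64) -> bool,
{
    type Item = Vec<(usize, usize)>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut out = Vec::new();
        while self.left_pos < self.left.len() {
            while self.right_pos < self.right.len() {
                let (l, r) = (self.left_pos, self.right_pos);
                // Advance before a possible early return so the next call
                // resumes at the following pair.
                self.right_pos += 1;
                if (self.predicate)(self.left[l], self.right[r]) {
                    out.push((l, r));
                    if out.len() == self.batch_size {
                        // Output limit reached: hand back a bounded chunk.
                        return Some(out);
                    }
                }
            }
            // Finished one left row; restart the scan of the right side.
            self.left_pos += 1;
            self.right_pos = 0;
        }
        // Inputs exhausted: emit the final partial chunk, if any.
        if out.is_empty() { None } else { Some(out) }
    }
}
```

With a predicate that always matches (a cross join), each call to `next()` holds at most `batch_size` pairs in memory, whereas collecting all matches up front would hold `left.len() * right.len()` pairs. In a real operator the index pairs would presumably be turned into an output batch and the chunk size tied to the configured batch size, but that wiring is outside what this sketch assumes.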

Describe alternatives you've considered

No response

Additional context

No response

yyy1000 commented 6 months ago

I'd like to try to help with this. :)

alamb commented 6 months ago

This one may be tricky, FWIW. The join code is not simple.

yyy1000 commented 6 months ago

Aha, that seems true. Maybe I can leave it for now and find something less difficult. I think I could fix it once I get more familiar with the code. ❤️