apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

Stage level resource management to handle offheap/onheap memory conflict #4392

Open FelixYBW opened 10 months ago

FelixYBW commented 10 months ago

Description

As Gluten currently can't fully cover Spark functions, we have to fall back some operators to vanilla Spark, which leads to a conflict between offheap and onheap memory. The currently suggested solution is that, for queries that have both fallback and native operators, we configure a large offheap memory and a large onheap memory. The number of executors then has to be decreased because of the memory size constraint.
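To make the executor-count trade-off concrete, here is a rough sketch of the arithmetic. All sizes (node memory, heap, offheap, overhead) are hypothetical figures chosen for illustration, not measurements from Gluten. The only real names are the Spark config keys `spark.executor.memory`, `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size`. It assumes the per-executor request is roughly onheap + offheap + overhead.

```python
# All sizes are in GiB and all figures are hypothetical.

def executor_footprint(onheap, offheap, overhead):
    """Approximate memory one executor requests from the cluster manager."""
    return onheap + offheap + overhead


def executors_per_node(node_memory, onheap, offheap, overhead):
    """How many such executors fit on one node (integer division)."""
    return node_memory // executor_footprint(onheap, offheap, overhead)


def spark_conf(onheap, offheap):
    """Spark settings that correspond to the chosen onheap/offheap split."""
    return {
        "spark.executor.memory": f"{onheap}g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap}g",
    }


NODE_MEMORY = 256  # hypothetical worker node

# Fully offloaded query: small onheap, large offheap.
native_only = executors_per_node(NODE_MEMORY, onheap=4, offheap=24, overhead=4)

# Query with fallback operators: both onheap and offheap must be large.
mixed = executors_per_node(NODE_MEMORY, onheap=24, offheap=24, overhead=4)

print(native_only, mixed)
print(spark_conf(onheap=24, offheap=24))
```

With these assumed numbers, sizing every executor for the mixed case roughly halves the number of executors that fit on a node, which is the cost the suggested solution pays.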

The other solution is to spill offheap memory to onheap memory. We have implemented this, but the performance isn't good, so it's not used.

The 3rd solution is to use the stage-level resource management introduced in Spark 3.1: when we detect that a stage has a fallback, we can fall back the full stage, then restart the executors with high onheap memory.
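A minimal sketch of the decision this 3rd solution describes, written as plain Python rather than against Spark. The `Profile` class, the two profiles, their sizes and the operator names are all hypothetical. In Spark itself a profile would presumably be built with `ResourceProfileBuilder` and attached through `RDD.withResources`. How Gluten would wire that into SQL planning is not settled in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for a per-stage executor resource profile."""
    name: str
    onheap_gb: int
    offheap_gb: int


# Hypothetical sizes: native stages favor offheap, fallback stages favor onheap.
NATIVE = Profile("native", onheap_gb=4, offheap_gb=24)
FALLBACK = Profile("fallback", onheap_gb=24, offheap_gb=4)


def choose_profile(stage_ops, native_ops):
    """Pick a profile for one stage.

    If any operator in the stage cannot run natively, the whole stage
    falls back and gets the high-onheap profile. Otherwise every operator
    is offloaded and the stage gets the high-offheap profile.
    Returns (profile, operators_to_offload).
    """
    has_fallback = any(op not in native_ops for op in stage_ops)
    if has_fallback:
        return FALLBACK, []
    return NATIVE, list(stage_ops)


supported = {"Scan", "Filter", "HashAggregate"}  # hypothetical native coverage
print(choose_profile(["Scan", "Filter", "HashAggregate"], supported))
print(choose_profile(["Scan", "PythonUDF", "HashAggregate"], supported))
```

The point of falling back the full stage is that each stage then needs only one kind of memory to be large, so no executor has to be sized for both at once.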

@zhouyuan @PHILO-HE

PHILO-HE commented 10 months ago

For the 3rd solution, it depends on the stability of our whole stage fallback policy. I will investigate the feasibility. Thanks!

FelixYBW commented 8 months ago

@PHILO-HE have you created a draft PR for this? Any insight into why the stage-level resource manager can't be used by Spark SQL directly?

SinghAsDev commented 8 months ago

Also curious whether this can be done in Apache Spark itself. There is some relevant discussion in this SPIP.

FelixYBW commented 8 months ago

@zhli1142015 Here is the related talk. Could you @ the person so we can all sync here?