Open FelixYBW opened 10 months ago
For the 3rd solution, it depends on the stability of our whole stage fallback policy. I will investigate the feasibility. Thanks!
@PHILO-HE have you created a draft PR for this? any insight why the stage level resource manager can't be used by sparkSQL directly?
Also curious if this can be done in Apache Spark itself, there is some relevant discussion in this SPIP.
@zhli1142015 Here is the related talk. Could you @ the guy so we can sync all here?
Description
As Gluten currently can't fully cover spark functions, we will have to fallback some operators to Vanilla Spark, which leads an conflict of offheap and onheap memory conflict. Currently the suggested solution is that to those queries have both fallback and native operators, we need to config a large offheap memory and a large onheap memory, then the executor number should be decreased due to memory size constraint.
The other solution is to spill offheap memory to onheap memory which we have implemented but performance isn't good, it's not used.
The 3rd solution is to use Spark3.0's stage level resource management, when we detect a stage has fallback, we can fallback the full stage, then restart the executors with high onheap memory.
@zhouyuan @PHILO-HE