apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] VeloxBackend should know whether it runs in executor or driver #7837

Closed leoluan2009 closed 1 week ago

leoluan2009 commented 2 weeks ago

Description

VeloxBackend should know where it runs: on the executor or on the driver. For example, if it runs on the driver, it should not initialize the Velox cache. There are two ways to implement this enhancement:

  1. Pass a config, spark.gluten.isDriver, when creating the VeloxBackend instance. This is simple.
  2. Add a member variable to the VeloxBackend class. This requires changing the JNI code.
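
A minimal sketch of option 1, assuming the flag reaches the native side as part of a string key/value map. `spark.gluten.isDriver` is the key proposed above, not an existing Gluten config, and `isDriver` / `shouldInitCache` are hypothetical helper names, not functions in VeloxBackend.cc.

```cpp
#include <string>
#include <unordered_map>

// Key proposed in this issue; assumed here, not an existing Gluten config.
const std::string kIsDriverKey = "spark.gluten.isDriver";

// Returns true only when the flag is present and set to "true".
// A missing flag is treated as "executor", which keeps the current behavior.
bool isDriver(const std::unordered_map<std::string, std::string>& conf) {
  auto it = conf.find(kIsDriverKey);
  return it != conf.end() && it->second == "true";
}

// Hypothetical helper: the cache is only set up when not running on the driver.
bool shouldInitCache(const std::unordered_map<std::string, std::string>& conf) {
  return !isDriver(conf);
}
```

Defaulting to "executor" when the key is absent means existing deployments would behave as they do today unless the driver side sets the flag explicitly.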

leoluan2009 commented 2 weeks ago

@zhztheplayer @zhouyuan could you share your thoughts? Thanks!

zhztheplayer commented 2 weeks ago

Driver and executor do have different plugin entrypoints; see https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxListenerApi.scala. Are you suggesting a new approach?

leoluan2009 commented 2 weeks ago

> Driver and executor do have different plugin entrypoints; see https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxListenerApi.scala. Are you suggesting a new approach?

But in VeloxBackend.cc we cannot know where it runs. That information is not passed from the Java code to the C++ code.

zhztheplayer commented 2 weeks ago

> > Driver and executor do have different plugin entrypoints; see https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxListenerApi.scala. Are you suggesting a new approach?
>
> But in VeloxBackend.cc we cannot know where it runs. That information is not passed from the Java code to the C++ code.

I see. Do you know which part of the C++ code requires this information?

leoluan2009 commented 2 weeks ago

> > > Driver and executor do have different plugin entrypoints; see https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxListenerApi.scala. Are you suggesting a new approach?
> >
> > But in VeloxBackend.cc we cannot know where it runs. That information is not passed from the Java code to the C++ code.
>
> I see. Do you know which part of the C++ code requires this information?

If it runs on the driver, it should not initialize the Velox cache. See https://github.com/apache/incubator-gluten/blob/main/cpp/velox/compute/VeloxBackend.cc#L197

zhztheplayer commented 2 weeks ago

I am curious why it matters whether the cache is initialized on the driver. Have you already seen issues or errors in your environment?

BTW, I'd prefer changing the JNI API to have different paths for driver / executor native initialization if we have to do this.
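
As a rough illustration of separate native initialization paths, here is a sketch with one entry point per role. Everything in it is hypothetical: `BackendSketch`, `Role`, `initializeOnDriver` and `initializeOnExecutor` are stand-ins for illustration, not the actual VeloxBackend class or Gluten's JNI functions. In a real change, two distinct JNI functions would presumably call into entry points like these.

```cpp
#include <string>
#include <vector>

enum class Role { kDriver, kExecutor };

// Stand-in for the real backend: it only records which init steps ran.
class BackendSketch {
 public:
  explicit BackendSketch(Role role) : role_(role) {
    initCommon();
    // Cache setup is skipped entirely on the driver.
    if (role_ == Role::kExecutor) {
      initCache();
    }
  }

  Role role() const { return role_; }
  const std::vector<std::string>& steps() const { return steps_; }

 private:
  void initCommon() { steps_.push_back("common"); }
  void initCache() { steps_.push_back("cache"); }

  Role role_;  // the member variable from option 2 in the description
  std::vector<std::string> steps_;
};

// Hypothetical entry points, one per role.
BackendSketch initializeOnDriver() { return BackendSketch(Role::kDriver); }
BackendSketch initializeOnExecutor() { return BackendSketch(Role::kExecutor); }
```

Compared with a config flag, this keeps the role out of the Spark conf and makes the driver/executor split explicit at the API boundary.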

leoluan2009 commented 2 weeks ago

> I am curious why it matters whether the cache is initialized on the driver. Have you already seen issues or errors in your environment?
>
> BTW, I'd prefer changing the JNI API to have different paths for driver / executor native initialization if we have to do this.

Yes. Initializing the cache creates the cache directory and checks the remaining disk capacity, and the Spark driver node may have a smaller disk than the executors.
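
To make the concern concrete, here is a sketch of the kind of check described, written with the standard `std::filesystem` API (C++17). It is not the code in VeloxBackend.cc; `hasCacheCapacity` and `setUpCacheIfExecutor` are hypothetical names, and the `isDriver` argument is assumed to be supplied by whichever mechanism is chosen.

```cpp
#include <cstdint>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Creates `dir` if needed and reports whether its filesystem has at least
// `requiredBytes` available. Any filesystem error is treated as "no capacity".
bool hasCacheCapacity(const fs::path& dir, std::uintmax_t requiredBytes) {
  std::error_code ec;
  fs::create_directories(dir, ec);
  if (ec) {
    return false;
  }
  const fs::space_info info = fs::space(dir, ec);
  if (ec) {
    return false;
  }
  return info.available >= requiredBytes;
}

// Hypothetical guard: on the driver, no directory is created and no capacity
// check runs, so a small driver disk cannot fail initialization.
bool setUpCacheIfExecutor(bool isDriver, const fs::path& dir,
                          std::uintmax_t requiredBytes) {
  if (isDriver) {
    return false;
  }
  return hasCacheCapacity(dir, requiredBytes);
}
```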

FelixYBW commented 2 weeks ago

Do we start any Velox pipeline on the driver today? Where is the cache initialized?

It looks like only the BHJ's hash build may run on the driver, which we haven't implemented yet.

LoseYSelf commented 2 weeks ago

> Do we start any Velox pipeline on the driver today? Where is the cache initialized?
>
> It looks like only the BHJ's hash build may run on the driver, which we haven't implemented yet.

This line checks the SSD space: https://github.com/apache/incubator-gluten/blob/c653337cdf54067cd4a01d14b908a521fdd11b3a/cpp/velox/compute/VeloxBackend.cc#L217

FelixYBW commented 2 weeks ago

Thank you. Then we should initialize Velox differently on the driver and on the workers.