WeichenXu123 opened 10 months ago
CC @wbo4958
Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API hasn't supported stage-level scheduling yet. In that case, do we need to force only one task to run on each executor?
It is supported, see `mapInPandas` and `mapInArrow`: a new `barrier` argument was added for them.
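For reference, a minimal sketch of how the `barrier` argument fits in. `mapInPandas` takes a function over an iterator of pandas DataFrames (one per batch of a partition); the training function below is a placeholder, and the Spark call itself is shown in comments since it needs a live session:

```python
import pandas as pd

# A mapInPandas-style function: consumes an iterator of pandas DataFrames
# for one partition and yields pandas DataFrames. Placeholder for a
# distributed-training worker such as an XGBoost task.
def train_fn(batches):
    for pdf in batches:
        # a real worker would feed `pdf` into its training loop here
        yield pd.DataFrame({"n_rows": [len(pdf)]})

# With a real SparkSession (not executed here), barrier=True launches all
# tasks of the stage together, as barrier execution requires:
#
#   df.mapInPandas(train_fn, schema="n_rows long", barrier=True)

# Local demonstration of the function's contract, without Spark:
out = list(train_fn(iter([pd.DataFrame({"x": [1, 2, 3]})])))
print(out[0]["n_rows"].iloc[0])  # 3
```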
@wbo4958
Hi @WeichenXu123, I mean the stage-level scheduling not the barrier execution. I guess we can support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow using the same way as the barrier supporting. That will be cool.
BTW, I'd like to take this task to make xgboost.spark support spark connect ML.
Oh, sorry for my misreading. Yes, we haven't supported stage-level scheduling in the Spark Connect API yet; this is a TODO task.
We can add stage-level scheduling params to the `mapInPandas` API, similar to the `barrier` param. CC @zhengruifeng @Ngone51 WDYT?
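A hypothetical sketch of what such a parameter could look like. The `profile` parameter name is an assumption (the API is only being proposed in this thread); the pyspark calls are shown in comments, with a pure-Python illustration of the per-task resource request a profile carries:

```python
# With pyspark (not executed here), a stage-level ResourceProfile is built
# from task resource requests and could be passed to mapInPandas, e.g.:
#
#   from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
#   treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
#   rp = ResourceProfileBuilder().require(treqs).build
#   df.mapInPandas(train_fn, schema="...", barrier=True, profile=rp)  # `profile` is hypothetical
#
# Pure-Python illustration of the per-task resource request such a profile
# would carry, e.g. pinning exactly one GPU per barrier task:
def task_resource_request(cpus=1, **custom):
    """Return a mapping describing one task's resource needs."""
    req = {"cpus": cpus}
    req.update(custom)  # e.g. gpu=1
    return req

req = task_resource_request(cpus=1, gpu=1)
print(req)  # {'cpus': 1, 'gpu': 1}
```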
I think this way is feasible.
Cool, let me put up the PR supporting stage-level scheduling for the DataFrame API in Spark. @WeichenXu123 @zhengruifeng @Ngone51
Hi, may I ask what's the current status of this?
Since Spark 3.5, a new PySpark module `pyspark.ml.connect` is added; it supports a few ML algorithms that run in Spark Connect mode. This is the design doc: https://docs.google.com/document/d/1LHzwCjm2SluHkta_08cM3jxFSgfF-niaCZbtIThG-H8/edit

We should make estimators defined in `xgboost.spark` support Spark Connect mode. To achieve the goal, we need the estimator to extend `pyspark.ml.connect.Estimator` if it runs in Spark Connect mode.
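A sketch of the dispatch this implies; this is an assumption about the approach, not the actual `xgboost.spark` implementation. The pyspark-specific imports are shown in comments, and the runnable part takes the mode check as an injected callable:

```python
# With pyspark (not executed here), the base class could be chosen at
# import/definition time depending on the session type, e.g.:
#
#   from pyspark.sql.utils import is_remote  # True under Spark Connect
#   if is_remote():
#       from pyspark.ml.connect import Estimator  # connect-mode base
#   else:
#       from pyspark.ml import Estimator          # classic base
#
# Pure-Python illustration of the dispatch, with the mode check injected
# so it can run without a Spark installation:
def pick_estimator_base(is_remote_fn):
    """Return the fully qualified name of the estimator base class to use."""
    if is_remote_fn():
        return "pyspark.ml.connect.Estimator"
    return "pyspark.ml.Estimator"

print(pick_estimator_base(lambda: True))   # pyspark.ml.connect.Estimator
print(pick_estimator_base(lambda: False))  # pyspark.ml.Estimator
```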