dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[spark] Make xgboost.spark support spark connect ML #9780

Open WeichenXu123 opened 10 months ago

WeichenXu123 commented 10 months ago

Since Spark 3.5, PySpark includes a new module, pyspark.ml.connect, which supports a few ML algorithms that run in Spark Connect mode. This is the design doc: https://www.google.com/url?q=https://docs.google.com/document/d/1LHzwCjm2SluHkta_08cM3jxFSgfF-niaCZbtIThG-H8/edit&sa=D&source=calendar&ust=1700005806011038&usg=AOvVaw2VEdVyMYg40yDLpElhcRAu

We should make the estimators defined in xgboost.spark support Spark Connect mode. To achieve that goal, we need:

WeichenXu123 commented 10 months ago

CC @wbo4958

wbo4958 commented 10 months ago

Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API doesn't support stage-level scheduling yet. In that case, do we need to force only one task to run on each executor?

WeichenXu123 commented 10 months ago

> Yes, that is a good suggestion. However, I have a concern: the Spark DataFrame API doesn't support stage-level scheduling yet. In that case, do we need to force only one task to run on each executor?

It is supported, see:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html?highlight=mapinpandas

and

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInArrow.html?highlight=mapinarrow#pyspark.sql.DataFrame.mapInArrow

A new barrier argument has been added to both of them. @wbo4958
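
For reference, here is a minimal sketch of how the barrier argument is passed to mapInPandas, assuming PySpark 3.5 or later and an already created SparkSession. The helper names tag_partition and run_with_barrier are illustrative only and are not part of xgboost.spark.

```python
from typing import Iterator


def tag_partition(batches: Iterator) -> Iterator:
    """Runs on the executors, once per partition.

    Each item in `batches` is a pandas.DataFrame; this toy function only
    adds a derived column so the output schema is easy to state.
    """
    for batch in batches:
        yield batch.assign(doubled=batch["id"] * 2)


def run_with_barrier(spark):
    """`spark` is assumed to be an existing SparkSession."""
    df = spark.range(8).repartition(2)
    # barrier=True asks Spark to run this stage in barrier execution mode,
    # so all tasks of the stage are launched together.
    return df.mapInPandas(
        tag_partition, schema="id long, doubled long", barrier=True
    )
```

Note that barrier only controls how the tasks of the stage are launched; it does not say anything about the resources each task is given.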

wbo4958 commented 10 months ago

Hi @WeichenXu123, I mean stage-level scheduling, not barrier execution. I guess we can support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, the same way barrier is supported. That would be cool.
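
To illustrate the distinction, stage-level scheduling is exposed on the RDD API through a ResourceProfile attached with withResources. The following is a rough sketch only, assuming PySpark 3.1 or later, a cluster manager that supports stage-level scheduling, and a task resource named "gpu". The helper names build_task_profile and run_with_profile are hypothetical.

```python
def build_task_profile(task_cpus: int = 1, task_gpus: float = 1.0):
    """Build a ResourceProfile describing what each task of a stage needs."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

    task_reqs = TaskResourceRequests().cpus(task_cpus).resource("gpu", task_gpus)
    # `build` is a property on ResourceProfileBuilder, not a method.
    return ResourceProfileBuilder().require(task_reqs).build


def run_with_profile(rdd, func, profile):
    """Apply `func` per partition, with the stage scheduled under `profile`."""
    return rdd.mapPartitions(func).withResources(profile)
```

The profile travels with the RDD, which is why an equivalent hook on DataFrame APIs such as mapInPandas is what this thread is asking for.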

wbo4958 commented 10 months ago

BTW, I'd like to take this task to make xgboost.spark support spark connect ML.

WeichenXu123 commented 10 months ago

> Hi @WeichenXu123, I mean stage-level scheduling, not barrier execution. I guess we can support DataFrame stage-level scheduling in Spark for specific APIs like mapInPandas and mapInArrow, the same way barrier is supported. That would be cool.

Oh, sorry, I misread. Yes, we don't support stage-level scheduling in the Spark Connect API yet; this is a TODO task.

WeichenXu123 commented 10 months ago

We can add stage-level scheduling params to the mapInPandas API, similar to the barrier param. CC @zhengruifeng @Ngone51 WDYT?

zhengruifeng commented 10 months ago

> We can add stage-level scheduling params to the mapInPandas API, similar to the barrier param. CC @zhengruifeng @Ngone51 WDYT?

I think this approach is feasible.

wbo4958 commented 10 months ago

Cool, let me put up a PR supporting stage-level scheduling for the DataFrame API in Spark. @WeichenXu123 @zhengruifeng @Ngone51

trivialfis commented 4 weeks ago

Hi, may I ask what's the current status of this?