dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[pyspark] Add tracker_on_driver to decide where the tracker will be launched #10281

Open wbo4958 opened 2 weeks ago

wbo4958 commented 2 weeks ago
from pyspark.ml.linalg import Vectors

from xgboost.callback import EvaluationMonitor
from xgboost.spark import SparkXGBRegressor

# Assumes an active SparkSession bound to `spark`.
# Build a small training DataFrame with a validation-indicator column.
df_train = spark.createDataFrame(
    [
        (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
        (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
        (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
        (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
    ]
    * 100,
    ["features", "label", "isVal", "weight"],
)

# Launch the tracker on the driver so the EvaluationMonitor output is
# printed there instead of on the executors.
xgb_regressor = SparkXGBRegressor(
    num_workers=5,
    callbacks=[EvaluationMonitor()],
    tracker_on_driver=True,
    validation_indicator_col="isVal",
)
xgb_reg_model = xgb_regressor.fit(df_train)

With the above test code, the log below is printed on the driver; otherwise, it would be printed on the executor side.

[0] training-rmse:0.35149   validation-rmse:0.35149
[0] training-rmse:0.35149   validation-rmse:0.35149
[1] training-rmse:0.24708   validation-rmse:0.24708
[1] training-rmse:0.24708   validation-rmse:0.24708
[2] training-rmse:0.17369   validation-rmse:0.17369
[2] training-rmse:0.17369   validation-rmse:0.17369
[3] training-rmse:0.12210   validation-rmse:0.12210
[3] training-rmse:0.12210   validation-rmse:0.12210
[4] training-rmse:0.08583   validation-rmse:0.08583
[4] training-rmse:0.08583   validation-rmse:0.08583
[5] training-rmse:0.06034   validation-rmse:0.06034
[5] training-rmse:0.06034   validation-rmse:0.06034
[6] training-rmse:0.04242   validation-rmse:0.04242
...
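
For comparison, a minimal sketch of the same pipeline with the tracker left on the executor side. This assumes tracker_on_driver defaults to False when omitted, in which case the EvaluationMonitor output appears in the executor logs rather than on the driver.

xgb_regressor_default = SparkXGBRegressor(
    num_workers=5,
    callbacks=[EvaluationMonitor()],
    # tracker_on_driver left unset (assumed to default to False), so the
    # tracker and the evaluation log stay on the executor side.
    validation_indicator_col="isVal",
)
xgb_reg_model_default = xgb_regressor_default.fit(df_train)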