[spark] Make xgboost spark support large model size

dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

https://xgboost.readthedocs.io/en/stable/

Apache License 2.0


[spark] Make xgboost spark support large model size #10984

Closed WeichenXu123 closed 2 weeks ago

WeichenXu123 commented 2 weeks ago

A Spark RDD can't support a single row with very long content.

To make training, saving, and loading work for large models, I split the model JSON string into chunks when collecting the model during training, and modified the saving and loading code accordingly.
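The chunking idea can be sketched in plain Python as follows. This is only an illustration of splitting a long string and joining it back, not the code from this PR: `CHUNK_SIZE`, `split_into_chunks`, and `join_chunks` are hypothetical names, and the chunk size is deliberately tiny so the effect is visible.

```python
from typing import Iterable, Iterator

# Hypothetical chunk size for illustration; the value used in the PR is not shown here.
CHUNK_SIZE = 8


def split_into_chunks(text: str, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield consecutive slices of `text`, each at most `chunk_size` characters long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def join_chunks(chunks: Iterable[str]) -> str:
    """Rebuild the original string; the chunks must be in their original order."""
    return "".join(chunks)


# A small stand-in for a serialized model; a real model JSON would be far larger.
model_json = '{"learner": {"attributes": {}, "version": [2, 1, 0]}}'
chunks = list(split_into_chunks(model_json))
restored = join_chunks(chunks)
```

Each chunk can then be stored as its own row, so no single row has to hold the whole model. Order matters when joining, so whatever stores the chunks has to preserve or record their order.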

wbo4958 commented 2 weeks ago

LGTM for the functionality except the CI issue.

wbo4958 commented 2 weeks ago

Could you run python tests/ci_build/lint_python.py --format=1 --type-check=1 --pylint=1 to check the Python formatting?

WeichenXu123 commented 2 weeks ago

I can't fully understand the linter error:

xgboost/spark/core.py:1162: error: Incompatible types in assignment (expression has type "str", variable has type "Booster")  [assignment]
xgboost/spark/core.py:1164: error: Argument 1 to "len" has incompatible type "Booster"; expected "Sized"  [arg-type]
Found 2 errors in 1 file (checked 41 source files)

@wbo4958 any ideas?

trivialfis commented 2 weeks ago

@WeichenXu123 XGBoost's Python package uses Python type hints. In the following line:

booster = booster.save_raw("json").decode("utf-8")

Here booster was an xgboost.Booster object, but decode("utf-8") returns a string. Assigning a string to a variable of type Booster violates static typing.
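One common way to satisfy the type checker is to bind the decoded string to a new name instead of reusing booster. The sketch below is an assumption about how such a fix could look, not the change actually made in the PR. The `Booster` class here is a minimal stand-in so the example runs without xgboost installed, and `booster_json` is a hypothetical variable name.

```python
class Booster:
    """Stand-in for xgboost.Booster, for illustration only."""

    def __init__(self, payload: str) -> None:
        self._payload = payload

    def save_raw(self, raw_format: str = "json") -> bytearray:
        # Mimics returning the serialized model as raw bytes.
        return bytearray(self._payload, "utf-8")


booster = Booster('{"learner": {}}')

# Rejected by mypy: `booster` is inferred as Booster, but the right-hand side is a str.
# booster = booster.save_raw("json").decode("utf-8")

# Accepted: the string gets its own name and its own type.
booster_json: str = booster.save_raw("json").decode("utf-8")
size = len(booster_json)  # len() on a str is fine; len() on a Booster was the second error
```

With a separate name, both reported errors go away together, because `len` is now applied to a `str` rather than to a `Booster`.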

wbo4958 commented 2 weeks ago

LGTM if the CI can pass

WeichenXu123 commented 1 week ago

@trivialfis Can we make a patch release that includes this fix? We have several customers facing the issue. Thanks!

trivialfis commented 1 week ago

@WeichenXu123 https://github.com/dmlc/xgboost/issues/10992