dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[spark] xgboost.spark supports saving model that is very large #8818

Open WeichenXu123 opened 1 year ago

WeichenXu123 commented 1 year ago

Currently, xgboost.spark saves model by:

        booster = xgb_model.get_booster().save_raw("json").decode("utf-8")
        _get_spark_session().sparkContext.parallelize([booster], 1).saveAsTextFile(
            model_save_path
        )

When the booster string is larger than 2 GB, Spark raises an error like:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.lang.StringCoding.encode(StringCoding.java:350)
        at java.lang.String.getBytes(String.java:941)
        at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:163)
        at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$$nestedInanonfun$makeFromJava$11$1.applyOrElse(EvaluatePython.scala:149)
        at org.apache.spark.sql.execution.python.EvaluatePython$.nullSafeConvert(EvaluatePython.scala:213)
        at org.apache.spark.sql.execution.python.EvaluatePython$.$anonfun$makeFromJava$11(EvaluatePython.scala:148)
        at org.apache.spark.sql.execution.python.EvaluatePython$$$Lambda$8923/940497638.apply(Unknown Source)
        at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$$nestedInanonfun$makeFromJava$16$1.applyOrElse(EvaluatePython.scala:195)
        at org.apache.spark.sql.execution.python.EvaluatePython$.nullSafeConvert(EvaluatePython.scala:213)
        at org.apache.spark.sql.execution.python.EvaluatePython$.$anonfun$makeFromJava$16(EvaluatePython.scala:182)
        at org.apache.spark.sql.execution.python.EvaluatePython$$$Lambda$8925/1092357589.apply(Unknown Source)
        at org.apache.spark.sql.SparkSession.$anonfun$applySchemaToPythonRDD$2(SparkSession.scala:960)
        at org.apache.spark.sql.SparkSession$$Lambda$8929/213457688.apply(Unknown Source)

We need to consider splitting the model string before saving it.
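
The splitting idea could be sketched roughly as below. This is a minimal, hypothetical helper (the function name and chunk size are illustrative, not part of xgboost.spark): break the serialized booster into fixed-size pieces so no single RDD element approaches the JVM's 2 GB array limit.

```python
def split_into_chunks(raw: str, chunk_size: int = 256 * 1024 * 1024) -> list:
    """Return the serialized model string as an ordered list of chunks,
    each at most chunk_size characters long."""
    return [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]

# Saving would then parallelize the chunks instead of one giant string, e.g.:
#   chunks = split_into_chunks(booster_str)
#   sc.parallelize(chunks, len(chunks)).saveAsTextFile(model_save_path)
# Loading would need to read the parts in order and join them back together.
```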

WeichenXu123 commented 1 year ago

CC @wbo4958 Thoughts?

trivialfis commented 1 year ago

One can use bytes (ubjson) instead of str once cuDF support is available

WeichenXu123 commented 1 year ago

> One can use bytes (ubjson) instead of str once cuDF support is available

A very long bytes object would cause a similar JVM error; we should split it into multiple byte arrays or strings before saving.

WeichenXu123 commented 1 year ago

CC @wbo4958 Do you have a good idea to address it while keeping the same saving format?

For the Databricks runtime, we can copy the model JSON file to DBFS directly using the DBFS FUSE feature. But for other distributed filesystems, we might need to use https://arrow.apache.org/docs/python/filesystems.html. WDYT?

wbo4958 commented 1 year ago

Sorry for the late response. Hmm, it seems this is really a problem for large models.

> For the Databricks runtime, we can copy the model JSON file to DBFS directly using the DBFS FUSE feature. But for other distributed filesystems, we might need to use https://arrow.apache.org/docs/python/filesystems.html. WDYT?

This seems doable for distributed filesystems.

How about the idea of splitting it into multiple byte arrays or strings before saving?

Does native XGBoost help with splitting, or does pyspark-xgboost do that?

WeichenXu123 commented 1 year ago

> Does native XGBoost help with splitting?

I don't think so. It is in JSON format.

WeichenXu123 commented 1 year ago

@wbo4958

Do you have time to file a PR that changes saving to use https://arrow.apache.org/docs/python/filesystems.html? At the same time, we can make an optimization that saves the model directly to a JSON file (then copies it to the distributed filesystem path); this will save memory significantly when saving large models.

Then I will file a follow-up PR to support databricks dbfs:// path.

trivialfis commented 1 year ago

Another way to do this is to first split up the model and then concatenate the pieces back: https://github.com/dmlc/xgboost/issues/8709.

wbo4958 commented 1 year ago

> Another way to do this is to first split up the model and then concatenate the pieces back: https://github.com/dmlc/xgboost/issues/8709.

Cool, that can help resolve this issue.

> Do you have time to file a PR that changes saving to use https://arrow.apache.org/docs/python/filesystems.html? At the same time, we can make an optimization that saves the model directly to a JSON file (then copies it to the distributed filesystem path); this will save memory significantly when saving large models.

@WeichenXu123 This idea seems great, but it requires substantial code changes from the estimator to the model, including estimator/model persistence. And if we try to support this, it seems XGBoost can no longer do prediction on the driver side, and the model cannot be used by other libraries without first downloading it to the driver.

WeichenXu123 commented 1 year ago

If splitting up the model is an easier approach, we can adopt it.

WeichenXu123 commented 1 year ago

@wbo4958

Q: Can we add a saving option that saves the model as a Spark DataFrame? In Spark 4.0 the "Spark Connect client" mode is out, and in Spark Connect client mode, RDD APIs are no longer supported.
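
One hypothetical shape for such an option (names are illustrative): represent the chunked model as ordered rows, which a caller could then persist via `spark.createDataFrame(...)` and the DataFrame writer instead of `saveAsTextFile`, avoiding RDD APIs entirely.

```python
def model_to_rows(raw: str, chunk_size: int) -> list:
    # Split the serialized model into ordered (index, chunk) rows; the index
    # column preserves chunk order so the model can be reassembled after a
    # read, e.g. spark.createDataFrame(rows, ["idx", "part"]).write.parquet(...)
    return [(i, raw[pos:pos + chunk_size])
            for i, pos in enumerate(range(0, len(raw), chunk_size))]
```

The explicit index matters because a DataFrame read gives no ordering guarantee, so loading would sort by `idx` before joining the parts.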

wbo4958 commented 1 year ago

> Q: Can we add a saving option that saves the model as a Spark DataFrame? In Spark 4.0 the "Spark Connect client" mode is out, and in Spark Connect client mode, RDD APIs are no longer supported.

Sorry, I didn't get you. What should be saved as a Spark DataFrame? @WeichenXu123

WeichenXu123 commented 1 year ago

> Q: Can we add a saving option that saves the model as a Spark DataFrame?

> Sorry, I didn't get you. What should be saved as a Spark DataFrame?

We can ignore this issue for now. The Spark Connect roadmap is still under discussion and is not clear currently.

wbo4958 commented 1 year ago

@WeichenXu123 ok, thx for the info