dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoost 4J spark giving XGBoostError: std::bad_alloc on databricks #7155

Open jon-targaryen1995 opened 3 years ago

jon-targaryen1995 commented 3 years ago

I am using XGBoost4J-Spark to train a distributed XGBoost model on my data. I am developing my model in Databricks.

The Spark version is 3.1.1, with Scala 2.12 and XGBoost4J 1.4.1.

My cluster configuration looks as below:

    spark.executor.memory 9g
    spark.executor.cores 5

So basically I have 10 executors with 4.6 GB of memory and 1 driver with 3.3 GB of memory.


I imported the package as below:

    import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}

In order to find the best parameters for my model, I created a parameter grid and used a train-validation split, as shown below:

//Parameter tuning
    import org.apache.spark.ml.tuning._
    import org.apache.spark.ml.PipelineModel
    import Array._

//Create parameter grid (range(6, 10, 2) expands to Array(6, 8))
    val paramGrid = new ParamGridBuilder()
        .addGrid(xgbRegressor.maxDepth, range(6, 10, 2))
        .addGrid(xgbRegressor.eta, Array(0.01))
        .addGrid(xgbRegressor.minChildWeight, Array(8.0, 10.0, 12.0, 14.0))
        .addGrid(xgbRegressor.gamma, Array(0, 0.25, 0.5, 1))
        .build()
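
For context, a setup along these lines is assumed for the xgbRegressor, pipeline and evaluator used above and below; the feature/label column names and fixed parameter values here are illustrative placeholders, not the exact code from this notebook:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorAssembler
    import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

    // Assemble feature columns into a single vector column (column names are placeholders)
    val assembler = new VectorAssembler()
        .setInputCols(Array("feature1", "feature2", "feature3"))
        .setOutputCol("features")

    // XGBoost regressor; the fixed settings shown here are illustrative
    val xgbRegressor = new XGBoostRegressor()
        .setFeaturesCol("features")
        .setLabelCol("label")
        .setNumRound(100)
        .setNumWorkers(10)

    val pipeline = new Pipeline().setStages(Array(assembler, xgbRegressor))

    // Evaluator used by the train-validation split below
    val evaluator = new RegressionEvaluator()
        .setLabelCol("label")
        .setPredictionCol("prediction")
        .setMetricName("rmse")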

I then fit it to my data and saved it,

    val trainValidationSplit = new TrainValidationSplit()
        .setEstimator(pipeline)
        .setEvaluator(evaluator)
        .setEstimatorParamMaps(paramGrid)
        .setTrainRatio(0.75)

    val tvmodel = trainValidationSplit.fit(train)

    tvmodel.write.overwrite().save("spark-train-validation-split-28072021")

The error comes up when I try to load the model again:

    import org.apache.spark.ml.tuning._
    val tvmodel = TrainValidationSplitModel.load("spark-train-validation-split-28072021")

The error message is XGBoostError: std::bad_alloc

I checked the executor and driver logs. The executor logs looked fine. I found the same error in the driver logs (both the stderr and log4j files). Both log files are attached here: stderr.txt, log4j-2021-08-03-13.log

Since the error message was mainly found in the driver logs, I tried the following solutions,

But all of the above failed. The log files clearly indicate that the driver memory is not overloaded, so I am struggling to figure out what the error actually is.

It would be great if one of you could point me in the right direction.

jon-targaryen1995 commented 3 years ago

On further investigation, I increased the driver size to 128 GB of memory and 32 cores, and got the same error.

I continuously monitored the driver metrics and none of them showed any overload.

Number of CPUs used: [screenshot]

Memory usage: [screenshot]

CPU usage as a share of capacity (at least 85% of CPU capacity is idle): [screenshot]

Network: [screenshot]

The log4j file has two different error messages:

The stderr has one error message that looks as below:

I am starting to think this is a bug in the package and not my cluster configuration. Maybe the package is not compatible with the Scala save and load methods.
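
If it helps with narrowing this down, a minimal check along the following lines could isolate whether saving and reloading just the XGBoost stage reproduces the error, independently of TrainValidationSplitModel. This is a sketch, not code from the failing notebook: the path is hypothetical and bestModel is assumed to be a PipelineModel.

    import org.apache.spark.ml.PipelineModel
    import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

    // Pull the fitted XGBoost stage out of the best pipeline found by the search
    val xgbStage = tvmodel.bestModel.asInstanceOf[PipelineModel]
        .stages
        .collectFirst { case m: XGBoostRegressionModel => m }
        .get

    // Save and reload only that stage (hypothetical path)
    xgbStage.write.overwrite().save("xgb-stage-only-28072021")
    val reloaded = XGBoostRegressionModel.load("xgb-stage-only-28072021")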

trivialfis commented 3 years ago

@wbo4958 Have you seen something similar before?

kumarprabhu1988 commented 3 years ago

@jon-targaryen1995 I'm doing the exact same thing (same versions of Scala, Spark and xgboost4j) and I see the same error when I try to load my model. Any luck with this? I'm using EMR on AWS.

trivialfis commented 3 years ago

Could you please try the nightly build: https://xgboost.readthedocs.io/en/latest/install.html#id4 ?

wbo4958 commented 3 years ago

No, I haven't seen this; it seems the issue may be related to https://github.com/dmlc/xgboost/pull/7067

prashantprakash commented 2 years ago

Is this issue resolved? I am getting the exact same error.

wsurles-vamp commented 2 years ago

@jon-targaryen1995 What did you do to resolve the bad_alloc issue, if you ever did? Nodes with more memory? A change to a setting in your Spark or XGBoost params?

wbo4958 commented 2 years ago

@wsurles-vamp, which version were you using when you reproduced this issue? Could you file a new issue with detailed steps to reproduce it? With that, I will check it. Thanks

kstock commented 2 years ago

Thanks so much @wbo4958 (I am a coworker of @wsurles-vamp). This is sensitive data, so I will need to create a fake dataset that reproduces the issue to give to you, and check which other details are OK to share in public. This is not on Databricks.

I'll open a new issue soon with more details, but here are some quick basics:

Versions:

- xgboost 1.5.1 (xgboost4j_2.12-1.5.1.jar / xgboost4j-spark_2.12-1.5.1.jar)
- xgboost 1.6 with the fix from https://github.com/dmlc/xgboost/pull/7844
- seen with Spark 3.1.2 and Spark 3.2.1
- we are using the pyspark wrapper from https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296, with the only edit being the addition of the killSparkContextOnWorkerFailure param


Some dataset info:

- rows: 109 million
- rows with label=1: 51 thousand
- cols: 254
- 'objective': 'binary:logistic', 'treeMethod': 'hist' (see the sketch below for how these map onto the estimator parameters)
- there are no columns of all 0s, but some are very sparse

We have a few datasets generating this error; this is just one of them, and the row/column counts differ between them. We have some datasets of similar or bigger sizes that do not hit the error, and it can be intermittent: some rounds of hyperparameter tuning hit it and others do not, though we haven't noticed a correlation between params like maxDepth and the error.
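
For illustration only: the training above went through the pyspark wrapper, but the equivalent parameters on the Scala estimator might look like the sketch below. The num_round/num_workers values and column names are placeholders, not taken from this report:

    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    val xgbParams = Map(
      "objective"   -> "binary:logistic",
      "tree_method" -> "hist",        // "treeMethod" in the wrapper
      "num_round"   -> 100,           // placeholder value
      "num_workers" -> 26             // placeholder value
    )

    val classifier = new XGBoostClassifier(xgbParams)
        .setFeaturesCol("features")   // assumed column names
        .setLabelCol("label")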


Here is our specific stack trace:

ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc
    at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
    at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
    at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:43)
    at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:574)
    at ml.dmlc.xgboost4j.scala.spark.PreXGBoost$.$anonfun$trainForNonRanking$1(PreXGBoost.scala:480)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
wsurles-vamp commented 2 years ago

A couple of other quick points of note: we use dense vectors (to ensure 0s are treated as 0). We have a 26-node Spark cluster with 120 GiB of memory, though we have seen the error on nodes with 180 GiB, and well before memory has run out, as many others have mentioned. We have seen it on CentOS 7 and Rocky Linux 8.
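
As a small illustration (placeholder sizes, not our data) of the trade-off involved in forcing dense vectors on mostly-zero rows:

    import org.apache.spark.ml.linalg.Vectors

    // A row with 254 columns of which only 3 are non-zero
    val sparseRow = Vectors.sparse(254, Array(3, 17, 200), Array(1.0, 2.5, 0.75))

    // The dense form materializes all 254 doubles (~2 KB per row)
    // instead of storing just the 3 non-zero entries
    val denseRow = Vectors.dense(sparseRow.toArray)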

kstock commented 2 years ago

We have a better idea of what is going on now; we'll create a proper issue soon, but it might be a few days. This is another related question: https://discuss.xgboost.ai/t/useexternalmemory-no-longer-does-anything/2761

wbo4958 commented 2 years ago

FYI @kstock

From xgboost4j-spark 1.6.1, you don't need the killSparkContextOnWorkerFailure parameter any more, since 1.6.1 introduced Spark barrier execution mode to ensure every training task is launched at the same time, and Spark will kill the failed training job if some training tasks fail. In short, XGBoost will not stop the SparkContext any more.

> we are using the pyspark wrapper from https://github.com/dmlc/xgboost/pull/4656#issuecomment-510693296 with only edit being adding in killSparkContextOnWorkerFailure param

Looks like the error happened while building the DMatrix. From the error log "ml.dmlc.xgboost4j.java.XGBoostError: std::bad_alloc", it looks like an OOM issue?

@trivialfis any idea about this error log?

BTW, dense vectors may blow up the memory very quickly.

BTW, have you had a chance to try xgboost4j-spark-gpu for training?

wsurles-vamp commented 2 years ago

@wbo4958, we don't run out of memory per se, but XGBoost does seem to increase the size of the memory it requests by a factor of 1.5, and eventually this becomes a very big jump that is too much for the system to provide; that is when we get the bad_alloc error.
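
As a toy illustration of why this kind of growth tends to fail late rather than early (this is not XGBoost code, just a sketch of 1.5x geometric growth): each resize requests 1.5 times the previous capacity, so the final request can dwarf the remaining free memory even though usage has been climbing smoothly.

    // Toy model of 1.5x buffer growth; capacities are illustrative MiB values
    val capacities = Iterator.iterate(512.0)(_ * 1.5).take(12).toSeq
    capacities.zipWithIndex.foreach { case (cap, i) =>
      println(f"resize $i%2d requests $cap%.0f MiB")   // the last requests are the biggest jumps
    }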

In this graph of system memory (top) and committed memory (bottom), you can see committed memory increasing in stages as system memory is used. There are two fits on this chart. The first fit fails (in the middle of the chart) when the big committed-memory jump happens (note: there is still ~40 GiB available when it fails; the bottom of the chart is not 0). The second fit finishes a little sooner, never makes that last large memory commit, and succeeds.

[Screenshot: system memory (top) and committed memory (bottom) during the two fits]

With a debugger attached to a custom build, we saw "Resizing ParallelGroupBuilder.data_ to 4505075712 elements" at the time of the failure. The element count was steadily increasing, but I think it's the size of the memory request that's causing the failure.
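
For a rough sense of scale, and assuming (which the thread does not confirm) that each ParallelGroupBuilder.data_ element is a 4- or 8-byte value, that resize would be a single allocation on the order of tens of GiB:

    // Back-of-the-envelope sizing for the resize reported above;
    // the element width is an assumption, not taken from the XGBoost source
    val elements = 4505075712L
    val gib = math.pow(1024, 3)
    println(f"~${elements * 4 / gib}%.1f GiB if elements are 4 bytes")   // ~16.8 GiB
    println(f"~${elements * 8 / gib}%.1f GiB if elements are 8 bytes")   // ~33.6 GiB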

We are looking into getting more memory on our cluster and spreading out the XGBoost load as much as possible, but is there something we could do to make the committed memory grow more evenly? Or to have XGBoost not request more than is available?

wbo4958 commented 2 years ago

Hi @trivialfis, any idea about https://github.com/dmlc/xgboost/issues/7155#issuecomment-1113898292?

trivialfis commented 2 years ago

The QuantileDMatrix is coming to the CPU as well, if the refactoring goes well. The ParallelGroupBuilder is used during the construction of the DMatrix.

olbapjose commented 2 years ago

Hi guys, do you have any update on this? I am getting this error when reading a Spark pipeline from MLflow. The error happens when I try to run the line loaded_model = mlflow.spark.load_model(run_path). It looks like XGBoost suddenly starts creating more and more jobs (a bit strange if I just want to load the model, since I am not doing anything with it yet) and eventually crashes.

wbo4958 commented 2 years ago

> Hi guys, do you have any update on this? I am getting this error when reading a Spark pipeline from MLflow. The error happens when I try to run the line loaded_model = mlflow.spark.load_model(run_path). It looks like XGBoost suddenly starts creating more and more jobs (a bit strange if I just want to load the model, since I am not doing anything with it yet) and eventually crashes.

@olbapjose, is loaded_model = mlflow.spark.load_model(run_path) the Python code? So you were trying to load the model directly, and it crashed?

olbapjose commented 2 years ago

@wbo4958 Exactly, I was trying to load a trained ML pipeline that I had saved previously, where one of the stages is a Spark XGBoost classifier, and it crashes when it tries to load the XGBoost stage. The stack trace looks like the one pasted in an aforementioned message. I'm working in a Databricks notebook.

EDIT: interestingly, the issue arises inside a Databricks notebook (see the attached file for the complete stack trace) but does not arise when using the databricks-connect package to load the model from a local PyCharm IDE connected to the Databricks cluster.

stacktrace.txt

wbo4958 commented 2 years ago

@olbapjose Would you share the model in zip format, so we can repro it on Databricks?

olbapjose commented 2 years ago

@wbo4958 I suspect it is closely related to #8074, where I have uploaded the trained model I am trying to read. That issue arises both in a Databricks notebook and when using databricks-connect.

trivialfis commented 1 year ago

Regarding https://github.com/dmlc/xgboost/issues/7155#issuecomment-1113898292: I think the best way forward is to improve support for QuantileDMatrix in the JVM packages.