dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoostModel training failed on large dataset, works on subset of dataset #6929

Open ajordan-apixio opened 3 years ago

ajordan-apixio commented 3 years ago

Hey all,

Hoping I can get some help with this. I have a dataset with about 27 million rows and 300 columns, and I cannot figure out how to get training to work. It always fails with this error:

```
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697)
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:572)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:190)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
```

I don't see anything else useful anywhere in the logs. I'm using XGBoost 1.1.0, Spark 2.3.2, and running the code in a Zeppelin notebook.

I'm using a simple Spark ML pipeline to create the vector for training. I am aware of the compatibility issues with VectorAssembler documented here, but my dataset has very few zeros and no nulls. Here is the pipeline I'm using to assemble the vectors:

```scala
val featureIndexers = Array("company", "business")
  .map(c => new StringIndexer().setHandleInvalid("keep").setInputCol(c).setOutputCol(c + "_index").fit(trainingData))

val oneHot = new OneHotEncoderEstimator()
  .setInputCols(Array("company_index", "business_index"))
  .setOutputCols(Array("company_ohe", "business_ohe"))

val vectorAssembler = new VectorAssembler()
  .setInputCols(colsFeatures.toArray)
  .setOutputCol("features")
```
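For reference, a minimal sketch of how these stages might be chained and applied; the `Pipeline` wrapper and the `stages`/`featurePipeline`/`assembled` names are illustrative additions, not part of the original report:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Illustrative sketch: chain the fitted indexers, the encoder, and the assembler
// into one pipeline and materialize the "features" column used for training.
// `trainingData`, `featureIndexers`, `oneHot`, and `vectorAssembler` come from the
// snippet above; the remaining names are hypothetical.
val stages: Array[PipelineStage] = featureIndexers ++ Array[PipelineStage](oneHot, vectorAssembler)
val featurePipeline = new Pipeline().setStages(stages)
val assembled = featurePipeline.fit(trainingData).transform(trainingData)
```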

And here is the XGBoostClassifier:

```scala
val xgb = new XGBoostClassifier()
    .setNumWorkers(40) // just a few less than the total number available
    .setNthread(4) // same as spark.task.cpus
    .setTreeMethod("hist")
    .setObjective("binary:logistic")
    .setMaximizeEvaluationMetrics(true)
    .setSeed(1)
    .setTrainTestRatio(0.9)
    .setSubsample(0.9)
    .setEta(0.01)
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumRound(50000)
    .setNumEarlyStoppingRounds(45)
    .setMaxDepth(3)
```

I intentionally set the max depth extremely low to ensure that I wasn't running into memory issues with depth.

When I call fit, I receive the error posted above. I've attached a screenshot of what the Yarn UI looks like right before the job fails. Here is the line being referenced in the UI; I'm not sure why it points to the SparseVector line. I've tried doing an explicit conversion to a dense vector within a UDF just to confirm, and the error still occurs.
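The dense-conversion experiment described above can be sketched roughly as follows; the `toDense` UDF name is illustrative, and `assembled` is the hypothetical DataFrame from the pipeline sketch earlier:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Illustrative sketch: force every feature vector to its dense representation
// before training, to rule out the SparseVector code path.
val toDense = udf((v: Vector) => v.toDense: Vector)
val denseData = assembled.withColumn("features", toDense(col("features")))
```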

The size of the cluster or executors does not seem to affect the outcome, but using only a subset of my data does. It works if I only take 1 million rows, for example. This implies that it's a data size issue, but I have not yet been able to resolve it.
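A sketch of the subset comparison described above (the 1-million-row figure is from the report; `assembled` is again the hypothetical DataFrame from the pipeline sketch):

```scala
// Illustrative: the same fit call succeeds on ~1M rows but fails on the full ~27M rows.
val subset = assembled.limit(1000000)
val subsetModel = xgb.fit(subset)      // completes
// val fullModel = xgb.fit(assembled)  // fails with the XGBoostError shown above
```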

I've tried various combinations of Spark configuration parameters without success, mainly around these three:

```
spark.task.cpus          4
spark.executor.cores     8
spark.executor.instances 47
```
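For reference, a sketch of how these settings relate to the classifier configuration above; in a Zeppelin notebook they would normally be set in the interpreter settings rather than in code, so the SparkSession builder below is purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: each XGBoost worker runs as a single Spark task, so
//   concurrent task slots = 47 executors * (8 cores / 4 cores per task) = 94,
// which comfortably covers the 40 workers, and nthread (4) matches spark.task.cpus.
val spark = SparkSession.builder()
  .config("spark.task.cpus", "4")
  .config("spark.executor.cores", "8")
  .config("spark.executor.instances", "47")
  .getOrCreate()
```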

I'm at a loss as to how to address this. If anyone has suggestions, please let me know.

[Screenshot: Yarn UI shortly before the job fails (Screen Shot 2021-05-01 at 2 24 19 AM)]
FelixYBW commented 3 years ago

You can find more details in the executor log files, which are accessible from the YARN UI. To confirm whether it is an out-of-memory issue, you can collect sar data on each worker node; if it is, used memory minus cached memory will climb to about 99% before the job fails.