dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoostModel training failed on large dataset, works on subset of dataset #6929

Open ajordan-apixio opened 3 years ago

ajordan-apixio commented 3 years ago

Hey all,

Hoping I can get some help with this. I have a dataset with about 27 million rows and 300 columns, and I cannot figure out how to get training to work. It always fails with this error:

```
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697)
  at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:572)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:190)
  at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
```

I don't see anything else useful anywhere in the logs. I'm using XGBoost 1.1.0, Spark 2.3.2, and running the code in a Zeppelin notebook.

I'm using a simple Spark ML pipeline to create the vector for training. I am aware of the compatibility issues with VectorAssembler documented here, but my dataset has very few zeros and no nulls. Here is the pipeline I'm using to assemble the vectors:

```scala
val featureIndexers = Array("company", "business")
  .map(c => new StringIndexer().setHandleInvalid("keep").setInputCol(c).setOutputCol(c + "_index").fit(trainingData))

val oneHot = new OneHotEncoderEstimator()
  .setInputCols(Array("company_index", "business_index"))
  .setOutputCols(Array("company_ohe", "business_ohe"))

val vectorAssembler = new VectorAssembler()
  .setInputCols(colsFeatures.toArray)
  .setOutputCol("features")
```
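For reference, a minimal sketch of how these stages might be chained and applied; the `Pipeline` wrapper and the `stages`/`featurePipeline`/`assembled` names are illustrative additions, not part of the original report:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Illustrative sketch: chain the fitted indexers, the encoder, and the assembler
// into one pipeline and materialize the "features" column used for training.
// `trainingData`, `featureIndexers`, `oneHot`, and `vectorAssembler` come from the
// snippet above; the remaining names are hypothetical.
val stages: Array[PipelineStage] = featureIndexers ++ Array[PipelineStage](oneHot, vectorAssembler)
val featurePipeline = new Pipeline().setStages(stages)
val assembled = featurePipeline.fit(trainingData).transform(trainingData)
```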

And here is the XGBoostClassifier:

```scala
val xgb = new XGBoostClassifier()
    .setNumWorkers(40) // just a few less than the total number available
    .setNthread(4) // same as spark.task.cpus
    .setTreeMethod("hist")
    .setObjective("binary:logistic")
    .setMaximizeEvaluationMetrics(true)
    .setSeed(1)
    .setTrainTestRatio(0.9)
    .setSubsample(0.9)
    .setEta(0.01)
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumRound(50000)
    .setNumEarlyStoppingRounds(45)
    .setMaxDepth(3)
```

I intentionally set the max depth extremely low to ensure that I wasn't running into memory issues with depth.

When I call fit, I receive the error posted above. I've attached a screenshot of what the Yarn UI looks like right before the job fails. Here is the line being referenced in the UI; I'm not sure why it points to the SparseVector line. I've tried doing an explicit conversion to a dense vector within a UDF just to confirm, and the error still occurs.
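The dense-conversion experiment described above can be sketched roughly as follows; the `toDense` UDF name is illustrative, and `assembled` is the hypothetical DataFrame from the pipeline sketch earlier:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Illustrative sketch: force every feature vector to its dense representation
// before training, to rule out the SparseVector code path.
val toDense = udf((v: Vector) => v.toDense: Vector)
val denseData = assembled.withColumn("features", toDense(col("features")))
```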

The size of the cluster or executors does not seem to affect the outcome, but using only a subset of my data does. It works if I only take 1 million rows, for example. This implies that it's a data size issue, but I have not yet been able to resolve it.
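A sketch of the subset comparison described above (the 1-million-row figure is from the report; `assembled` is again the hypothetical DataFrame from the pipeline sketch):

```scala
// Illustrative: the same fit call succeeds on ~1M rows but fails on the full ~27M rows.
val subset = assembled.limit(1000000)
val subsetModel = xgb.fit(subset)      // completes
// val fullModel = xgb.fit(assembled)  // fails with the XGBoostError shown above
```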

I've tried various combinations of Spark configuration parameters without success, mainly around these three:

```
spark.task.cpus          4
spark.executor.cores     8
spark.executor.instances 47
```
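For reference, a sketch of how these settings relate to the classifier configuration above; in a Zeppelin notebook they would normally be set in the interpreter settings rather than in code, so the SparkSession builder below is purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: each XGBoost worker runs as a single Spark task, so
//   concurrent task slots = 47 executors * (8 cores / 4 cores per task) = 94,
// which comfortably covers the 40 workers, and nthread (4) matches spark.task.cpus.
val spark = SparkSession.builder()
  .config("spark.task.cpus", "4")
  .config("spark.executor.cores", "8")
  .config("spark.executor.instances", "47")
  .getOrCreate()
```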

I'm at a loss as to how to address this. If anyone has suggestions, please let me know.

[Screenshot: Yarn UI shortly before the job fails (Screen Shot 2021-05-01 at 2 24 19 AM)]
FelixYBW commented 3 years ago

You can find more details in the executor log files, which are accessible from the YARN UI. To confirm whether it is an out-of-memory issue, you can collect sar data on each worker node; if it is, used memory minus cached memory will climb to about 99% before the job fails.