ajordan-apixio opened this issue 3 years ago
You can find more details in the executor's log file, which is accessible from the YARN UI. To confirm whether it's an out-of-memory issue, you can collect sar data on each worker node; the used memory minus cached memory will climb to about 99% before the failure.
Hey all,
Hoping I can get some help with this. I have a dataset with about 27 million rows and 300 columns, and I cannot figure out how to get training to work. It always fails with this error:
I don't see anything else useful anywhere in the logs. I'm using XGBoost 1.1.0, Spark 2.3.2, and running the code in a Zeppelin notebook.
I'm using a simple Spark ML pipeline to create the vector for training. I am aware of the compatibility issues with VectorAssembler documented here, but my dataset has very few zeros and no nulls. Here is the pipeline I'm using to assemble the vectors:
```scala
val featureIndexers = Array("company", "business")
  .map(c => new StringIndexer().setHandleInvalid("keep").setInputCol(c).setOutputCol(c + "_index").fit(trainingData))

val oneHot = new OneHotEncoderEstimator()
  .setInputCols(Array("company_index", "business_index"))
  .setOutputCols(Array("company_ohe", "business_ohe"))

val vectorAssembler = new VectorAssembler()
  .setInputCols(colsFeatures.toArray)
  .setOutputCol("features")
```
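These stages are then combined and applied in the usual way; a minimal sketch (assuming `trainingData` and `colsFeatures` are as above, and calling the result `assembled`):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Combine the fitted indexers, the encoder, and the assembler into one pipeline.
val stages: Array[PipelineStage] = featureIndexers ++ Array[PipelineStage](oneHot, vectorAssembler)
val pipeline = new Pipeline().setStages(stages)

// Fit the remaining stages and produce the "features" vector column used for training.
val assembled = pipeline.fit(trainingData).transform(trainingData)
```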
And here is the XGBoostClassifier:
I intentionally set the max depth extremely low to ensure that I wasn't running into memory issues caused by tree depth.
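For reference, a minimal sketch of the classifier setup (the parameter values here are illustrative rather than my exact ones):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Illustrative values only; the real configuration uses a deliberately tiny max depth.
val xgb = new XGBoostClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setObjective("binary:logistic")
  .setMaxDepth(2)      // intentionally very low
  .setNumRound(10)
  .setNumWorkers(8)
```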
When I call fit, I receive the error posted above. I've attached a screenshot of what the YARN UI looks like right before the job fails. Here is the line being referenced in the UI. I'm not sure why it points to the SparseVector line; I've tried explicitly converting to dense vectors within a UDF just to confirm, and the same error still occurred.
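The conversion I tried looked roughly like this (a sketch; `assembled` is the output of the pipeline above, and my real column names may differ):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Force every feature vector into a dense representation before training.
val toDense = udf((v: Vector) => (v.toDense: Vector))
val denseData = assembled.withColumn("features", toDense(col("features")))
```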
The size of the cluster or executors does not seem to affect the outcome, but using only a subset of my data does. It works if I only take 1 million rows, for example. This implies that it's a data size issue, but I have not yet been able to resolve it.
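Concretely, something as simple as capping the input succeeds (a sketch, reusing the names from the snippets above):

```scala
// Training succeeds on a ~1M-row slice of the same assembled data,
// but fails on the full ~27M rows.
val subset = denseData.limit(1000000)
val model = xgb.fit(subset)       // works
// val model = xgb.fit(denseData) // fails with the error above
```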
I've tried various combinations of Spark configuration parameters without success, mainly around these three:
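For illustration only, the kind of settings I've been varying look like this (parameter names and values here are hypothetical; in Zeppelin they normally go on the Spark interpreter, since executor settings must be applied before the context starts):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical values for illustration, not my exact configuration.
val spark = SparkSession.builder()
  .config("spark.executor.memory", "16g")
  .config("spark.executor.memoryOverhead", "4g")
  .config("spark.executor.instances", "20")
  .getOrCreate()
```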
I'm at a loss as to how to address this. If anyone has suggestions, please let me know.