dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[jvm-packages] XGBoostClassifier training fails with large data on a multi-node cluster #6489

Open monicasenapati opened 3 years ago

monicasenapati commented 3 years ago

Hi, I have a pipeline of hyperparameter tuning, an evaluator, and cross-validation on an XGBoostClassifier model. However, I run into the following issue and was wondering if I could get some help understanding what it means. Any suggestion or insight would be greatly appreciated. I can provide more information if required.

20/12/10 09:39:34 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExceptionFailure(ml.dmlc.xgboost4j.java.XGBoostError,[09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred(): 
Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2d55) [0x7f6600a5b595]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]

Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2664) [0x7f6600a5aea4]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]

,[Ljava.lang.StackTraceElement;@2c0925ec,ml.dmlc.xgboost4j.java.XGBoostError: [09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred(): 
Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2d55) [0x7f6600a5b595]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]

Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2664) [0x7f6600a5aea4]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]

    at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
    at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
    at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
    at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:790)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainForNonRanking$1.apply(XGBoost.scala:451)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainForNonRanking$1.apply(XGBoost.scala:450)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
hcho3 commented 3 years ago

The next() method of the Scala iterator somehow failed to return a valid data batch.

https://github.com/dmlc/xgboost/blob/0d483cb7c134977874e19bffd16871a483820825/jvm-packages/xgboost4j/src/native/xgboost4j.cpp#L76-L81

Can you double-check if the data iterator is working properly? Maybe one of the partitions is empty?
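A quick way to test the empty-partition hypothesis is to count the rows in each partition. The Spark call is shown in a comment (assuming the training DataFrame is named `df`); the core logic is sketched on plain Scala iterators so the snippet is self-contained.

```scala
// With Spark, per-partition row counts can be collected via:
//   df.rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }.collect()
// XGBoost4J-Spark builds one DMatrix per partition, so a partition that
// yields no rows gives the native data iterator nothing to return.

// The same logic on plain iterators:
def partitionSizes[T](partitions: Seq[Iterator[T]]): Seq[Int] =
  partitions.map(_.size)

// Indices of partitions that contained no rows at all.
def emptyPartitionIds(sizes: Seq[Int]): Seq[Int] =
  sizes.zipWithIndex.collect { case (0, i) => i }

val sizes = partitionSizes(Seq(Iterator(1, 2, 3), Iterator.empty, Iterator(4)))
// emptyPartitionIds(sizes) == Seq(1): partition 1 is empty
```

If any partition comes back empty, repartitioning the DataFrame before training (or lowering `num_workers`) is worth trying.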

monicasenapati commented 3 years ago

This is my parameter map and the pipeline:

val xgbParam = Map(
  "missing" -> 0,
  "max_depth" -> 4,
  "objective" -> "binary:logistic",
  "num_round" -> 40,
  "colsample_bytree" -> 0.3,
  "subsample" -> 0.5,
  "num_workers" -> 450,
  "tree_method" -> "hist")

// Create the XGBoost classifier and set the feature-vector and label columns
val xgb = new XGBoostClassifier(xgbParam)
  .setLabelCol("isPossiblySensitive")
  .setFeaturesCol("features")

// Hyperparameter grid (eta and maxDepth shown here)
val xgbParamGrid = new ParamGridBuilder()
  .addGrid(xgb.eta, Array(0.01, 0.1))
  .addGrid(xgb.maxDepth, Array(2, 4))
  .build()
hcho3 commented 3 years ago

Can you post the full log?

Here is the next() method of the data iterator, and it returns null and logs an error message if something goes wrong: https://github.com/dmlc/xgboost/blob/ad826e913ff62da80cdf1f71fb247d02e6641c83/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/DataBatch.java#L104-L108

monicasenapati commented 3 years ago

Here's the full log: xgb.2.log

hcho3 commented 3 years ago

The log doesn't give us much clue yet. Is it possible to post data and the training script?

monicasenapati commented 3 years ago

The data is way too large to post on GitHub; that's why it is stored in HDFS on a cluster. I can post a small part of it, maybe. Will that work?

hcho3 commented 3 years ago

Do you run into the same error when you use a small amount of data?

monicasenapati commented 3 years ago

No. The following is the schema of the dataframe I am using.

root
 |-- userID: long (nullable = true)
 |-- isVerified: integer (nullable = true)
 |-- friendsCount: integer (nullable = true)
 |-- followersCount: integer (nullable = true)
 |-- statusesCount: integer (nullable = true)
 |-- retweetCount: integer (nullable = true)
 |-- isPossiblySensitive: double (nullable = true)
 |-- containsHashtag_onehotencoded: vector (nullable = true)
 |-- containsLink_onehotencoded: vector (nullable = true)
 |-- retweeted_onehotencoded: vector (nullable = true)
 |-- mentions_onehotencoded: vector (nullable = true)
 |-- friend_onehotencoded: vector (nullable = true)
 |-- isFollowedBy_onehotencoded: vector (nullable = true)

It works fine with all columns except "friend_onehotencoded" and "isFollowedBy_onehotencoded". The error occurs if I include these two columns.

hcho3 commented 3 years ago

If you keep the same set of columns but only reduce the number of rows, does the error disappear?

So far, I have zero idea as to what's causing the error. And it appears that having a small sample of the data won't help, if the error only happens with the full data present.

monicasenapati commented 3 years ago

Got it. I took a sample of 1600 rows to test it out. The error continues to occur for all sets of columns on this sample of 1600 rows.

hcho3 commented 3 years ago

> The error continues to occur for all sets of columns on this sample of 1600 rows.

In that case, would you be able to post the 1600 row sample, as well as the training script?

monicasenapati commented 3 years ago

ErrorSample.zip contains a training script and the sample data I am trying to train on.

hcho3 commented 3 years ago

Great! I will try to reproduce the error on my end and investigate the root cause.

monicasenapati commented 3 years ago

Thanks! Now that I know it could be due to a missing value, I will investigate too. I will update you if I find something, and I look forward to hearing back from you.

monicasenapati commented 3 years ago

> Great! I will try to reproduce the error on my end and investigate the root cause.

@hcho3 Thank you so much for your time; I appreciate it. I was able to get past that issue now. I discovered it was a bug in my code that was not parsing the input CSV files as I intended. The current issue now appears to be fixed, but I am running into another Spark error that I will have to fix.

trivialfis commented 3 years ago

Thanks for following up on this. Do you see a way we could add proper checks to XGBoost to prevent similar errors in the future?

monicasenapati commented 3 years ago

One thing I can think of is a way to validate the data that goes into the model. In my case, an error in my code left a trailing string in one of the columns when it was parsed. Knowing that beforehand would have helped.
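The validation idea above can be sketched as a small pre-flight check (a hypothetical helper, not part of XGBoost): flag any cell of a supposedly numeric CSV column that does not parse cleanly as a Double. A check like this would have caught the trailing string left behind by the parsing bug.

```scala
import scala.util.Try

// Hypothetical validator: return the cells of a numeric column that fail
// to parse as Double (e.g. values with trailing junk from a CSV bug).
def invalidNumerics(column: Seq[String]): Seq[String] =
  column.filterNot(cell => Try(cell.trim.toDouble).isSuccess)

// A clean column passes; a cell with a trailing string is flagged.
invalidNumerics(Seq("26.0", "1.0", "57.0"))      // Seq()
invalidNumerics(Seq("26.0", "1.0extra", "57.0")) // Seq("1.0extra")
```

Running this on each numeric column before vector assembly surfaces parsing problems up front instead of deep inside the native DMatrix construction.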

monicasenapati commented 3 years ago

Now I get a Spark error which looks like this:

20/12/18 13:30:33 INFO RabitTracker$TrackerProcessLogger: 2020-12-18 13:30:33,598 INFO @tracker All of 300 nodes getting started
20/12/18 13:30:38 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExecutorLostFailure(49,true,Some(Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.)), stopping SparkContext
20/12/18 13:30:38 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExecutorLostFailure(49,true,Some(Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.)), stopping SparkContext
20/12/18 13:30:38 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExecutorLostFailure(49,true,Some(Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.)), stopping SparkContext
20/12/18 13:30:38 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 9 cancelled because SparkContext was shut down
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:933)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:931)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
        at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:931)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2130)
        at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2043)
        at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
        at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply$mcV$sp(SparkParallelismTracker.scala:119)
        at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
        at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.TaskFailedListener$$anon$1.run(SparkParallelismTracker.scala:118)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:980)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:978)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
        at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:978)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anon$1.run(XGBoost.scala:565)
java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.CallbackRunnable@255cf508 rejected from java.util.concurrent.ThreadPoolExecutor@10096975[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 23482]

I am not sure, but does this make sense to you?

jmpanfil commented 3 years ago

Was this ever resolved? I am experiencing a similar issue:

%scala
package ml.dmlc.xgboost4j.scala.spark2
import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel
class XGBoostRegBridge(
    uid: String,
    _booster: Booster) {
  val xgbRegressionModel = new XGBoostRegressionModel(uid, _booster)
}

import ml.dmlc.xgboost4j.scala.spark2._
import ml.dmlc.xgboost4j.scala.XGBoost
val model = XGBoost.loadModel("/dbfs/FileStore/tmp/xgb53.model")
val bri = new XGBoostRegBridge("uid", model)
bri.xgbRegressionModel.setFeaturesCol("feature_vector")
var pred = bri.xgbRegressionModel.transform(train_sparse)
pred.show()

Job aborted due to stage failure.
Caused by: XGBoostError: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred(): 
Stack trace:
  [bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
  [bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xd10) [0x7f0ff880d960]
  [bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
  [bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
  [bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
  [bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
  [bt] (6) [0x7f1020017ee7]

Stack trace:
  [bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
  [bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xdc4) [0x7f0ff880da14]
  [bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
  [bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
  [bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
  [bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
  [bt] (6) [0x7f1020017ee7]

Trying on even just one row doesn’t fix it. We can see that the data itself is fine:

train_sparse.filter("ID == 1").show(false)
+-----------+------------------------------------------+
|ID         |feature_vector                            |
+-----------+------------------------------------------+
|1          |(4056,[0,1,1097,2250],[26.0,1.0,1.0,57.0])|
+-----------+------------------------------------------+
trivialfis commented 3 years ago

@jmpanfil The log you shared indicates an exception was thrown in the JVM during data iteration (XGBoost fetches data from the Java iterator).

jmpanfil commented 3 years ago

@trivialfis any guidance on how to troubleshoot this?

dchristle commented 3 years ago

@jmpanfil

Can you verify your target label is a Double rather than some other type like an Integer? This fixed the error for me.

jmpanfil commented 3 years ago

@dchristle

Which target label are you referring to? This isn't the training data, so I don't have labels, just an ID column to identify the subject and the sparse vector column with the feature data.

dchristle commented 3 years ago

> Which target label are you referring to? This isn't the training data, so I don't have labels, just an ID column to identify the subject and the sparse vector column with the feature data.

I see. Can you try explicitly casting all of your input features to doubles, just to be sure?
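In Spark the cast would be `col("x").cast(DoubleType)` on each feature column; the same normalization on raw values can be sketched as:

```scala
// Sketch: widen Int/Long/Float feature values to Double before
// assembling the feature vector, so XGBoost sees a uniform type.
def toDoubleFeature(v: Any): Double = v match {
  case i: Int    => i.toDouble
  case l: Long   => l.toDouble
  case f: Float  => f.toDouble
  case d: Double => d
}

val row: Seq[Any] = Seq(26, 1L, 57.0f, 0.5)
val features = row.map(toDoubleFeature) // Seq(26.0, 1.0, 57.0, 0.5)
```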

jmpanfil commented 3 years ago

Oh, my mistake: I thought I included the code for the sparse vector, but I did not. Here's that code, with some data in the same format as mine.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.{Vector, Vectors}
val data = Seq(("f1", 1, 1.0, 0, 1097), ("f2", 1, 57.0, 0, 2250), ("f3", 1, 1.0, 0, 1))
val df = spark.sparkContext.parallelize(data).toDF()
val n_col = 4056
df.printSchema
var train_sparse = spark.createDataFrame(
  df.rdd
    .map(r => (r.getInt(1), (r.getInt(4), r.getDouble(2))))
    .groupByKey()
    .map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq)))
).toDF("ID", "feature_vector")
monicasenapati commented 3 years ago

Doesn't XGBoost support Int and Long type data anymore? I was able to run on these types last year.

jmpanfil commented 3 years ago

@monicasenapati I am trying to use the xgboost4j-spark library, which has a different API than the xgboost4j library. The xgboost4j-spark transform function uses distributed computing for predictions, which is necessary for me due to the size of my data. My data is highly sparse and in long format, so I'm also trying to avoid high-memory-cost operations on it. I'm definitely open to alternatives.

monicasenapati commented 3 years ago

> @monicasenapati I am trying to use the xgboost4j-spark library, which has a different API than the xgboost4j library. The xgboost4j-spark transform function uses distributed computing for predictions, which is necessary for me due to the size of my data. My data is highly sparse and in long format, so I'm also trying to avoid high-memory-cost operations on it. I'm definitely open to alternatives.

@jmpanfil I too use xgboost4j-spark and have similar data: highly sparse and very large. I'm not sure the data type could be the issue, though. If you find something, please do let me know, since I am hitting a roadblock too. Thank you!

dchristle commented 3 years ago

Hi all,

I ran into this same bug even though I previously thought I had fixed it. Adding "missing" -> 0.0 to the XGBoostClassifier params seemed to fix it. Previously I had it set to -1.0, which did not work, and removing it triggered an error indicating that it was set to NaN.

@jmpanfil @monicasenapati Can you try adding "missing" -> 0.0 to your params map?
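For context, here is a sketch of how a `missing` value behaves when the DMatrix is built (an editorial illustration, not XGBoost's actual code): a cell is skipped when it equals the configured value, and NaN only matches NaN. Spark ML sparse vectors drop zero entries implicitly, which is likely why `missing` must be 0.0 here: it keeps the sparse and dense representations of the same row consistent.

```scala
// Sketch of the `missing` rule: a cell is treated as absent when it
// equals the configured missing value; NaN only matches NaN.
def isMissing(value: Float, missing: Float): Boolean =
  if (missing.isNaN) value.isNaN else value == missing

isMissing(0.0f, 0.0f)      // true  -> implicit zeros of a sparse vector are dropped
isMissing(0.0f, Float.NaN) // false -> zeros would be kept as real values
isMissing(-1.0f, -1.0f)    // true
```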

jmpanfil commented 3 years ago

@dchristle would you be able to share the code you used to update params on a loaded model?

jmpanfil commented 3 years ago

@dchristle ignore my last question. I forgot there is .setMissing.

var bri = new XGBoostRegBridge("uid", model)
  bri.xgbRegressionModel.setFeaturesCol("feature_vector")
  bri.xgbRegressionModel.setMissing(0.0F)
  var pred = bri.xgbRegressionModel.transform(train_sparse)

The above code works! I can't believe it was that simple. I will have to make sure the predictions are coming out as expected, but it does work! @monicasenapati

monicasenapati commented 3 years ago

> Hi all,
>
> I ran into this same bug even though I previously thought I had fixed it. Adding "missing" -> 0.0 to the XGBoostClassifier params seemed to fix it. Previously, I had this set to -1.0, which did not work. Removing it triggered an error, which indicated that it was set to NaN.
>
> @jmpanfil @monicasenapati Can you try adding "missing" -> 0.0 to your params map?

I have the following param map:

val xgbParam = Map(
  "missing" -> 0.0,
  "eta" -> 0.1f, // 0.01f,
  "max_depth" -> 2,
  "objective" -> "binary:logistic",
  "num_round" -> 10,
  "num_early_stopping_rounds" -> 5,
  "scale_pos_weight" -> 104,
  "num_workers" -> numberofpartitions,
  "tree_method" -> "hist")

I am still getting an error:

2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:   File "/usr/lib/python3.6/threading.py", line 864, in run
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:     self._target(*self._args, **self._kwargs)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:   File "/tmp/tracker13545953339351710546.py", line 324, in run
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:     self.accept_slaves(nslave)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:   File "/tmp/tracker13545953339351710546.py", line 268, in accept_slaves
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:     s = SlaveEntry(fd, s_addr)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:   File "/tmp/tracker13545953339351710546.py", line 64, in __init__
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:     assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: AssertionError: invalid magic number=542393671 from 92.118.161.13
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:
2021-11-13 01:15:07,790 INFO java.RabitTracker: Tracker Process ends with exit code 0
2021-11-13 01:15:07,790 INFO java.RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 0
2021-11-13 01:15:07,795 INFO XGBoostSpark: Rabit returns with exit code 0
2021-11-14 00:11:41,044 ERROR java.RabitTracker: Uncaught exception thrown by worker: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 33.0 failed 4 times, most recent failure: Lost task 15.3 in stage 33.0 (TID 82650) (10.10.1.27 executor 5): java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
        at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
        at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
        at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:843)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:497)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)