monicasenapati opened 3 years ago
The next() method of the Scala iterator somehow failed to return a valid data batch.
Can you double-check if the data iterator is working properly? Maybe one of the partitions is empty?
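For example, one quick way to check for empty partitions (a sketch; trainDF stands in for your training DataFrame):
// Count the rows in each partition to spot empty ones (trainDF is a placeholder)
val partitionSizes = trainDF.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
partitionSizes.filter(_._2 == 0).foreach { case (idx, _) => println(s"Partition $idx is empty") }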
This is my parameter map and the pipeline:
val xgbParam = Map(
  "missing" -> 0,
  "max_depth" -> 4,
  "objective" -> "binary:logistic",
  "num_round" -> 40,
  "colsample_bytree" -> 0.3,
  "subsample" -> 0.5,
  "num_workers" -> 450,
  "tree_method" -> "hist")
// Create XGBoost classifier & set the features vector and set the label column
val xgb = new XGBoostClassifier(xgbParam)
  .setLabelCol("isPossiblySensitive")
  .setFeaturesCol("features")
// Hyperparameter grid for the eta and max_depth parameters
val xgbParamGrid = new ParamGridBuilder()
  .addGrid(xgb.eta, Array(0.01, 0.1))
  .addGrid(xgb.maxDepth, Array(2, 4))
  .build()
Can you post the full log?
Here is the next() method of the data iterator; it returns null and logs an error message if something goes wrong:
https://github.com/dmlc/xgboost/blob/ad826e913ff62da80cdf1f71fb247d02e6641c83/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/DataBatch.java#L104-L108
Here's the full log: xgb.2.log
The log doesn't give us much of a clue yet. Is it possible to post the data and the training script?
The data is way too large to post on GitHub. That's why it is stored in HDFS on a cluster. Maybe I can post a small part of it. Will that work?
Do you run into the same error when you use a small amount of data?
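For example, you could test with a small random sample (a sketch; df is a placeholder for your full DataFrame):
// Take a ~1% random sample of the data for a quick test (df is a placeholder)
val sampleDF = df.sample(withReplacement = false, fraction = 0.01, seed = 42)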
No. The following is the schema of the DataFrame I am using.
root
|-- userID: long (nullable = true)
|-- isVerified: integer (nullable = true)
|-- friendsCount: integer (nullable = true)
|-- followersCount: integer (nullable = true)
|-- statusesCount: integer (nullable = true)
|-- retweetCount: integer (nullable = true)
|-- isPossiblySensitive: double (nullable = true)
|-- containsHashtag_onehotencoded: vector (nullable = true)
|-- containsLink_onehotencoded: vector (nullable = true)
|-- retweeted_onehotencoded: vector (nullable = true)
|-- mentions_onehotencoded: vector (nullable = true)
|-- friend_onehotencoded: vector (nullable = true)
|-- isFollowedBy_onehotencoded: vector (nullable = true)
It works fine with all columns except "friend_onehotencoded" and "isFollowedBy_onehotencoded". The error occurs if I include these two columns.
If you keep the same set of columns but only reduce the number of rows, does the error disappear?
So far, I have no idea what's causing the error. And it appears that a small sample of the data won't help, if the error only happens with the full data present.
Got it. I took a sample of 1600 rows to test it out. The error continues to occur for all sets of columns on this sample of 1600 rows.
In that case, would you be able to post the 1600 row sample, as well as the training script?
ErrorSample.zip: This contains a training script and the sample data I am trying to train on.
Great! I will try to reproduce the error on my end and investigate the root cause.
Thanks! Now that I know it could be due to missing values, I will investigate too. I will update you if I find something, and I look forward to hearing back from you.
@hcho3 Thank you so much for your time. I appreciate it. I was able to get past that issue now. I discovered it was a bug in my code that was not parsing the input CSV files as I intended. The current issue now appears to be fixed, but I am running into another Spark error, which I will have to fix.
Thanks for following up on this. Do you see a way we can add proper checks in XGBoost to prevent similar errors in the future?
One way I can think of is to validate the data that goes into the model. In my case, an error in my code left a trailing string in one of the columns as it was parsed. Knowing that beforehand would have helped, I guess.
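For instance, a pre-training sanity check along these lines could flag values that fail to parse as numbers (a sketch; df and "someColumn" are placeholders):
import org.apache.spark.sql.functions.col
// Rows where the raw value is present but cannot be cast to Double (placeholder names)
val suspect = df.filter(col("someColumn").isNotNull && col("someColumn").cast("double").isNull)
println(s"Rows with unparseable values: ${suspect.count()}")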
Now I get a Spark error which looks like this:
20/12/18 13:30:33 INFO RabitTracker$TrackerProcessLogger: 2020-12-18 13:30:33,598 INFO @tracker All of 300 nodes getting started
20/12/18 13:30:38 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExecutorLostFailure(49,true,Some(Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.)), stopping SparkContext
20/12/18 13:30:38 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExecutorLostFailure(49,true,Some(Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.)), stopping SparkContext
20/12/18 13:30:38 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExecutorLostFailure(49,true,Some(Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.)), stopping SparkContext
20/12/18 13:30:38 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 9 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:933)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:931)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:931)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2130)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2043)
at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply$mcV$sp(SparkParallelismTracker.scala:119)
at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.TaskFailedListener$$anon$1.run(SparkParallelismTracker.scala:118)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:980)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:978)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:978)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anon$1.run(XGBoost.scala:565)
java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.CallbackRunnable@255cf508 rejected from java.util.concurrent.ThreadPoolExecutor@10096975[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 23482]
I am not sure, but does this make sense to you?
Was this ever resolved? I am experiencing a similar issue:
%scala
package ml.dmlc.xgboost4j.scala.spark2
import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel
class XGBoostRegBridge(
    uid: String,
    _booster: Booster) {
  val xgbRegressionModel = new XGBoostRegressionModel(uid, _booster)
}
import ml.dmlc.xgboost4j.scala.spark2._
import ml.dmlc.xgboost4j.scala.XGBoost
val model = XGBoost.loadModel("/dbfs/FileStore/tmp/xgb53.model")
val bri = new XGBoostRegBridge("uid", model)
bri.xgbRegressionModel.setFeaturesCol("feature_vector")
var pred = bri.xgbRegressionModel.transform(train_sparse)
pred.show()
Job aborted due to stage failure.
Caused by: XGBoostError: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred():
Stack trace:
[bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
[bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xd10) [0x7f0ff880d960]
[bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
[bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
[bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
[bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
[bt] (6) [0x7f1020017ee7]
Stack trace:
[bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
[bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xdc4) [0x7f0ff880da14]
[bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
[bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
[bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
[bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
[bt] (6) [0x7f1020017ee7]
Even trying on just one row doesn't fix it. We can see that the data itself is fine:
train_sparse.filter("ID == 1").show(false)
+-----------+------------------------------------------+
|ID         |feature_vector                            |
+-----------+------------------------------------------+
|1          |(4056,[0,1,1097,2250],[26.0,1.0,1.0,57.0])|
+-----------+------------------------------------------+
@jmpanfil The log you shared indicates an exception was thrown in the JVM during data iteration. (XGBoost fetches data from the Java iterator.)
@trivialfis any guidance on how to troubleshoot this?
@jmpanfil Can you verify your target label is a Double rather than some other type like an Integer? This fixed the error for me.
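For instance, something like this (a sketch; df and the column name "label" are placeholders):
import org.apache.spark.sql.functions.col
// Explicitly cast the label column to Double before training (placeholder names)
val fixed = df.withColumn("label", col("label").cast("double"))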
@dchristle Which target label are you referring to? This isn't the training data, so I don't have labels, just an ID column to identify the subject and the sparse vector column with the feature data.
I see. Can you try explicitly casting all of your input features to doubles, just to be sure?
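For example (a sketch; df and the column names are placeholders):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
// Cast every feature column to DoubleType before assembling the feature vector (placeholder names)
val featureCols = Seq("colA", "colB")
val casted = featureCols.foldLeft(df)((d, c) => d.withColumn(c, col(c).cast(DoubleType)))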
Oh, my mistake: I thought I had included the code for the sparse vector, but I did not. Here's that code, with some data in the same format as mine.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.{Vector, Vectors}
import spark.implicits._
val data = Seq(("f1", 1, 1.0, 0, 1097), ("f2", 1, 57.0, 0, 2250), ("f3", 1, 1.0, 0, 1))
val df = spark.sparkContext.parallelize(data).toDF()
val n_col = 4056
df.printSchema
// Group (column index, value) pairs by ID and assemble one sparse vector per ID
var train_sparse = df.rdd
  .map(r => (r.getInt(1), (r.getInt(4), r.getDouble(2))))
  .groupByKey()
  .map { case (id, feats) => (id, Vectors.sparse(n_col, feats.toSeq)) }
  .toDF("ID", "feature_vector")
Doesn't XGBoost support Int and Long type data anymore? I was able to run on these types last year.
@monicasenapati I am trying to use the xgboost4j-spark library, which has a different API than the xgboost4j library. The xgboost4j-spark transform function uses distributed computing for predictions, which is necessary for me due to the size of my data. My data is highly sparse and in long format, so I'm also trying to avoid high-memory-cost operations on my data. I'm definitely open to alternatives.
@jmpanfil I too use xgboost4j-spark and have similar data: highly sparse and very large. Not sure if the data type could be an issue, though. If you find something, please do let me know, since I am hitting a roadblock too. Thank you!
Hi all,
I ran into this same bug even though I previously thought I had fixed it. Adding "missing" -> 0.0 to the XGBoostClassifier params seemed to fix it. Previously, I had this set to -1.0, which did not work. Removing it triggered an error, which indicated that it was set to NaN.
@jmpanfil @monicasenapati Can you try adding "missing" -> 0.0 to your params map?
@dchristle would you be able to share the code you used to update params on a loaded model?
@dchristle ignore my last question. I forgot there is .setMissing.
var bri = new XGBoostRegBridge("uid", model)
bri.xgbRegressionModel.setFeaturesCol("feature_vector")
bri.xgbRegressionModel.setMissing(0.0F)
var pred = bri.xgbRegressionModel.transform(train_sparse)
The above code works! I can't believe it was that simple. I will have to make sure the predictions are coming out as expected, but it does work! @monicasenapati
I have the following param map:
val xgbParam = Map(
  "missing" -> 0.0,
  "eta" -> 0.1f, // 0.01f
  "max_depth" -> 2,
  "objective" -> "binary:logistic",
  "num_round" -> 10,
  "num_early_stopping_rounds" -> 5,
  "scale_pos_weight" -> 104,
  "num_workers" -> numberofpartitions,
  "tree_method" -> "hist")
I am still getting an error:
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: File "/usr/lib/python3.6/threading.py", line 864, in run
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: self._target(*self._args, **self._kwargs)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: File "/tmp/tracker13545953339351710546.py", line 324, in run
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: self.accept_slaves(nslave)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: File "/tmp/tracker13545953339351710546.py", line 268, in accept_slaves
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: s = SlaveEntry(fd, s_addr)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: File "/tmp/tracker13545953339351710546.py", line 64, in __init__
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger: AssertionError: invalid magic number=542393671 from 92.118.161.13
2021-11-13 01:15:07,783 INFO java.RabitTracker$TrackerProcessLogger:
2021-11-13 01:15:07,790 INFO java.RabitTracker: Tracker Process ends with exit code 0
2021-11-13 01:15:07,790 INFO java.RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 0
2021-11-13 01:15:07,795 INFO XGBoostSpark: Rabit returns with exit code 0
2021-11-14 00:11:41,044 ERROR java.RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 33.0 failed 4 times, most recent failure: Lost task 15.3 in stage 33.0 (TID 82650) (10.10.1.27 executor 5): java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:843)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:497)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
Hi, I have a pipeline of hyperparameter tuning, an evaluator, and cross-validation on an XGBoostClassifier model. However, I run into the following issue and was wondering if I could get some help understanding what it means. Any suggestion or insight will be greatly appreciated. Also, I can provide more information if required.
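For reference, a minimal sketch of such a pipeline (the evaluator, fold count, label/feature column names, and trainDF here are assumptions, not the poster's actual code):
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Placeholder classifier, grid, and evaluator for illustration
val xgb = new XGBoostClassifier(xgbParam)
  .setLabelCol("label")
  .setFeaturesCol("features")
val grid = new ParamGridBuilder()
  .addGrid(xgb.maxDepth, Array(2, 4))
  .build()
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
val cv = new CrossValidator()
  .setEstimator(xgb)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
val cvModel = cv.fit(trainDF) // trainDF is a placeholder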