dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.37k stars 8.74k forks

Does the latest XGBoost4j-Spark build support `reg:pseudohubererror` as objective? #5698

Closed pancodia closed 4 years ago

pancodia commented 4 years ago

Does the latest XGBoost4j-Spark build support reg:pseudohubererror as objective?

I am trying to use XGBoost4j-Spark to build a regression model with pseudohubererror as the objective.

Environment:

I launched the spark-shell on EMR master node as:

spark-shell \
    --jars hdfs:///user/lib/xgboost4j_2.11-1.1.0.jar,hdfs:///user/lib/xgboost4j-spark_2.11-1.1.0.jar,hdfs:///user/lib/MyProjectPoc-1.0.jar \
    --conf spark.dynamicAllocation.enabled=false \
    --conf spark.executor.cores=5 \
    --conf spark.task.cpus=4 \
    --conf spark.executor.memory=16G \
    --conf spark.executor.memoryOverhead=4G \
    --conf spark.driver.memoryOverhead=4G \
    --conf spark.driver.memory=16G \
    --conf spark.driver.cores=5 \
    --conf spark.executor.instances=20 \
    --conf spark.default.parallelism=200 \
    --conf spark.sql.shuffle.partitions=200 \
    --conf spark.network.timeout=800s \
    --conf spark.executor.heartbeatInterval=60s \
    --conf spark.memory.fraction=0.80 \
    --conf spark.memory.storageFraction=0.30 \
    --conf spark.yarn.scheduler.reporterThread.maxFailures=5 \
    --conf spark.storage.level=MEMORY_AND_DISK_SER \
    --conf spark.rdd.compress=true \
    --conf spark.shuffle.compress=true \
    --conf spark.shuffle.spill.compress=true

The following is the Scala code used to fit the model (sorry, I cannot share the dataset):

val xgbParam = Map(
    "objective" -> "reg:pseudohubererror",
    "num_round" -> 1,
    "eta" -> 0.05,
    "max_depth" -> 6,
    "missing" -> MISSING_VALUE,
    "num_workers" -> 20,
    "nthread" -> 4,
    "num_early_stopping_rounds" -> 10,
    "maximize_evaluation_metrics" -> false,
    "verbosity" -> 2
)

val xgbRegressor = {
    new XGBoostRegressor(xgbParam)
        .setLabelCol(labelColName)
        .setFeaturesCol("features")
}

val xgb_huber_model = xgbRegressor.fit(trainingInput)

The following is the error message from one of the executors:

20/05/22 16:14:38 WARN TaskSetManager: Lost task 19.1 in stage 2.0 (TID 231, ip-10-0-1-165.ec2.internal, executor 10): ml.dmlc.xgboost4j.java.XGBoostError: [16:13:44] /workspace/src/objective/objective.cc:26: Unknown objective function: `reg:pseudohubererror`
Objective candidate: survival:aft
Objective candidate: binary:hinge
Objective candidate: multi:softmax
Objective candidate: multi:softprob
Objective candidate: rank:pairwise
Objective candidate: rank:ndcg
Objective candidate: rank:map
Objective candidate: reg:squarederror
Objective candidate: reg:squaredlogerror
Objective candidate: reg:logistic
Objective candidate: binary:logistic
Objective candidate: binary:logitraw
Objective candidate: reg:linear
Objective candidate: count:poisson
Objective candidate: survival:cox
Objective candidate: reg:gamma
Objective candidate: reg:tweedie

Stack trace:
  [bt] (0) /mnt/yarn/usercache/hadoop/appcache/application_1589774373197_0126/container_1589774373197_0126_01_000015/tmp/libxgboost4j1915509404376697341.so(xgboost::ObjFunction::Create(std::string const&, xgboost::GenericParameter const*)+0x85a) [0x7f47cefafb0a]
  [bt] (1) /mnt/yarn/usercache/hadoop/appcache/application_1589774373197_0126/container_1589774373197_0126_01_000015/tmp/libxgboost4j1915509404376697341.so(xgboost::LearnerConfiguration::ConfigureObjective(xgboost::LearnerTrainParam const&, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > >*)+0x3bc) [0x7f47cef2e69c]
  [bt] (2) /mnt/yarn/usercache/hadoop/appcache/application_1589774373197_0126/container_1589774373197_0126_01_000015/tmp/libxgboost4j1915509404376697341.so(xgboost::LearnerConfiguration::Configure()+0x4c3) [0x7f47cef3af13]
  [bt] (3) /mnt/yarn/usercache/hadoop/appcache/application_1589774373197_0126/container_1589774373197_0126_01_000015/tmp/libxgboost4j1915509404376697341.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x69) [0x7f47cef22649]
  [bt] (4) /mnt/yarn/usercache/hadoop/appcache/application_1589774373197_0126/container_1589774373197_0126_01_000015/tmp/libxgboost4j1915509404376697341.so(XGBoosterUpdateOneIter+0x59) [0x7f47cee0d1b9]
  [bt] (5) [0x7f480d018427]

    at ml.dmlc.xgboost4j.java.Rabit.checkCall(Rabit.java:53)
    at ml.dmlc.xgboost4j.java.Rabit.shutdown(Rabit.java:83)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$buildDistributedBooster(XGBoost.scala:381)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainForNonRanking$1.apply(XGBoost.scala:455)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainForNonRanking$1.apply(XGBoost.scala:450)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1164)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
hcho3 commented 4 years ago

No, the pseudo-Huber loss was added after the 1.1.0 release. You should use the SNAPSHOT version instead.
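For reference, this is the loss that the missing objective computes. A minimal Scala sketch of the math, not XGBoost's actual implementation: the `PseudoHuber` object and the `delta` parameter are illustrative (early upstream versions fix the slope at 1).

```scala
// Pseudo-Huber loss: quadratic near zero, linear for large residuals.
// loss(a) = d^2 * (sqrt(1 + (a/d)^2) - 1), where a = pred - label.
// grad and hess are the first/second derivatives w.r.t. the prediction,
// which is what a gradient-boosting objective supplies per row.
object PseudoHuber {
  def loss(pred: Double, label: Double, delta: Double = 1.0): Double = {
    val a = pred - label
    delta * delta * (math.sqrt(1.0 + (a / delta) * (a / delta)) - 1.0)
  }
  def grad(pred: Double, label: Double, delta: Double = 1.0): Double = {
    val a = pred - label
    a / math.sqrt(1.0 + (a / delta) * (a / delta))  // bounded by delta
  }
  def hess(pred: Double, label: Double, delta: Double = 1.0): Double = {
    val a = pred - label
    math.pow(1.0 + (a / delta) * (a / delta), -1.5)  // 1 at a = 0, decays with |a|
  }
}
```

The bounded gradient is what makes the loss robust to outliers compared with `reg:squarederror`.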

pancodia commented 4 years ago

Thanks for the lightning-fast reply. Is there a SNAPSHOT build for Scala 2.11 that I can download?

hcho3 commented 4 years ago

Yes, you can use either 2.11 or 2.12. Follow the instructions in https://xgboost.readthedocs.io/en/latest/jvm/index.html#access-snapshot-version.

FYI, you should consider upgrading Scala to 2.12 in the near future, since XGBoost will be moving to Spark 3.0.

pancodia commented 4 years ago

Thanks

Thank you for the reminder. Since my current setup uses Spark 2.4.5 and Scala 2.11, I have to use the build for 2.11. If my model works fine, we will probably use the 2.12 build in production.

pancodia commented 4 years ago

Is there a SNAPSHOT build that has DMLC_USE_S3=1 and DMLC_USE_HDFS=1 enabled? Do I have to compile it myself?

hcho3 commented 4 years ago

No, the snapshot builds don’t support HDFS or S3. You should compile it yourself.
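A rough sketch of such a from-source build, under the assumption that the 1.x `jvm-packages` tree drives the native build through `create_jni.py`, whose `CONFIG` map carries `USE_HDFS`/`USE_S3` switches; the flag names come from the question above, and the exact layout may differ between versions, so verify against the documentation linked below.

```shell
# Sketch: build xgboost4j/xgboost4j-spark jars with HDFS and S3 support.
# Requires Git, Maven, CMake and a JDK on the build machine.
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost/jvm-packages
# Flip the filesystem switches in the native-build config before packaging
# (assumption: create_jni.py holds "USE_HDFS"/"USE_S3" entries set to "OFF"):
sed -i 's/"USE_HDFS": "OFF"/"USE_HDFS": "ON"/; s/"USE_S3": "OFF"/"USE_S3": "ON"/' create_jni.py
# Package the jars, skipping the test suite for speed:
mvn -DskipTests package
```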

pancodia commented 4 years ago

Thanks. Is there an up-to-date guide on how to build the snapshot, or some other reference? Some guides I found online are outdated.


hcho3 commented 4 years ago

Take a look at https://xgboost.readthedocs.io/en/latest/jvm/index.html#installation-from-source