databricks / xgboost-linux64

Databricks Private xgboost Linux64 fork

count:poisson regression executors dying with java.lang.NumberFormatException: For input string: "inf" #1

Open nightflight-dk opened 6 years ago

nightflight-dk commented 6 years ago

Hello, the error below keeps killing my executors when I try count:poisson regression.

Environment info

Operating System: Databricks Runtime 4.1 ML
XGBoost included in the runtime: 4.1 ML Beta (includes Apache Spark 2.3.0, Scala 2.11)


18/06/27 17:08:25 WARN BlockManager: Putting block rdd_4919_0 failed due to exception java.lang.NumberFormatException: For input string: "inf".
18/06/27 17:08:25 WARN BlockManager: Putting block rdd_4919_5 failed due to exception java.lang.NumberFormatException: For input string: "inf".
18/06/27 17:08:25 WARN BlockManager: Block rdd_4919_5 could not be removed as it was not found on disk or in memory
18/06/27 17:08:25 WARN BlockManager: Block rdd_4919_0 could not be removed as it was not found on disk or in memory
18/06/27 17:08:25 ERROR Executor: Exception in task 0.0 in stage 2292.0 (TID 180564)
java.lang.NumberFormatException: For input string: "inf"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
    at java.lang.Float.parseFloat(Float.java:451)
    at java.lang.Float.valueOf(Float.java:416)
    at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:189)
    at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:194)
    at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:64)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:140)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$buildDistributedBoosters$1.apply(XGBoost.scala:117)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:349)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:347)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1092)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:347)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:298)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:111)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
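
The top frames of the trace show Float.valueOf rejecting the string "inf". The JDK float parser accepts the spelling "Infinity" but not the C-style "inf", which is why the call throws. Below is a minimal sketch of a tolerant parser in plain Scala; `InfParsing` and `parseMetric` are hypothetical names used only for illustration and are not part of xgboost4j.

```scala
import java.util.Locale
import scala.util.Try

object InfParsing {
  // Hypothetical helper (not an xgboost4j API): maps the C-style spellings
  // "inf", "-inf" and "nan" to Float constants, and hands every other
  // input to the JDK parser unchanged.
  def parseMetric(raw: String): Float = {
    val s = raw.trim
    s.toLowerCase(Locale.ROOT) match {
      case "inf" | "+inf" => Float.PositiveInfinity
      case "-inf"         => Float.NegativeInfinity
      case "nan" | "-nan" => Float.NaN
      case _              => java.lang.Float.parseFloat(s)
    }
  }

  // True when the stock JDK parser throws on the given string.
  def jdkRejects(raw: String): Boolean =
    Try(java.lang.Float.parseFloat(raw)).isFailure
}
```

With this sketch, `InfParsing.jdkRejects("inf")` is true while `InfParsing.parseMetric("inf")` returns positive infinity. It only illustrates the parsing mismatch; it does not explain why the evaluation metric became infinite in the first place.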

Steps to reproduce

35 ordinal features, no missing values

import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}

System.setSecurityManager(null) // necessary, otherwise java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )

val paramMap = List(
  "eval_metric" -> "mae",
  "nworkers" -> sc.getExecutorMemoryStatus.size,
  "objective" -> "count:poisson", // Job aborted due to stage failure: ExecutorLostFailure
  "early_stopping_rounds" -> 10).toMap

val xgboostEstimator = new XGBoostEstimator(paramMap)
val xgboostModel = xgboostEstimator.fit(train)

I have tried limiting the training set to 1000 data points and using different numbers of workers (1, 6, 24). Other resource-bound params made no impact either: eta, useExternalMemory.

This also appears to be non-deterministic: training passed once, only to hit the same failure when transforming the (also sampled) test set.

Thanks for considering. Please keep up the great work.

Damian

saikiranvadhi commented 3 years ago

@nightflight-dk I'm facing the same issue as you. Have you found a solution for this? I'm running this on Spark 3.0 with xgboost4j_spark_2_12_1_3_1.jar and xgboost4j_2_12_1_3_1.jar.

21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_4 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_0 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_6 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Putting block rdd_29_2 failed due to exception java.lang.NumberFormatException: For input string: "inf".
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_4 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_6 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_0 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 WARN BlockManager: Block rdd_29_2 could not be removed as it was not found on disk or in memory
21/02/13 06:22:20 ERROR Executor: Exception in task 4.0 in stage 8.0 (TID 27)
java.lang.NumberFormatException: For input string: "inf"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
    at java.lang.Float.parseFloat(Float.java:451)
    at java.lang.Float.valueOf(Float.java:416)
    at ml.dmlc.xgboost4j.java.Booster.evalSet(Booster.java:251)
    at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:215)
    at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:284)
    at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
    at scala.Option.getOrElse(Option.scala:189)
    at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
    at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:416)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:499)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:844)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:844)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:369)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1376)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1303)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1367)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1187)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:318)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
    at org.apache.spark.scheduler.Task.run(Task.scala:117)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
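
Both traces fail while parsing an evaluation result, so the metric value itself was printed as "inf". One way that could happen, offered as an assumption and not as a diagnosis of this issue: count:poisson models the mean through a log link, so a prediction is exp(margin), and a large enough margin overflows a 32-bit float. A single infinite prediction then makes a mean absolute error infinite. The sketch below shows that arithmetic in plain Scala; `PoissonOverflow`, `predictedMean` and `mae` are hypothetical names and do not reproduce XGBoost's internal code.

```scala
object PoissonOverflow {
  // Assumed log link: the predicted mean is exp(margin), narrowed to Float.
  // Values above Float.MaxValue (about 3.4e38) narrow to positive infinity.
  def predictedMean(margin: Double): Float = math.exp(margin).toFloat

  // Mean absolute error over paired labels and predictions.
  def mae(labels: Seq[Float], preds: Seq[Float]): Float = {
    require(labels.nonEmpty && labels.size == preds.size, "inputs must be non-empty and aligned")
    labels.zip(preds).map { case (y, p) => math.abs(y - p) }.sum / labels.size
  }
}
```

Here `PoissonOverflow.predictedMean(88.0)` is still finite, while `PoissonOverflow.predictedMean(89.0)` is infinite, and any `mae` that includes that prediction is infinite as well. Whether the margins in this job actually grew that large is not something the logs above confirm.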