dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoost4j-spark train failed on the CPU hosts #10926

Open NvTimLiu opened 2 hours ago

NvTimLiu commented 2 hours ago

XGBoost4j-spark training failed on the CPU hosts.

Environment:

1. OS: ubuntu22.04 / NGC

2. Spark version: 3.5.1

3. XGBoost4j-spark: xgboost4j-spark-gpu_2.12-2.2.0-SNAPSHOT.jar

4. rapids-4-spark: 24.12.0-SNAPSHOT

5. Failed test: agaricus train (a minimal repro sketch follows the list)
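
For context, here is a minimal sketch of the CPU-side call that appears to fail. This is an illustrative reconstruction, not the exact NVIDIA example code; the CSV path, app name, and object name are assumptions, and the column names (`label`, `feature_0` .. `feature_N`) are taken from the analyzed plan in the driver log below. The raw agaricus CSV exposes individual feature columns and no assembled vector column named `features`, so fitting with the default `featuresCol` cannot resolve it.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.SparkSession

object AgaricusCpuRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("agaricus-cpu-repro").getOrCreate()

    // Hypothetical path; the real test reads the agaricus CSV dataset.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/agaricus/train.csv")

    val classifier = new XGBoostClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features") // default name; no such column exists in the raw CSV

    // On the CPU hosts this fails during preprocessing with
    // AnalysisException [UNRESOLVED_COLUMN.WITH_SUGGESTION], as in the driver log below.
    classifier.fit(df)

    spark.stop()
  }
}
```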


 + ngc batch exec --commandline bash -c 'cat /raid/tmp/driver-agaricus-Main-CPU.log' 7117740
  WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  INFO SparkContext: Running Spark version 3.5.0
  INFO SparkContext: OS info Linux, 5.4.0-107-generic, amd64
  INFO SparkContext: Java version 1.8.0_402
  WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/and LOCAL_DIRS in YARN).
  INFO ResourceUtils: ==============================================================
  INFO ResourceUtils: No custom resources configured for spark.driver.
  INFO ResourceUtils: ==============================================================
  INFO SparkContext: Submitted application: Agaricus-Mai-csv
  INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory t: 32768, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
  INFO ResourceProfile: Limiting resource is cpus at 8 tasks per executor
  INFO ResourceProfileManager: Added ResourceProfile id: 0
  INFO SecurityManager: Changing view acls to: root
  INFO SecurityManager: Changing modify acls to: root
  INFO SecurityManager: Changing view acls groups to: 
  INFO SecurityManager: Changing modify acls groups to: 
  INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root; groups with view ers with modify permissions: root; groups with modify permissions: EMPTY
  INFO Utils: Successfully started service 'sparkDriver' on port 39803.
  INFO SparkEnv: Registering MapOutputTracker
  INFO SparkEnv: Registering BlockManagerMaster
  INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
  INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
  INFO SparkEnv: Registering BlockManagerMasterHeartbeat
  INFO DiskBlockManager: Created local directory at /raid/tmp/blockmgr-0034f8a7-578b-4364-bce3-68225f9bf27b
  INFO MemoryStore: MemoryStore started with capacity 8.4 GiB
  INFO SparkEnv: Registering OutputCommitCoordinator
  INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
  INFO Utils: Successfully started service 'SparkUI' on port 4040.
  INFO SparkContext: Added JAR file:///test/xgboost4j-spark.jar at spark://127.0.0.1:39803/jars/xgboost4j-spark.jar with timestamp 
  INFO SparkContext: Added JAR file:/test/xgb-apps.jar at spark://127.0.0.1:39803/jars/xgb-apps.jar with timestamp 1729610887859
  INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
  INFO TransportClientFactory: Successfully created connection to /127.0.0.1:7077 after 41 ms (0 ms spent in bootstraps)
  INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20241022152809-0001
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/0 on worker-20241022145613-127.0.0.1-35209  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/0 on hostPort 127.0.0.1:35209 with 8 core(s), 32.0 GiB RAM
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/1 on worker-20241022145613-127.0.0.1-35209  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/1 on hostPort 127.0.0.1:35209 with 8 core(s), 32.0 GiB RAM
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/2 on worker-20241022145611-127.0.0.1-42465  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/2 on hostPort 127.0.0.1:42465 with 8 core(s), 32.0 GiB RAM
  INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/3 on worker-20241022145611-127.0.0.1-42465  8 core(s)
  INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/3 on hostPort 127.0.0.1:42465 with 8 core(s), 32.0 GiB RAM
  INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40511.
  INFO NettyBlockTransferService: Server created on 127.0.0.1:40511
  INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
  INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 127.0.0.1, 40511, None)
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:40511 with 8.4 GiB RAM, BlockManagerId(driver, 127.0.0.1, 40511, 
  INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 127.0.0.1, 40511, None)
  INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 127.0.0.1, 40511, None)
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/3 is now RUNNING
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/2 is now RUNNING
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/1 is now RUNNING
  INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/0 is now RUNNING
  INFO SingleEventLogFileWriter: Logging events to file:/tmp/spark-events/app-20241022152809-0001.inprogress
  INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
  INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
  INFO SharedState: Warehouse path is 'file:/spark-warehouse'.
  WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
  INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
  INFO MetricsSystemImpl: s3a-file-system metrics system started
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 2,  ResourceProfileId 0
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:40013 with 16.9 GiB RAM, BlockManagerId(2, 127.0.0.1, 40013, None)
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 0,  ResourceProfileId 0
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:35771 with 16.9 GiB RAM, BlockManagerId(0, 127.0.0.1, 35771, None)
  INFO InMemoryFileIndex: It took 83 ms to list leaf files for 1 paths.
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 1,  ResourceProfileId 0
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor)  ID 3,  ResourceProfileId 0
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:38295 with 16.9 GiB RAM, BlockManagerId(1, 127.0.0.1, 38295, None)
  INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:45991 with 16.9 GiB RAM, BlockManagerId(3, 127.0.0.1, 45991, None)
  INFO InMemoryFileIndex: It took 26 ms to list leaf files for 1 paths.

 ------ Training ------
 Exception in thread "main"  WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `features` cannot be resolved. Did you mean one of the following? [`feature_0`, `feature_1`, `feature_2`, `feature_3`, `feature_4`].;
 'Project [cast(label#509 as float) AS label#639, 'features]
 +- Project [cast(label#0 as double) AS label#509, feature_0#1, feature_1#2, feature_2#3, feature_3#4, feature_4#5, feature_5#6, feature_6#7, feature_7#8, #10, feature_10#11, feature_11#12, feature_12#13, feature_13#14, feature_14#15, feature_15#16, feature_16#17, feature_17#18, feature_18#19, feature_19#20, _21#22, feature_22#23, ... 103 more fields]
    +- Relation eature_1#2,feature_2#3,feature_3#4,feature_4#5,feature_5#6,feature_6#7,feature_7#8,feature_8#9,feature_9#10,feature_10#11,feature_11#12,feature_12#13,featureeature_15#16,feature_16#17,feature_17#18,feature_18#19,feature_19#20,feature_20#21,feature_21#22,feature_22#23,... 103 more fields] csv

    at org.apache.spark.sql.errors.QueryCompilationErrors$.unresolvedAttributeError(QueryCompilationErrors.scala:307)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$te(CheckAnalysis.scala:147)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:266)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:264)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:264)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:264)
    at scala.collection.immutable.Stream.foreach(Stream.scala:533)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:264)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:91)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
    at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:4363)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1541)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.preprocess(XGBoostEstimator.scala:210)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.preprocess$(XGBoostEstimator.scala:188)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.preprocess(XGBoostClassifier.scala:33)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:415)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train$(XGBoostEstimator.scala:409)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:33)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:33)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
    at com.nvidia.spark.examples.agaricus.Main$.$anonfun$main$8(Main.scala:77)
    at com.nvidia.spark.examples.utility.Benchmark.time(Benchmark.scala:29)
    at com.nvidia.spark.examples.agaricus.Main$.main(Main.scala:77)
    at com.nvidia.spark.examples.agaricus.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  INFO SparkContext: Invoking stop() from shutdown hook
  INFO SparkContext: SparkContext is stopping with exitCode 0.
  INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
  INFO StandaloneSchedulerBackend: Shutting down all executors
  INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
  INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  INFO MemoryStore: MemoryStore cleared
  INFO BlockManager: BlockManager stopped
  INFO BlockManagerMaster: BlockManagerMaster stopped
  INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  ERROR TransportRequestHandler: Error sending result StreamResponse[streamId=/jars/xgboost4j-8696354,body=FileSegmentManagedBuffer[file=/test/xgboost4j-spark.jar,offset=0,length=338696354]] to /127.0.0.1:33046; closing connection
 io.netty.channel.StacklessClosedChannelException
    at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
  INFO SparkContext: Successfully stopped SparkContext
  INFO ShutdownHookManager: Shutdown hook called
  INFO ShutdownHookManager: Deleting directory /tmp/spark-1d29e677-7338-4fc8-bec8-e57284298ca1
  INFO ShutdownHookManager: Deleting directory /raid/tmp/spark-0dcd6655-62da-49f8-ba12-59f4e9c5739c
  INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
  INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
  INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

 real   0m15.488s
 user   0m26.418s
 sys    0m3.454s

 0
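
One possible CPU-side workaround (an assumption on my part, not a verified fix): assemble the individual `feature_*` columns into a single vector column named `features` with Spark ML's VectorAssembler before calling fit, so the default `featuresCol` resolves. A sketch, continuing from the DataFrame `df` read in the repro above:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// `df` is the raw agaricus DataFrame from the repro sketch above.
val featureNames = df.columns.filter(_.startsWith("feature_"))

// Build the vector column the default featuresCol expects.
val assembled = new VectorAssembler()
  .setInputCols(featureNames)
  .setOutputCol("features")
  .transform(df)

// With the assembled vector column present, training should proceed on the CPU path.
val model = new XGBoostClassifier()
  .setLabelCol("label")
  .fit(assembled)
```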
NvTimLiu commented 2 hours ago

@wbo4958