+ ngc batch exec --commandline bash -c 'cat /raid/tmp/driver-agaricus-Main-CPU.log' 7117740
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO SparkContext: Running Spark version 3.5.0
INFO SparkContext: OS info Linux, 5.4.0-107-generic, amd64
INFO SparkContext: Java version 1.8.0_402
WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
INFO ResourceUtils: ==============================================================
INFO ResourceUtils: No custom resources configured for spark.driver.
INFO ResourceUtils: ==============================================================
INFO SparkContext: Submitted application: Agaricus-Mai-csv
INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 32768, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
INFO ResourceProfile: Limiting resource is cpus at 8 tasks per executor
INFO ResourceProfileManager: Added ResourceProfile id: 0
INFO SecurityManager: Changing view acls to: root
INFO SecurityManager: Changing modify acls to: root
INFO SecurityManager: Changing view acls groups to:
INFO SecurityManager: Changing modify acls groups to:
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root; groups with view permissions: EMPTY; users with modify permissions: root; groups with modify permissions: EMPTY
INFO Utils: Successfully started service 'sparkDriver' on port 39803.
INFO SparkEnv: Registering MapOutputTracker
INFO SparkEnv: Registering BlockManagerMaster
INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
INFO SparkEnv: Registering BlockManagerMasterHeartbeat
INFO DiskBlockManager: Created local directory at /raid/tmp/blockmgr-0034f8a7-578b-4364-bce3-68225f9bf27b
INFO MemoryStore: MemoryStore started with capacity 8.4 GiB
INFO SparkEnv: Registering OutputCommitCoordinator
INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkContext: Added JAR file:///test/xgboost4j-spark.jar at spark://127.0.0.1:39803/jars/xgboost4j-spark.jar with timestamp
INFO SparkContext: Added JAR file:/test/xgb-apps.jar at spark://127.0.0.1:39803/jars/xgb-apps.jar with timestamp 1729610887859
INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
INFO TransportClientFactory: Successfully created connection to /127.0.0.1:7077 after 41 ms (0 ms spent in bootstraps)
INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20241022152809-0001
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/0 on worker-20241022145613-127.0.0.1-35209 (127.0.0.1:35209) with 8 core(s)
INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/0 on hostPort 127.0.0.1:35209 with 8 core(s), 32.0 GiB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/1 on worker-20241022145613-127.0.0.1-35209 (127.0.0.1:35209) with 8 core(s)
INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/1 on hostPort 127.0.0.1:35209 with 8 core(s), 32.0 GiB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/2 on worker-20241022145611-127.0.0.1-42465 (127.0.0.1:42465) with 8 core(s)
INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/2 on hostPort 127.0.0.1:42465 with 8 core(s), 32.0 GiB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20241022152809-0001/3 on worker-20241022145611-127.0.0.1-42465 (127.0.0.1:42465) with 8 core(s)
INFO StandaloneSchedulerBackend: Granted executor ID app-20241022152809-0001/3 on hostPort 127.0.0.1:42465 with 8 core(s), 32.0 GiB RAM
INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40511.
INFO NettyBlockTransferService: Server created on 127.0.0.1:40511
INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 127.0.0.1, 40511, None)
INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:40511 with 8.4 GiB RAM, BlockManagerId(driver, 127.0.0.1, 40511, None)
INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 127.0.0.1, 40511, None)
INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 127.0.0.1, 40511, None)
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/3 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/2 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/1 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20241022152809-0001/0 is now RUNNING
INFO SingleEventLogFileWriter: Logging events to file:/tmp/spark-events/app-20241022152809-0001.inprogress
INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
INFO SharedState: Warehouse path is 'file:/spark-warehouse'.
WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
INFO MetricsSystemImpl: s3a-file-system metrics system started
INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ID 2, ResourceProfileId 0
INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:40013 with 16.9 GiB RAM, BlockManagerId(2, 127.0.0.1, 40013, None)
INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ID 0, ResourceProfileId 0
INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:35771 with 16.9 GiB RAM, BlockManagerId(0, 127.0.0.1, 35771, None)
INFO InMemoryFileIndex: It took 83 ms to list leaf files for 1 paths.
INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ID 1, ResourceProfileId 0
INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ID 3, ResourceProfileId 0
INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:38295 with 16.9 GiB RAM, BlockManagerId(1, 127.0.0.1, 38295, None)
INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:45991 with 16.9 GiB RAM, BlockManagerId(3, 127.0.0.1, 45991, None)
INFO InMemoryFileIndex: It took 26 ms to list leaf files for 1 paths.
------ Training ------
WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `features` cannot be resolved. Did you mean one of the following? [`feature_0`, `feature_1`, `feature_2`, `feature_3`, `feature_4`].;
'Project [cast(label#509 as float) AS label#639, 'features]
+- Project [cast(label#0 as double) AS label#509, feature_0#1, feature_1#2, feature_2#3, feature_3#4, feature_4#5, feature_5#6, feature_6#7, feature_7#8, feature_8#9, feature_9#10, feature_10#11, feature_11#12, feature_12#13, feature_13#14, feature_14#15, feature_15#16, feature_16#17, feature_17#18, feature_18#19, feature_19#20, feature_20#21, feature_21#22, feature_22#23, ... 103 more fields]
+- Relation [label#0,feature_0#1,feature_1#2,feature_2#3,feature_3#4,feature_4#5,feature_5#6,feature_6#7,feature_7#8,feature_8#9,feature_9#10,feature_10#11,feature_11#12,feature_12#13,feature_13#14,feature_14#15,feature_15#16,feature_16#17,feature_17#18,feature_18#19,feature_19#20,feature_20#21,feature_21#22,feature_22#23,... 103 more fields] csv
at org.apache.spark.sql.errors.QueryCompilationErrors$.unresolvedAttributeError(QueryCompilationErrors.scala:307)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$failUnresolvedAttribute(CheckAnalysis.scala:147)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:266)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:264)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:264)
at scala.collection.immutable.Stream.foreach(Stream.scala:533)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:264)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:91)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:4363)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1541)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.preprocess(XGBoostEstimator.scala:210)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.preprocess$(XGBoostEstimator.scala:188)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.preprocess(XGBoostClassifier.scala:33)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:415)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train$(XGBoostEstimator.scala:409)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:33)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:33)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at com.nvidia.spark.examples.agaricus.Main$.$anonfun$main$8(Main.scala:77)
at com.nvidia.spark.examples.utility.Benchmark.time(Benchmark.scala:29)
at com.nvidia.spark.examples.agaricus.Main$.main(Main.scala:77)
at com.nvidia.spark.examples.agaricus.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO SparkContext: Invoking stop() from shutdown hook
INFO SparkContext: SparkContext is stopping with exitCode 0.
INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
INFO StandaloneSchedulerBackend: Shutting down all executors
INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
INFO MemoryStore: MemoryStore cleared
INFO BlockManager: BlockManager stopped
INFO BlockManagerMaster: BlockManagerMaster stopped
INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
ERROR TransportRequestHandler: Error sending result StreamResponse[streamId=/jars/xgboost4j-spark.jar,body=FileSegmentManagedBuffer[file=/test/xgboost4j-spark.jar,offset=0,length=338696354]] to /127.0.0.1:33046; closing connection
io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
INFO SparkContext: Successfully stopped SparkContext
INFO ShutdownHookManager: Shutdown hook called
INFO ShutdownHookManager: Deleting directory /tmp/spark-1d29e677-7338-4fc8-bec8-e57284298ca1
INFO ShutdownHookManager: Deleting directory /raid/tmp/spark-0dcd6655-62da-49f8-ba12-59f4e9c5739c
INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
real 0m15.488s
user 0m26.418s
sys 0m3.454s
0
XGBoost4j-Spark train failed on the CPU hosts.
ENVS:
1. OS: Ubuntu 22.04 / NGC
2. Spark version: 3.5.1
3. XGBoost4j-Spark: xgboost4j-spark-gpu_2.12-2.2.0-SNAPSHOT.jar
4. rapids-4-spark: 24.12.0-SNAPSHOT
5. Failed test: agaricus train
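For context, the AnalysisException above says the estimator looked for a column named `features`, while the CSV only provides scalar columns `feature_0` .. `feature_N`. A minimal workaround sketch, assuming a plain CPU Spark session (`spark`) and a hypothetical `inputPath`, is to assemble those scalar columns into a single vector column before calling fit, e.g. with Spark ML's `VectorAssembler`:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// `spark` (SparkSession) and `inputPath` are assumed, not from the log.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(inputPath)

// Collect the scalar feature columns reported in the error suggestion.
val featureCols = df.columns.filter(_.startsWith("feature_"))

// Assemble them into the vector column the estimator resolves via
// setFeaturesCol (default name: "features").
val assembled = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(df)

val model = new XGBoostClassifier(Map("objective" -> "binary:logistic"))
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(assembled)
```

This is only a sketch of the classic vector-column path; whether the 2.2.0-SNAPSHOT estimator is instead expected to accept the raw `feature_*` columns directly on CPU (as the GPU plugin path does) is the open question of this report.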