Closed · karanveersingh5623 closed this issue 1 year ago
@tgrel, please let me know if any further info is required.
@tgrel, even without changing any scripts and running with the day_* files, it is still failing. Do I have to build the Docker image for T4 GPUs?
(rapids) root@00cae536f649:/workspace/dlrm/preproc# ./prepare_dataset.sh 2 GPU Spark
+ ls -ltrash
total 124K
4.0K -rw-r--r--. 1 root root 1.6K Apr 20 11:06 NVT_shuffle_spark.py
4.0K -rw-r--r--. 1 root root 1.3K Apr 20 11:06 DGX-A100_config.sh
4.0K -rw-r--r--. 1 root root 1.3K Apr 20 11:06 DGX-2_config.sh
8.0K -rw-r--r--. 1 root root 5.2K Apr 20 11:06 split_dataset.py
4.0K -rwxr-xr-x. 1 root root 1.1K Apr 20 11:06 run_spark.sh
8.0K -rw-r--r--. 1 root root 7.6K Apr 20 11:06 run_spark_gpu_DGX-A100.sh
8.0K -rwxr-xr-x. 1 root root 7.6K Apr 20 11:06 run_spark_gpu_DGX-2.sh
8.0K -rwxr-xr-x. 1 root root 5.7K Apr 20 11:06 run_spark_cpu.sh
4.0K -rwxr-xr-x. 1 root root 3.1K Apr 20 11:06 run_NVTabular.sh
12K -rw-r--r--. 1 root root 11K Apr 20 11:06 preproc_NVTabular.py
4.0K -rw-r--r--. 1 root root 3.4K Apr 20 11:06 parquet_to_binary.py
0 drwxr-xr-x 1 root root 21 Jun 13 05:26 ..
4.0K -rwxrwxrwx. 1 root root 3.0K Jun 13 09:55 prepare_dataset.sh
8.0K -rw-r--r-- 1 root root 6.6K Jun 14 01:08 submit_train_log.txt
20K -rw-r--r-- 1 root root 18K Jun 14 01:24 submit_dict_log.txt
20K -rw-r--r-- 1 root root 20K Jun 14 01:48 spark_data_utils.py
4.0K -rwxr-xr-x 1 root root 1.1K Jun 14 01:48 verify_criteo_downloaded.sh
0 drwxr-xr-x. 1 root root 149 Jun 14 01:48 .
+ rm -rf /data/dlrm/spark
+ rm -rf /data/dlrm/intermediate_binary
+ rm -rf /data/dlrm/output
+ rm -rf /data/dlrm/criteo_parquet
+ rm -rf /data/dlrm/binary_dataset
+ download_dir=/data/dlrm
+ ./verify_criteo_downloaded.sh /data/dlrm
++ download_dir=/data/dlrm
++ cd /data/dlrm
+++ seq 0 23
++ for i in $(seq 0 23)
++ filename=day_0
++ '[' -f day_0 ']'
++ echo 'day_0 exists, OK'
day_0 exists, OK
[... identical checks for day_1 through day_22, each printing "day_N exists, OK" ...]
++ for i in $(seq 0 23)
++ filename=day_23
++ '[' -f day_23 ']'
++ echo 'day_23 exists, OK'
day_23 exists, OK
++ cd -
/workspace/dlrm/preproc
++ echo 'Criteo data verified'
Criteo data verified
+ output_path=/data/dlrm/output
+ '[' Spark = NVTabular ']'
+ '[' -f /data/dlrm/output/train/_SUCCESS ']'
+ echo 'Performing spark preprocessing'
Performing spark preprocessing
+ ./run_spark.sh GPU /data/dlrm /data/dlrm/output 2
Input mode option: GPU
Run with GPU.
Starting spark standalone
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-00cae536f649.out
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-00cae536f649.out
Generating the dictionary...
23/06/14 08:16:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/06/14 08:16:08 INFO SparkContext: Running Spark version 3.0.1
23/06/14 08:16:08 INFO ResourceUtils: ==============================================================
23/06/14 08:16:08 INFO ResourceUtils: Resources for spark.driver:
23/06/14 08:16:08 INFO ResourceUtils: ==============================================================
23/06/14 08:16:08 INFO SparkContext: Submitted application: spark_data_utils.py
23/06/14 08:16:08 INFO SecurityManager: Changing view acls to: root
23/06/14 08:16:08 INFO SecurityManager: Changing modify acls to: root
23/06/14 08:16:08 INFO SecurityManager: Changing view acls groups to:
23/06/14 08:16:08 INFO SecurityManager: Changing modify acls groups to:
23/06/14 08:16:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
23/06/14 08:16:09 INFO Utils: Successfully started service 'sparkDriver' on port 34076.
23/06/14 08:16:09 INFO SparkEnv: Registering MapOutputTracker
23/06/14 08:16:09 INFO SparkEnv: Registering BlockManagerMaster
23/06/14 08:16:09 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/06/14 08:16:09 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/06/14 08:16:09 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/06/14 08:16:09 INFO DiskBlockManager: Created local directory at /data/dlrm/spark/tmp/blockmgr-a83c50b0-6810-4489-9ac2-844b63b09261
23/06/14 08:16:09 INFO MemoryStore: MemoryStore started with capacity 16.9 GiB
23/06/14 08:16:09 INFO SparkEnv: Registering OutputCommitCoordinator
23/06/14 08:16:09 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/06/14 08:16:09 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://00cae536f649:4040
23/06/14 08:16:09 INFO DriverPluginContainer: Initialized driver component for plugin com.nvidia.spark.SQLPlugin.
23/06/14 08:16:09 WARN SparkContext: The configuration of resource: gpu (exec = 1, task = 1/100, runnable tasks = 100) will result in wasted resources due to resource CPU limiting the number of runnable tasks per executor to: 5. Please adjust your configuration.
23/06/14 08:16:09 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://00cae536f649:7077...
23/06/14 08:16:09 INFO TransportClientFactory: Successfully created connection to 00cae536f649/172.17.0.2:7077 after 47 ms (0 ms spent in bootstraps)
23/06/14 08:16:10 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20230614081610-0000
23/06/14 08:16:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38388.
23/06/14 08:16:10 INFO NettyBlockTransferService: Server created on 00cae536f649:38388
23/06/14 08:16:10 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/06/14 08:16:10 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO BlockManagerMasterEndpoint: Registering block manager 00cae536f649:38388 with 16.9 GiB RAM, BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
23/06/14 08:16:10 WARN SQLExecPlugin: Installing extensions to enable rapids GPU SQL support. To disable GPU support set `spark.rapids.sql.enabled` to false
23/06/14 08:16:10 INFO ShimLoader: Loading shim for Spark version: 3.0.1
23/06/14 08:16:10 INFO ShimLoader: Found shims: List(com.nvidia.spark.rapids.shims.spark301.SparkShimServiceProvider@abcbf79)
23/06/14 08:16:10 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true
23/06/14 08:16:10 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/workspace/dlrm/preproc/spark-warehouse').
23/06/14 08:16:10 INFO SharedState: Warehouse path is 'file:/workspace/dlrm/preproc/spark-warehouse'.
23/06/14 08:16:11 INFO InMemoryFileIndex: It took 54 ms to list leaf files for 24 paths.
23/06/14 08:16:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/06/14 08:16:14 INFO FileSourceStrategy: Pruning directories with:
23/06/14 08:16:14 INFO FileSourceStrategy: Pushed Filters:
23/06/14 08:16:14 INFO FileSourceStrategy: Post-Scan Filters:
23/06/14 08:16:14 INFO FileSourceStrategy: Output Data Schema: struct<_c14: string, _c15: string, _c16: string, _c17: string, _c18: string ... 24 more fields>
23/06/14 08:16:14 INFO HiveConf: Found configuration file null
23/06/14 08:16:14 WARN GpuOverrides:
*Exec <DataWritingCommandExec> will run on GPU
*Output <InsertIntoHadoopFsRelationCommand> will run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> (monotonically_increasing_id() - cast(shiftleft(part_id#100, 33) as bigint)) AS mono_id#105L will run on GPU
*Expression <Subtract> (monotonically_increasing_id() - cast(shiftleft(part_id#100, 33) as bigint)) will run on GPU
*Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
*Expression <Cast> cast(shiftleft(part_id#100, 33) as bigint) will run on GPU
*Expression <ShiftLeft> shiftleft(part_id#100, 33) will run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> SPARK_PARTITION_ID() AS part_id#100 will run on GPU
*Expression <SparkPartitionID> SPARK_PARTITION_ID() will run on GPU
*Exec <SortExec> will run on GPU
*Expression <SortOrder> column_id#84 ASC NULLS FIRST will run on GPU
*Expression <SortOrder> count#93L DESC NULLS LAST will run on GPU
*Exec <ShuffleExchangeExec> will run on GPU
*Partitioning <RangePartitioning> will run on GPU
*Expression <SortOrder> column_id#84 ASC NULLS FIRST will run on GPU
*Expression <SortOrder> count#93L DESC NULLS LAST will run on GPU
*Exec <FilterExec> will run on GPU
*Expression <Or> (NOT column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) OR (count#93L >= 2)) will run on GPU
*Expression <Not> NOT column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) will run on GPU
*Expression <InSet> column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) will run on GPU
*Expression <GreaterThanOrEqual> (count#93L >= 2) will run on GPU
*Exec <HashAggregateExec> will run on GPU
*Expression <AggregateExpression> count(1) will run on GPU
*Expression <Count> count(1) will run on GPU
*Expression <Alias> count(1)#92L AS count#93L will run on GPU
*Exec <ShuffleExchangeExec> will run on GPU
*Partitioning <HashPartitioning> will run on GPU
*Exec <HashAggregateExec> will run on GPU
*Expression <AggregateExpression> partial_count(1) will run on GPU
*Expression <Count> count(1) will run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> pos#80 AS column_id#84 will run on GPU
*Expression <Alias> col#81 AS data#87 will run on GPU
*Exec <FilterExec> will run on GPU
*Expression <IsNotNull> isnotnull(col#81) will run on GPU
*Exec <GenerateExec> will run on GPU
*Exec <FileSourceScanExec> will run on GPU
23/06/14 08:16:15 INFO GpuParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 08:16:15 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/14 08:16:15 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 08:16:15 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/14 08:16:15 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 08:16:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 401.6 KiB, free 16.9 GiB)
23/06/14 08:16:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KiB, free 16.9 GiB)
23/06/14 08:16:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 00cae536f649:38388 (size: 24.9 KiB, free: 16.9 GiB)
23/06/14 08:16:15 INFO SparkContext: Created broadcast 0 from broadcast at GpuReadCSVFileFormat.scala:46
23/06/14 08:16:15 INFO GpuFileSourceScanExec: Planning scan with bin packing, max size: 1073741824 bytes, open cost is considered as scanning 4194304 bytes.
23/06/14 08:16:15 INFO GpuFileSourceScanExec: Using the original per file parquet reader
23/06/14 08:16:16 INFO CodeGenerator: Code generated in 198.599907 ms
23/06/14 08:16:16 INFO SparkContext: Starting job: collect at GpuRangePartitioner.scala:46
23/06/14 08:16:16 INFO DAGScheduler: Registering RDD 7 (executeColumnar at GpuShuffleCoalesceExec.scala:67) as input to shuffle 0
23/06/14 08:16:16 INFO DAGScheduler: Got job 0 (collect at GpuRangePartitioner.scala:46) with 48 output partitions
23/06/14 08:16:16 INFO DAGScheduler: Final stage: ResultStage 1 (collect at GpuRangePartitioner.scala:46)
23/06/14 08:16:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
23/06/14 08:16:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
23/06/14 08:16:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at executeColumnar at GpuShuffleCoalesceExec.scala:67), which has no missing parents
23/06/14 08:16:16 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 29.9 KiB, free 16.9 GiB)
23/06/14 08:16:16 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 13.0 KiB, free 16.9 GiB)
23/06/14 08:16:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 00cae536f649:38388 (size: 13.0 KiB, free: 16.9 GiB)
23/06/14 08:16:16 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1223
23/06/14 08:16:16 INFO DAGScheduler: Submitting 1037 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at executeColumnar at GpuShuffleCoalesceExec.scala:67) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
23/06/14 08:16:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1037 tasks
23/06/14 08:16:31 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:16:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:17:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:17:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:17:31 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I rebuilt the Docker image with `--build-arg NUMBER_OF_GPUS=2`, and now it's working:

docker build -t nvidia_dlrm_preprocessing -f Dockerfile_preprocessing . --build-arg NUMBER_OF_GPUS=2

Please don't close the issue yet; I'm waiting for the dataset preprocessing to complete.
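For anyone hitting the same "Initial job has not accepted any resources" warnings, the fix above can be sketched as follows (this assumes the repo's `Dockerfile_preprocessing` and a 2-GPU host; the sanity check is my own addition, not part of the repo's instructions):

```shell
# Rebuild the preprocessing image so the baked-in Spark worker config
# matches the number of GPUs actually present on the host.
docker build -t nvidia_dlrm_preprocessing \
    -f Dockerfile_preprocessing . \
    --build-arg NUMBER_OF_GPUS=2

# Before rerunning prepare_dataset.sh, confirm the container sees both GPUs:
docker run --rm --gpus all nvidia_dlrm_preprocessing nvidia-smi -L
```

If the worker was configured for more GPUs than the host exposes, executors never register with the master, which is exactly what the repeated "check your cluster UI" warnings indicate.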
@tgrel, everything was running fine, but it failed at the end with a Spark Java error:
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece50 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece76 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
[... many more broadcast_199 pieces added, each 4.0 MiB ...]
23/06/14 09:00:51 INFO TaskSetManager: Starting task 2.2 in stage 110.0 (TID 370, 172.17.0.2, executor 4, partition 2, PROCESS_LOCAL, 7748 bytes)
23/06/14 09:00:51 WARN TaskSetManager: Lost task 9.2 in stage 110.0 (TID 360, 172.17.0.2, executor 4): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
at ai.rapids.cudf.Rmm.allocInternal(Native Method)
at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
23/06/14 09:00:52 INFO TaskSetManager: Starting task 9.3 in stage 110.0 (TID 371, 172.17.0.2, executor 5, partition 9, PROCESS_LOCAL, 7748 bytes)
23/06/14 09:00:52 INFO TaskSetManager: Lost task 1.3 in stage 110.0 (TID 367) on 172.17.0.2, executor 5: java.lang.OutOfMemoryError (Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded) [duplicate 1]
23/06/14 09:00:52 ERROR TaskSetManager: Task 1 in stage 110.0 failed 4 times; aborting job
23/06/14 09:00:52 INFO TaskSchedulerImpl: Cancelling stage 110
23/06/14 09:00:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 110: Stage cancelled
23/06/14 09:00:52 INFO TaskSchedulerImpl: Stage 110 was cancelled
23/06/14 09:00:52 INFO DAGScheduler: ResultStage 110 (collect at GpuRangePartitioner.scala:46) failed in 158.971 s due to Job aborted due to stage failure: Task 1 in stage 110.0 failed 4 times, most recent failure: Lost task 1.3 in stage 110.0 (TID 367, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
	[... same RMM OutOfMemoryError stack trace as above ...]
Driver stacktrace:
23/06/14 09:00:52 INFO DAGScheduler: Job 84 failed: collect at GpuRangePartitioner.scala:46, took 158.987766 s
23/06/14 09:00:52 ERROR GpuFileFormatWriter: Aborting job 7868955a-2f77-4dd0-8a56-d6cce8f52431.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 110.0 failed 4 times, most recent failure: Lost task 1.3 in stage 110.0 (TID 367, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
	[... same RMM OutOfMemoryError stack trace as above ...]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
[... identical frames elided; same executor stack trace as the failure above ...]
Traceback (most recent call last):
File "/workspace/dlrm/preproc/spark_data_utils.py", line 506, in <module>
_main()
File "/workspace/dlrm/preproc/spark_data_utils.py", line 499, in _main
partitionBy=partitionBy)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 936, in parquet
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1185.parquet.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:250)
at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 110.0 failed 4 times, most recent failure: Lost task 1.3 in stage 110.0 (TID 367, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
[... remainder of the log repeats the executor and driver stack traces shown above; log truncated mid-trace ...]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
23/06/14 09:00:52 INFO SparkContext: Invoking stop() from shutdown hook
23/06/14 09:00:52 INFO SparkUI: Stopped Spark web UI at http://c1ee6b3759ad:4040
23/06/14 09:00:52 INFO StandaloneSchedulerBackend: Shutting down all executors
23/06/14 09:00:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/06/14 09:00:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/06/14 09:00:52 INFO MemoryStore: MemoryStore cleared
23/06/14 09:00:52 INFO BlockManager: BlockManager stopped
23/06/14 09:00:52 INFO BlockManagerMaster: BlockManagerMaster stopped
23/06/14 09:00:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/06/14 09:00:52 INFO SparkContext: Successfully stopped SparkContext
23/06/14 09:00:52 INFO ShutdownHookManager: Shutdown hook called
23/06/14 09:00:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-fa523520-3b66-4941-912d-976228e809cb
23/06/14 09:00:52 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-8c50cf5d-7a54-43c2-8461-259668e7e0a9
23/06/14 09:00:52 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-8c50cf5d-7a54-43c2-8461-259668e7e0a9/pyspark-ff72a275-b26a-4521-9c8e-d510efd54eaf
stopping org.apache.spark.deploy.master.Master
stopping org.apache.spark.deploy.worker.Worker
+ preprocessing_version=Spark
+ conversion_intermediate_dir=/data/dlrm/intermediate_binary
+ final_output_dir=/data/dlrm/binary_dataset
+ source DGX-2_config.sh
++ export TOTAL_CORES=80
++ TOTAL_CORES=80
++ export NUM_EXECUTORS=16
++ NUM_EXECUTORS=16
++ export NUM_EXECUTOR_CORES=5
++ NUM_EXECUTOR_CORES=5
++ export TOTAL_MEMORY=800
++ TOTAL_MEMORY=800
++ export DRIVER_MEMORY=32
++ DRIVER_MEMORY=32
++ export EXECUTOR_MEMORY=32
++ EXECUTOR_MEMORY=32
+ '[' -d /data/dlrm/binary_dataset/train ']'
+ echo 'Performing final conversion to a custom data format'
Performing final conversion to a custom data format
+ python parquet_to_binary.py --parallel_jobs 80 --src_dir /data/dlrm/output --intermediate_dir /data/dlrm/intermediate_binary --dst_dir /data/dlrm/binary_dataset
Processing train files...
0it [00:00, ?it/s]
Train files conversion done
Processing test files...
0it [00:00, ?it/s]
Test files conversion done
Processing validation files...
0it [00:00, ?it/s]
Validation files conversion done
Concatenating train files
cat: '/data/dlrm/intermediate_binary/train/*.bin': No such file or directory
Concatenating test files
cat: '/data/dlrm/intermediate_binary/test/*.bin': No such file or directory
Concatenating validation files
cat: '/data/dlrm/intermediate_binary/validation/*.bin': No such file or directory
Done
+ cp /data/dlrm/output/model_size.json /data/dlrm/binary_dataset/model_size.json
+ python split_dataset.py --dataset /data/dlrm/binary_dataset --output /data/dlrm/binary_dataset/split
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
+ rm /data/dlrm/binary_dataset/train_data.bin
+ rm /data/dlrm/binary_dataset/validation_data.bin
+ rm /data/dlrm/binary_dataset/test_data.bin
+ rm /data/dlrm/binary_dataset/model_size.json
+ mv /data/dlrm/binary_dataset/split/feature_spec.yaml /data/dlrm/binary_dataset/split/test /data/dlrm/binary_dataset/split/train /data/dlrm/binary_dataset/split/validation /data/dlrm/binary_dataset
+ rm -rf /data/dlrm/binary_dataset/split
+ echo 'Done preprocessing the Criteo Kaggle Dataset'
Done preprocessing the Criteo Kaggle Dataset
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/*
24K /mnt/dlrm/dataset_criterio/binary_dataset
47G /mnt/dlrm/dataset_criterio/day_0
48G /mnt/dlrm/dataset_criterio/day_1
44G /mnt/dlrm/dataset_criterio/day_10
37G /mnt/dlrm/dataset_criterio/day_11
40G /mnt/dlrm/dataset_criterio/day_12
46G /mnt/dlrm/dataset_criterio/day_13
46G /mnt/dlrm/dataset_criterio/day_14
45G /mnt/dlrm/dataset_criterio/day_15
43G /mnt/dlrm/dataset_criterio/day_16
39G /mnt/dlrm/dataset_criterio/day_17
34G /mnt/dlrm/dataset_criterio/day_18
37G /mnt/dlrm/dataset_criterio/day_19
47G /mnt/dlrm/dataset_criterio/day_2
46G /mnt/dlrm/dataset_criterio/day_20
46G /mnt/dlrm/dataset_criterio/day_21
45G /mnt/dlrm/dataset_criterio/day_22
43G /mnt/dlrm/dataset_criterio/day_23
43G /mnt/dlrm/dataset_criterio/day_3
36G /mnt/dlrm/dataset_criterio/day_4
41G /mnt/dlrm/dataset_criterio/day_5
49G /mnt/dlrm/dataset_criterio/day_6
48G /mnt/dlrm/dataset_criterio/day_7
46G /mnt/dlrm/dataset_criterio/day_8
47G /mnt/dlrm/dataset_criterio/day_9
16K /mnt/dlrm/dataset_criterio/intermediate_binary
3.9G /mnt/dlrm/dataset_criterio/output
8.0K /mnt/dlrm/dataset_criterio/spark
[root@hpc-wifi ~]# ll /mnt/dlrm/dataset_criterio/binary_dataset/
total 20
-rw-r--r-- 1 root root 7241 Jun 14 18:01 feature_spec.yaml
drwxr-xr-x 2 root root 4096 Jun 14 18:01 test
drwxr-xr-x 2 root root 4096 Jun 14 18:01 train
drwxr-xr-x 2 root root 4096 Jun 14 18:01 validation
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/binary_dataset/*
8.0K /mnt/dlrm/dataset_criterio/binary_dataset/feature_spec.yaml
4.0K /mnt/dlrm/dataset_criterio/binary_dataset/test
4.0K /mnt/dlrm/dataset_criterio/binary_dataset/train
4.0K /mnt/dlrm/dataset_criterio/binary_dataset/validation
Memory issue. How can I control Spark's memory allocations? My GPUs are 2x T4 (16 GB each).
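For reference, one way to approach the question above is to reduce the cluster sizing that `prepare_dataset.sh` sources from `DGX-2_config.sh` (16 executors, 800 GB total memory) down to something matching a 2x T4 node. The values below are a hypothetical `local_T4_config.sh` sketch to be tuned, not a verified configuration; the commented RAPIDS settings are documented knobs of the Spark RAPIDS plugin that limit GPU memory pressure.

```shell
# Hypothetical local_T4_config.sh: a reduced-resource variant of
# DGX-2_config.sh for a 2x Tesla T4 (16 GB each) machine.
# All values are starting-point guesses and should be tuned.
export TOTAL_CORES=16
export NUM_EXECUTORS=2        # one executor per GPU
export NUM_EXECUTOR_CORES=8
export TOTAL_MEMORY=64        # GB of host RAM given to Spark overall
export DRIVER_MEMORY=16
export EXECUTOR_MEMORY=24

# Spark RAPIDS plugin settings worth experimenting with (see the
# plugin's configuration docs), e.g. passed as --conf to spark-submit:
#   spark.rapids.memory.gpu.allocFraction   fraction of GPU memory pooled by RMM
#   spark.rapids.sql.concurrentGpuTasks     tasks allowed on a GPU at once (try 1)
#   spark.sql.shuffle.partitions            more partitions -> smaller batches per task
```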
Hi @karanveersingh5623, this looks like a CPU out-of-memory issue. Could you please post the output of the following commands so that I can verify your hardware setup?
free -mh
lscpu
nvidia-smi
Thank you, Tomasz
@tgrel, thanks for replying. I will start the GPU run again and send you those details, but before that, here is the output from the CPU preprocess. I don't know why train ended up 0 bytes, with no .bin files :(
23/06/14 13:01:51 INFO TaskSetManager: Starting task 24.0 in stage 130.0 (TID 741, 172.17.0.2, executor 1, partition 24, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 10.0 in stage 130.0 (TID 727) in 20099 ms on 172.17.0.2 (executor 1) (5/30)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 25.0 in stage 130.0 (TID 742, 172.17.0.2, executor 0, partition 25, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 26.0 in stage 130.0 (TID 743, 172.17.0.2, executor 1, partition 26, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 19.0 in stage 130.0 (TID 736) in 20249 ms on 172.17.0.2 (executor 0) (6/30)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 0.0 in stage 130.0 (TID 717) in 20252 ms on 172.17.0.2 (executor 1) (7/30)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 27.0 in stage 130.0 (TID 744, 172.17.0.2, executor 1, partition 27, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 4.0 in stage 130.0 (TID 721) in 20429 ms on 172.17.0.2 (executor 1) (8/30)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 28.0 in stage 130.0 (TID 745, 172.17.0.2, executor 1, partition 28, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 12.0 in stage 130.0 (TID 729) in 20523 ms on 172.17.0.2 (executor 1) (9/30)
23/06/14 13:01:52 INFO TaskSetManager: Starting task 29.0 in stage 130.0 (TID 746, 172.17.0.2, executor 1, partition 29, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 18.0 in stage 130.0 (TID 735) in 20672 ms on 172.17.0.2 (executor 1) (10/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 13.0 in stage 130.0 (TID 730) in 21198 ms on 172.17.0.2 (executor 0) (11/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 1.0 in stage 130.0 (TID 718) in 21408 ms on 172.17.0.2 (executor 0) (12/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 2.0 in stage 130.0 (TID 719) in 21459 ms on 172.17.0.2 (executor 1) (13/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 5.0 in stage 130.0 (TID 722) in 21538 ms on 172.17.0.2 (executor 0) (14/30)
23/06/14 13:01:53 INFO TaskSetManager: Finished task 7.0 in stage 130.0 (TID 724) in 21605 ms on 172.17.0.2 (executor 0) (15/30)
23/06/14 13:01:53 INFO TaskSetManager: Finished task 11.0 in stage 130.0 (TID 728) in 21796 ms on 172.17.0.2 (executor 0) (16/30)
23/06/14 13:01:53 INFO TaskSetManager: Finished task 15.0 in stage 130.0 (TID 732) in 21931 ms on 172.17.0.2 (executor 0) (17/30)
23/06/14 13:01:54 INFO TaskSetManager: Finished task 9.0 in stage 130.0 (TID 726) in 22686 ms on 172.17.0.2 (executor 0) (18/30)
23/06/14 13:01:54 INFO TaskSetManager: Finished task 16.0 in stage 130.0 (TID 733) in 23062 ms on 172.17.0.2 (executor 1) (19/30)
23/06/14 13:01:54 INFO TaskSetManager: Finished task 3.0 in stage 130.0 (TID 720) in 23107 ms on 172.17.0.2 (executor 0) (20/30)
23/06/14 13:02:07 INFO TaskSetManager: Finished task 20.0 in stage 130.0 (TID 737) in 18266 ms on 172.17.0.2 (executor 1) (21/30)
23/06/14 13:02:08 INFO TaskSetManager: Finished task 22.0 in stage 130.0 (TID 739) in 17050 ms on 172.17.0.2 (executor 1) (22/30)
23/06/14 13:02:08 INFO TaskSetManager: Finished task 27.0 in stage 130.0 (TID 744) in 17084 ms on 172.17.0.2 (executor 1) (23/30)
23/06/14 13:02:09 INFO TaskSetManager: Finished task 23.0 in stage 130.0 (TID 740) in 18019 ms on 172.17.0.2 (executor 0) (24/30)
23/06/14 13:02:09 INFO TaskSetManager: Finished task 21.0 in stage 130.0 (TID 738) in 18957 ms on 172.17.0.2 (executor 1) (25/30)
23/06/14 13:02:09 INFO TaskSetManager: Finished task 25.0 in stage 130.0 (TID 742) in 18253 ms on 172.17.0.2 (executor 0) (26/30)
23/06/14 13:02:10 INFO TaskSetManager: Finished task 26.0 in stage 130.0 (TID 743) in 18857 ms on 172.17.0.2 (executor 1) (27/30)
23/06/14 13:02:10 INFO TaskSetManager: Finished task 28.0 in stage 130.0 (TID 745) in 18689 ms on 172.17.0.2 (executor 1) (28/30)
23/06/14 13:02:12 INFO TaskSetManager: Finished task 24.0 in stage 130.0 (TID 741) in 20546 ms on 172.17.0.2 (executor 1) (29/30)
23/06/14 13:02:12 INFO TaskSetManager: Finished task 29.0 in stage 130.0 (TID 746) in 20321 ms on 172.17.0.2 (executor 1) (30/30)
23/06/14 13:02:12 INFO TaskSchedulerImpl: Removed TaskSet 130.0, whose tasks have all completed, from pool
23/06/14 13:02:12 INFO DAGScheduler: ResultStage 130 (parquet at NativeMethodAccessorImpl.java:0) finished in 41.026 s
23/06/14 13:02:12 INFO DAGScheduler: Job 79 is finished. Cancelling potential speculative or zombie tasks for this job
23/06/14 13:02:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 130: Stage finished
23/06/14 13:02:12 INFO DAGScheduler: Job 79 finished: parquet at NativeMethodAccessorImpl.java:0, took 93.207300 s
23/06/14 13:02:12 INFO FileFormatWriter: Write Job 41f22ef5-860a-4b9a-b4d5-520892209a9d committed.
23/06/14 13:02:12 INFO FileFormatWriter: Finished processing stats for write job 41f22ef5-860a-4b9a-b4d5-520892209a9d.
====================================================================================================
{'transform': 543.4746820926666}
23/06/14 13:02:12 INFO SparkContext: Invoking stop() from shutdown hook
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_131_piece0 on c1ee6b3759ad:44033 in memory (size: 5.6 KiB, free: 16.9 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_131_piece0 on 172.17.0.2:41086 in memory (size: 5.6 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_131_piece0 on 172.17.0.2:39997 in memory (size: 5.6 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_146_piece0 on c1ee6b3759ad:44033 in memory (size: 5.6 KiB, free: 16.9 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_146_piece0 on 172.17.0.2:41086 in memory (size: 5.6 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_197_piece0 on c1ee6b3759ad:44033 in memory (size: 43.0 KiB, free: 16.9 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_197_piece0 on 172.17.0.2:41086 in memory (size: 43.0 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_197_piece0 on 172.17.0.2:39997 in memory (size: 43.0 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO SparkUI: Stopped Spark web UI at http://c1ee6b3759ad:4040
23/06/14 13:02:12 INFO StandaloneSchedulerBackend: Shutting down all executors
23/06/14 13:02:12 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/06/14 13:02:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/06/14 13:02:12 INFO MemoryStore: MemoryStore cleared
23/06/14 13:02:12 INFO BlockManager: BlockManager stopped
23/06/14 13:02:12 INFO BlockManagerMaster: BlockManagerMaster stopped
23/06/14 13:02:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/06/14 13:02:12 INFO SparkContext: Successfully stopped SparkContext
23/06/14 13:02:12 INFO ShutdownHookManager: Shutdown hook called
23/06/14 13:02:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-131ba254-ca77-4d3f-9dba-9f92bc2cab71
23/06/14 13:02:12 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-679ccf26-43ae-445d-ba62-1adc004b4e3e/pyspark-2f7b78df-2566-4a6d-91bf-67b6ab5c1385
23/06/14 13:02:12 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-679ccf26-43ae-445d-ba62-1adc004b4e3e
+ preprocessing_version=Spark
+ conversion_intermediate_dir=/data/dlrm/intermediate_binary
+ final_output_dir=/data/dlrm/binary_dataset
+ source DGX-2_config.sh
++ export TOTAL_CORES=80
++ TOTAL_CORES=80
++ export NUM_EXECUTORS=16
++ NUM_EXECUTORS=16
++ export NUM_EXECUTOR_CORES=5
++ NUM_EXECUTOR_CORES=5
++ export TOTAL_MEMORY=800
++ TOTAL_MEMORY=800
++ export DRIVER_MEMORY=32
++ DRIVER_MEMORY=32
++ export EXECUTOR_MEMORY=32
++ EXECUTOR_MEMORY=32
+ '[' -d /data/dlrm/binary_dataset/train ']'
+ echo 'Performing final conversion to a custom data format'
Performing final conversion to a custom data format
+ python parquet_to_binary.py --parallel_jobs 80 --src_dir /data/dlrm/output --intermediate_dir /data/dlrm/intermediate_binary --dst_dir /data/dlrm/binary_dataset
Processing train files...
0it [00:00, ?it/s]
Train files conversion done
Processing test files...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 50963.60it/s]
Test files conversion done
Processing validation files...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 14993.94it/s]
Validation files conversion done
Concatenating train files
cat: '/data/dlrm/intermediate_binary/train/*.bin': No such file or directory
Concatenating test files
Concatenating validation files
Done
+ cp /data/dlrm/output/model_size.json /data/dlrm/binary_dataset/model_size.json
+ python split_dataset.py --dataset /data/dlrm/binary_dataset --output /data/dlrm/binary_dataset/split
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2721/2721 [00:20<00:00, 132.62it/s]
0it [00:00, ?it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2721/2721 [00:20<00:00, 130.68it/s]
+ rm /data/dlrm/binary_dataset/train_data.bin
+ rm /data/dlrm/binary_dataset/validation_data.bin
+ rm /data/dlrm/binary_dataset/test_data.bin
+ rm /data/dlrm/binary_dataset/model_size.json
+ mv /data/dlrm/binary_dataset/split/feature_spec.yaml /data/dlrm/binary_dataset/split/test /data/dlrm/binary_dataset/split/train /data/dlrm/binary_dataset/split/validation /data/dlrm/binary_dataset
+ rm -rf /data/dlrm/binary_dataset/split
+ echo 'Done preprocessing the Criteo Kaggle Dataset'
Done preprocessing the Criteo Kaggle Dataset
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/*
15G /mnt/dlrm/dataset_criterio/binary_dataset
47G /mnt/dlrm/dataset_criterio/day_0
48G /mnt/dlrm/dataset_criterio/day_1
44G /mnt/dlrm/dataset_criterio/day_10
37G /mnt/dlrm/dataset_criterio/day_11
40G /mnt/dlrm/dataset_criterio/day_12
46G /mnt/dlrm/dataset_criterio/day_13
46G /mnt/dlrm/dataset_criterio/day_14
45G /mnt/dlrm/dataset_criterio/day_15
43G /mnt/dlrm/dataset_criterio/day_16
39G /mnt/dlrm/dataset_criterio/day_17
34G /mnt/dlrm/dataset_criterio/day_18
37G /mnt/dlrm/dataset_criterio/day_19
47G /mnt/dlrm/dataset_criterio/day_2
46G /mnt/dlrm/dataset_criterio/day_20
46G /mnt/dlrm/dataset_criterio/day_21
45G /mnt/dlrm/dataset_criterio/day_22
43G /mnt/dlrm/dataset_criterio/day_23
43G /mnt/dlrm/dataset_criterio/day_3
36G /mnt/dlrm/dataset_criterio/day_4
41G /mnt/dlrm/dataset_criterio/day_5
49G /mnt/dlrm/dataset_criterio/day_6
48G /mnt/dlrm/dataset_criterio/day_7
46G /mnt/dlrm/dataset_criterio/day_8
47G /mnt/dlrm/dataset_criterio/day_9
27G /mnt/dlrm/dataset_criterio/intermediate_binary
20G /mnt/dlrm/dataset_criterio/output
12K /mnt/dlrm/dataset_criterio/spark
[root@hpc-wifi ~]# ll /mnt/dlrm/dataset_criterio/intermediate_binary/
total 12
drwxr-xr-x 2 root root 4096 Jun 14 22:02 test
drwxr-xr-x 2 root root 4096 Jun 14 22:02 train
drwxr-xr-x 2 root root 4096 Jun 14 22:03 validation
[root@hpc-wifi ~]# ll /mnt/dlrm/dataset_criterio/intermediate_binary/train/
total 0
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/intermediate_binary/*
14G /mnt/dlrm/dataset_criterio/intermediate_binary/test
4.0K /mnt/dlrm/dataset_criterio/intermediate_binary/train
14G /mnt/dlrm/dataset_criterio/intermediate_binary/validation
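The listing above shows `intermediate_binary/train` is empty even though test and validation converted, which suggests the upstream Spark stage never wrote train parquet files into `/data/dlrm/output/train`. A quick sanity check one could run before `parquet_to_binary.py` (a hypothetical helper; the paths follow the layout used in this thread) is:

```shell
# check_splits: hypothetical helper that warns when a split directory
# contains no parquet output, so the conversion step isn't run on
# empty input. Usage: check_splits /data/dlrm/output
check_splits() {
    local src_dir=$1 split count
    for split in train test validation; do
        count=$(find "$src_dir/$split" -name '*.parquet' 2>/dev/null | wc -l)
        if [ "$count" -eq 0 ]; then
            echo "WARNING: no parquet files in $src_dir/$split"
        else
            echo "$split: $count parquet files"
        fi
    done
}
```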
@tgrel, here are the details you requested.
[root@hpc-wifi ~]# free -mh
total used free shared buff/cache available
Mem: 251G 12G 139G 32G 100G 206G
Swap: 4.0G 1.3G 2.7G
[root@hpc-wifi ~]#
[root@hpc-wifi ~]#
[root@hpc-wifi ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping: 4
CPU MHz: 2100.000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
[root@hpc-wifi ~]#
[root@hpc-wifi ~]#
[root@hpc-wifi ~]# nvidia-smi
Sat Jun 17 18:54:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 45C P0 62W / 70W | 14117MiB / 15360MiB | 78% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:AF:00.0 Off | 0 |
| N/A 47C P0 35W / 70W | 14117MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 60506 C ...-8-openjdk-amd64/bin/java 14081MiB |
| 1 N/A N/A 60507 C ...-8-openjdk-amd64/bin/java 14081MiB |
+-----------------------------------------------------------------------------+
@tgrel ... it failed at the same point:
23/06/17 10:19:03 INFO BlockManagerInfo: Added broadcast_199_piece179 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:03 INFO BlockManagerInfo: Added broadcast_199_piece29 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:03 INFO BlockManagerInfo: Added broadcast_199_piece92 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece5 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece117 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece175 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece112 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece101 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece141 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece71 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece189 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece163 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece103 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO TaskSetManager: Starting task 1.2 in stage 110.0 (TID 370, 172.17.0.2, executor 5, partition 1, PROCESS_LOCAL, 7748 bytes)
23/06/17 10:19:04 WARN TaskSetManager: Lost task 7.2 in stage 110.0 (TID 360, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
at ai.rapids.cudf.Rmm.allocInternal(Native Method)
at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
23/06/17 10:19:05 INFO TaskSetManager: Starting task 7.3 in stage 110.0 (TID 371, 172.17.0.2, executor 4, partition 7, PROCESS_LOCAL, 7748 bytes)
23/06/17 10:19:05 INFO TaskSetManager: Lost task 4.2 in stage 110.0 (TID 369) on 172.17.0.2, executor 4: java.lang.OutOfMemoryError (Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded) [duplicate 1]
23/06/17 10:19:08 ERROR TaskSchedulerImpl: Lost executor 5 on 172.17.0.2: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 9.2 in stage 110.0 (TID 364, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 1.2 in stage 110.0 (TID 370, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 3.2 in stage 110.0 (TID 361, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 10.2 in stage 110.0 (TID 363, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 ERROR TaskSetManager: Task 2 in stage 110.0 failed 4 times; aborting job
23/06/17 10:19:08 INFO TaskSchedulerImpl: Cancelling stage 110
23/06/17 10:19:08 INFO TaskSchedulerImpl: Killing all running tasks in stage 110: Stage cancelled
23/06/17 10:19:08 INFO TaskSchedulerImpl: Stage 110 was cancelled
23/06/17 10:19:08 INFO DAGScheduler: ResultStage 110 (collect at GpuRangePartitioner.scala:46) failed in 161.305 s due to Job aborted due to stage failure: Task 2 in stage 110.0 failed 4 times, most recent failure: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
23/06/17 10:19:08 INFO DAGScheduler: Job 84 failed: collect at GpuRangePartitioner.scala:46, took 161.321465 s
23/06/17 10:19:08 INFO DAGScheduler: Executor lost: 5 (epoch 30)
23/06/17 10:19:08 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
23/06/17 10:19:08 ERROR GpuFileFormatWriter: Aborting job 752a47d1-0f88-4475-9599-71fe0c84f8f5.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 110.0 failed 4 times, most recent failure: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
23/06/17 10:19:08 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, 172.17.0.2, 40371, None)
23/06/17 10:19:08 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
23/06/17 10:19:08 INFO DAGScheduler: Shuffle files lost for executor: 5 (epoch 30)
Traceback (most recent call last):
File "/workspace/dlrm/preproc/spark_data_utils.py", line 506, in <module>
_main()
File "/workspace/dlrm/preproc/spark_data_utils.py", line 499, in _main
partitionBy=partitionBy)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 936, in parquet
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1186.parquet.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:250)
at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 110.0 failed 4 times, most recent failure: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
... 39 more
23/06/17 10:19:08 INFO SparkContext: Invoking stop() from shutdown hook
23/06/17 10:19:08 INFO SparkUI: Stopped Spark web UI at http://c1ee6b3759ad:4040
23/06/17 10:19:08 INFO StandaloneSchedulerBackend: Shutting down all executors
23/06/17 10:19:08 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/06/17 10:19:08 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/06/17 10:19:08 INFO MemoryStore: MemoryStore cleared
23/06/17 10:19:08 INFO BlockManager: BlockManager stopped
23/06/17 10:19:08 INFO BlockManagerMaster: BlockManagerMaster stopped
23/06/17 10:19:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/06/17 10:19:08 INFO SparkContext: Successfully stopped SparkContext
23/06/17 10:19:08 INFO ShutdownHookManager: Shutdown hook called
23/06/17 10:19:08 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-530d41b5-b5e3-4cc4-8de0-94d4f44acb29
23/06/17 10:19:08 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-530d41b5-b5e3-4cc4-8de0-94d4f44acb29/pyspark-757ec719-3fdf-48e3-993c-ee7630fbc06c
23/06/17 10:19:08 INFO ShutdownHookManager: Deleting directory /tmp/spark-765d0a7a-b253-4a6c-bde2-094058b42b4b
stopping org.apache.spark.deploy.master.Master
stopping org.apache.spark.deploy.worker.Worker
+ preprocessing_version=Spark
+ conversion_intermediate_dir=/data/dlrm/intermediate_binary
+ final_output_dir=/data/dlrm/binary_dataset
+ source DGX-2_config.sh
++ export TOTAL_CORES=80
++ TOTAL_CORES=80
++ export NUM_EXECUTORS=16
++ NUM_EXECUTORS=16
++ export NUM_EXECUTOR_CORES=5
++ NUM_EXECUTOR_CORES=5
++ export TOTAL_MEMORY=800
++ TOTAL_MEMORY=800
++ export DRIVER_MEMORY=32
++ DRIVER_MEMORY=32
++ export EXECUTOR_MEMORY=32
++ EXECUTOR_MEMORY=32
+ '[' -d /data/dlrm/binary_dataset/train ']'
+ echo 'Performing final conversion to a custom data format'
Performing final conversion to a custom data format
+ python parquet_to_binary.py --parallel_jobs 80 --src_dir /data/dlrm/output --intermediate_dir /data/dlrm/intermediate_binary --dst_dir /data/dlrm/binary_dataset
Processing train files...
0it [00:00, ?it/s]
Train files conversion done
Processing test files...
0it [00:00, ?it/s]
Test files conversion done
Processing validation files...
0it [00:00, ?it/s]
Validation files conversion done
Concatenating train files
cat: '/data/dlrm/intermediate_binary/train/*.bin': No such file or directory
Concatenating test files
cat: '/data/dlrm/intermediate_binary/test/*.bin': No such file or directory
Concatenating validation files
cat: '/data/dlrm/intermediate_binary/validation/*.bin': No such file or directory
Done
+ cp /data/dlrm/output/model_size.json /data/dlrm/binary_dataset/model_size.json
+ python split_dataset.py --dataset /data/dlrm/binary_dataset --output /data/dlrm/binary_dataset/split
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
+ rm /data/dlrm/binary_dataset/train_data.bin
+ rm /data/dlrm/binary_dataset/validation_data.bin
+ rm /data/dlrm/binary_dataset/test_data.bin
+ rm /data/dlrm/binary_dataset/model_size.json
+ mv /data/dlrm/binary_dataset/split/feature_spec.yaml /data/dlrm/binary_dataset/split/test /data/dlrm/binary_dataset/split/train /data/dlrm/binary_dataset/split/validation /data/dlrm/binary_dataset
+ rm -rf /data/dlrm/binary_dataset/split
+ echo 'Done preprocessing the Criteo Kaggle Dataset'
Done preprocessing the Criteo Kaggle Dataset
@tgrel , anything on the above?
Hi @karanveersingh5623,
I think the problem is caused by insufficient memory on your system. I've never tested the Criteo 1TB preprocessing on T4 GPUs, nor on machines with less than 1 TB of CPU memory. Please note this dataset is very large and, unfortunately, the hardware requirements are quite stringent. I am going to spell out those requirements clearly in a future release.
If you're unable to get a machine with more memory and don't need to test for convergence, then I suggest using synthetic data. This should be achievable by passing `--dataset_type synthetic_gpu` to the main training script.
Hi @tgrel , I found this report/thread because I'm having the same Spark resource problem when trying to preprocess the DLRM TF2 example. However, as a proof of process, I've reduced my criteo/day_0...day_23 files so that they only have 1000 lines each, vastly reducing the dataset and making it pseudo-synthetic data, yet I still get the same `WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources`. I was just wondering if the "stringent" hardware requirements that you mentioned above are effectively hard-coded and not based on the size of the input files? I noticed the README.md mentions 3 TB and 4 TB disk space requirements, for example, and I was hoping that reducing the file sizes would reduce the disk and memory requirements to where they would sit comfortably in my environment, but at the moment it seems that this is not the case... Apologies if I'm commenting out of place... Thanks!
Hi @psharpe99,
You are correct. The hardware requirements are indeed specifically tuned to preprocess the full Criteo 1TB dataset. This example doesn't set out to support all possible datasets and hardware platforms automatically. Unfortunately, hand-tuning is required.
If you'd like to run with a different dataset or different hardware, you'll need to modify the hardcoded values to suit your setup. A good place to start would be to add a new config file like this one: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/preproc/DGX-A100_config.sh
and then to rebuild the preprocessing docker image passing the correct config name and other arguments: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/Dockerfile_preprocessing#L18C1-L19C1
Some more detailed hardware configuration can also be found here: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/preproc/run_spark_gpu_DGX-A100.sh#L53 or here: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/preproc/run_NVTabular.sh#L58 depending on whether you want to run with the spark GPU plugin or with NVTabular. Although, I think changing those last two files might not be necessary if you don't need peak performance.
I'll try to update the READMEs to make this information more visible.
Separately, I must admit the error message you got is very misleading. I will see if inconsistencies in hardware config can be detected at runtime and reported more clearly. Thank you for bringing this to my attention.
I hope this helps.
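For reference, a scaled-down config in the style of the `DGX-2_config.sh` shown in the trace above might look like the sketch below. All values here are illustrative placeholders for a hypothetical smaller node, not validated settings; tune them to your own core, GPU, and memory counts.

```shell
# Hypothetical my_small_node_config.sh: illustrative values only, not tested.
export TOTAL_CORES=32          # total CPU cores to give Spark
export NUM_EXECUTORS=4         # typically one executor per GPU
export NUM_EXECUTOR_CORES=$((TOTAL_CORES / NUM_EXECUTORS))
export TOTAL_MEMORY=240        # GB of CPU memory for the whole Spark job
export DRIVER_MEMORY=32        # GB reserved for the driver
# Split the remainder evenly across executors.
export EXECUTOR_MEMORY=$(( (TOTAL_MEMORY - DRIVER_MEMORY) / NUM_EXECUTORS ))
```

The point is only that the per-executor numbers must actually fit on your machine; the `Maximum pool size exceeded` and `ExecutorLostFailure` errors in the log above are what it looks like when they don't.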
Hi @tgrel , thanks :) I tried running with synthetic data and it was fine. Just need some inputs.
root@57532703b646:/workspace/dlrm# python -m dlrm.scripts.main --mode train --dataset_type synthetic_gpu --amp --cuda_graphs --batch_size 131072
Not using distributed mode
DLL 2023-06-24 07:31:26.804642 - PARAMETER logtostderr : False alsologtostderr : False log_dir : v : 0 verbosity : 0 logger_levels : {} stderrthreshold : fatal showprefixforinfo : True run_with_pdb : False pdb_post_mortem : False pdb : False run_with_profiling : False profile_file : None use_cprofile_for_profiling : True only_check_args : False mode : train seed : 12345 batch_size : 131072 test_batch_size : 65536 lr : 24.0 epochs : 1 max_steps : None warmup_factor : 0 warmup_steps : 8000 decay_steps : 24000 decay_start_step : 48000 decay_power : 2 decay_end_lr : 0.0 embedding_type : joint_fused embedding_dim : 128 top_mlp_sizes : [1024, 1024, 512, 256, 1] bottom_mlp_sizes : [512, 256, 128] interaction_op : cuda_dot dataset : None feature_spec : feature_spec.yaml dataset_type : synthetic_gpu shuffle : False shuffle_batch_order : False max_table_size : None hash_indices : False synthetic_dataset_num_entries : 33554432 synthetic_dataset_table_sizes : ['100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000'] synthetic_dataset_numerical_features : 13 synthetic_dataset_use_feature_spec : False load_checkpoint_path : None save_checkpoint_path : None log_path : ./log.json test_freq : None test_after : 0.0 print_freq : 200 benchmark_warmup_steps : 0 base_device : cuda amp : True cuda_graphs : True inference_benchmark_batch_sizes : [1, 64, 4096] inference_benchmark_steps : 200 auc_threshold : None optimized_mlp : True auc_device : GPU backend : nccl bottom_features_ordered : False freeze_mlps : False freeze_embeddings : False Adam_embedding_optimizer : False Adam_MLP_optimizer : False ? : False help : False helpshort : False helpfull : False helpxml : False
W0624 07:31:32.839310 140140320126720 fused_gather_embedding.py:38] Highly specialized embedding for embedding_dim 128
Epoch:[0/1] [200/255] eta: 0:00:10 loss: 0.69671541 step_time: 0.192123 lr: 0.6030
Test: [200/512] step_time: 0.0316
Test: [400/512] step_time: 0.0325
test loss: 0.69313794
Epoch 0 step 253. auc 0.505677
Finished epoch 0 in 0:01:07.
DLL 2023-06-24 07:32:38.290277 - () best_auc : 0.50568 None best_validation_loss : 0.69314 training_loss : 0.69672 best_epoch : 1.00 average_train_throughput : 6.82e+05 samples/s
I have 4x A100 (80 GB) GPUs on a single node with 256 GB of CPU memory. Can I reach around 2-3 GB/s of read I/O?
Just let me know any params apart from batch_size :)
The `--dataset_type synthetic_gpu` method uses synthetic data generated on the fly; it doesn't read anything from disk. If you want to stress the I/O, there's a method to generate the synthetic data and store it on disk. You can use this script for it: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/dlrm/scripts/prepare_synthetic_dataset.py
Once you've saved the data, you should remove the `--dataset_type synthetic_gpu` option from the training command line and instead pass the path to the synthetic data you've saved on disk. You can achieve this with: `dlrm.scripts.main --dataset <path_to_synthetic_data>`
I haven't tested this example on a 4x A100-80GB node. My quick estimate is that the default settings in this script go through 1.7 GB of compressed data per second with a full DGX A100-80GB. However, you could increase this by making the neural network faster, e.g., by making the top MLP smaller (see this parameter: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/dlrm/scripts/main.py#L70) or by decreasing the embedding dimension (see this parameter: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/dlrm/scripts/main.py#L69)
Alternatively, you could write a script that only performs dataloading and nothing else.
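A dataloading-only benchmark along those lines can be as simple as streaming the saved files and timing the reads. The sketch below is generic (the `/mnt/dlrm_synthetic_data` path and `*.bin` glob are assumptions matching the command used later in this thread, and the chunk size is arbitrary); it is not the repo's dataloader, just a raw disk-throughput check.

```python
import time
from glob import glob

def measure_read_throughput(paths, chunk_bytes=16 << 20):
    """Stream each file in chunk_bytes-sized reads; return (total_bytes, seconds)."""
    total = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb", buffering=0) as f:  # unbuffered, so we time real reads
            while True:
                chunk = f.read(chunk_bytes)
                if not chunk:
                    break
                total += len(chunk)
    return total, time.perf_counter() - start

if __name__ == "__main__":
    # Hypothetical path: point this at wherever the synthetic dataset was saved.
    files = sorted(glob("/mnt/dlrm_synthetic_data/**/*.bin", recursive=True))
    nbytes, secs = measure_read_throughput(files)
    if secs > 0:
        print(f"read {nbytes / 1e9:.2f} GB in {secs:.1f} s "
              f"-> {nbytes / 1e9 / secs:.2f} GB/s")
```

Note that page cache will inflate the numbers on repeat runs, so drop caches (or use a dataset larger than RAM) if you want cold-read figures.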
@tgrel ... wow :) let me try all of these ... Thanks again
@tgrel , I tried your options and was able to get around 1.7 GB/s. Training was fine, but test/validation failed. I haven't changed test_batch_size. Where should I make changes to fit the test dataset in CUDA memory? Is it lowering test_batch_size, or some other params?
root@6fd2f5fc8fc4:/workspace/dlrm# python -m torch.distributed.launch --no_python --use_env --nproc_per_node 4 bash -c './bind.sh python -m dlrm.scripts.main \
--dataset /mnt/dlrm_synthetic_data/ --seed 0 --epochs 1 --amp --cuda_graphs --batch_size 3407872'
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
I0706 02:35:26.907477 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 1
I0706 02:35:26.907486 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 0
I0706 02:35:26.907539 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 2
I0706 02:35:26.907540 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 3
I0706 02:35:26.907639 140737350231872 distributed_c10d.py:252] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
I0706 02:35:26.907645 140737350231872 distributed_c10d.py:252] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
I0706 02:35:26.907689 140737350231872 distributed_c10d.py:252] Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
I0706 02:35:26.907697 140737350231872 distributed_c10d.py:252] Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
DLL 2023-07-06 02:35:26.939505 - PARAMETER logtostderr : False alsologtostderr : False log_dir : v : 0 verbosity : 0 logger_levels : {} stderrthreshold : fatal showprefixforinfo : True run_with_pdb : False pdb_post_mortem : False pdb : False run_with_profiling : False profile_file : None use_cprofile_for_profiling : True only_check_args : False mode : train seed : 0 batch_size : 3407872 test_batch_size : 65536 lr : 24.0 epochs : 1 max_steps : None warmup_factor : 0 warmup_steps : 8000 decay_steps : 24000 decay_start_step : 48000 decay_power : 2 decay_end_lr : 0.0 embedding_type : joint_sparse embedding_dim : 16 top_mlp_sizes : [1024, 512, 256, 1] bottom_mlp_sizes : [64, 32, 16] interaction_op : cuda_dot dataset : /mnt/dlrm_synthetic_data/ feature_spec : feature_spec.yaml dataset_type : parametric shuffle : False shuffle_batch_order : False max_table_size : None hash_indices : False synthetic_dataset_num_entries : 33554432 synthetic_dataset_table_sizes : ['100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000'] synthetic_dataset_numerical_features : 13 synthetic_dataset_use_feature_spec : False load_checkpoint_path : None save_checkpoint_path : None log_path : ./log.json test_freq : None test_after : 0.0 print_freq : 200 benchmark_warmup_steps : 0 base_device : cuda amp : True cuda_graphs : True inference_benchmark_batch_sizes : [1, 64, 4096] inference_benchmark_steps : 200 auc_threshold : None optimized_mlp : True auc_device : GPU backend : nccl bottom_features_ordered : False freeze_mlps : False freeze_embeddings : False Adam_embedding_optimizer : False Adam_MLP_optimizer : False ? : False help : False helpshort : False helpfull : False helpxml : False
/workspace/dlrm/dlrm/data/datasets.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:181.)
return torch.from_numpy(array).to(torch.float32)
Epoch:[0/1] [200/615] eta: 0:02:01 loss: 0.69666475 step_time: 0.291955 lr: 0.6030
Epoch:[0/1] [400/615] eta: 0:01:04 loss: 0.69310302 step_time: 0.304464 lr: 1.2030
Epoch:[0/1] [600/615] eta: 0:00:04 loss: 0.69303346 step_time: 0.305558 lr: 1.8030
Test: [200/32000] step_time: 0.0029
Test: [400/32000] step_time: 0.0031
Test: [600/32000] step_time: 0.0025
Test: [800/32000] step_time: 0.0028
Test: [1000/32000] step_time: 0.0027
Test: [1200/32000] step_time: 0.0027
Test: [1400/32000] step_time: 0.0031
Test: [1600/32000] step_time: 0.0029
Test: [1800/32000] step_time: 0.0029
Test: [2000/32000] step_time: 0.0030
Test: [2200/32000] step_time: 0.0027
Test: [2400/32000] step_time: 0.0027
Test: [2600/32000] step_time: 0.0028
Test: [2800/32000] step_time: 0.0029
Test: [3000/32000] step_time: 0.0027
Test: [3200/32000] step_time: 0.0029
Test: [3400/32000] step_time: 0.0027
Test: [3600/32000] step_time: 0.0026
.
.
.
Test: [31200/32000] step_time: 0.0034
Test: [31400/32000] step_time: 0.0028
Test: [31600/32000] step_time: 0.0028
Test: [31800/32000] step_time: 0.0029
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/workspace/dlrm/dlrm/scripts/main.py", line 842, in <module>
app.run(main)
File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/workspace/dlrm/dlrm/scripts/main.py", line 683, in main
auc, validation_loss = dist_evaluate(trainer.model, data_loader_test)
File "/workspace/dlrm/dlrm/scripts/main.py", line 826, in dist_evaluate
auc = utils.roc_auc_score(y_true, y_score)
File "/workspace/dlrm/dlrm/scripts/utils.py", line 302, in roc_auc_score
desc_score_indices = torch.argsort(y_score, descending=True)
RuntimeError: CUDA out of memory. Tried to allocate 15.62 GiB (GPU 0; 79.35 GiB total capacity; 57.42 GiB already allocated; 8.57 GiB free; 69.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1060 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1061 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1071 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1059) of binary: bash
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
bash FAILED
=======================================
Root Cause:
[0]:
time: 2023-07-06_02:43:46
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 1059)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
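As a side note, before touching any batch sizes, the allocator hint printed in the OOM message above ("If reserved memory is >> allocated memory try setting max_split_size_mb") can be tried. A minimal sketch, with 512 MB as an arbitrary starting value to tune, not a recommendation:

```shell
# Cap the CUDA caching allocator's split size to reduce fragmentation.
# Export this before launching the training command; tune the value empirically.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```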
@tgrel , I tried the inference benchmark and it's running fine, getting around 3 GB/s with an inference batch size of 32768. Check the output below.
One question: how can I increase the duration and the amount of data accessed during inference benchmarking? At least a few minutes over a 500 GB synthetic dataset, so that my DRAM cache (250 GB) is completely filled and I can see more activity from the NVMe SSDs.
root@6fd2f5fc8fc4:/workspace/dlrm# python -m dlrm.scripts.main --mode inference_benchmark --dataset /mnt/dlrm_synthetic_data/ --cuda_graphs
Not using distributed mode
DLL 2023-07-07 02:41:29.795327 - PARAMETER logtostderr : False alsologtostderr : False log_dir : v : 0 verbosity : 0 logger_levels : {} stderrthreshold : fatal showprefixforinfo : True run_with_pdb : False pdb_post_mortem : False pdb : False run_with_profiling : False profile_file : None use_cprofile_for_profiling : True only_check_args : False mode : inference_benchmark seed : 12345 batch_size : 65536 test_batch_size : 65536 lr : 24.0 epochs : 1 max_steps : None warmup_factor : 0 warmup_steps : 8000 decay_steps : 24000 decay_start_step : 48000 decay_power : 2 decay_end_lr : 0.0 embedding_type : joint_sparse embedding_dim : 16 top_mlp_sizes : [1024, 512, 256, 1] bottom_mlp_sizes : [64, 32, 16] interaction_op : cuda_dot dataset : /mnt/dlrm_synthetic_data/ feature_spec : feature_spec.yaml dataset_type : parametric shuffle : False shuffle_batch_order : False max_table_size : None hash_indices : False synthetic_dataset_num_entries : 33554432 synthetic_dataset_table_sizes : ['100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000'] synthetic_dataset_numerical_features : 13 synthetic_dataset_use_feature_spec : False load_checkpoint_path : None save_checkpoint_path : None log_path : ./log.json test_freq : None test_after : 0.0 print_freq : 200 benchmark_warmup_steps : 0 base_device : cuda amp : False cuda_graphs : True inference_benchmark_batch_sizes : [1, 64, 4096, 8192, 16384, 32768, 32768, 32768, 32768, 32768] inference_benchmark_steps : 200 auc_threshold : None optimized_mlp : True auc_device : GPU backend : nccl bottom_features_ordered : False freeze_mlps : False freeze_embeddings : False Adam_embedding_optimizer : False Adam_MLP_optimizer : False ? : False help : False helpshort : False helpfull : False helpxml : False
/workspace/dlrm/dlrm/data/datasets.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:181.)
return torch.from_numpy(array).to(torch.float32)
auc: 0.5738749504089355
auc: 0.5036291480064392
auc: 0.5011399984359741
auc: 0.5010834336280823
auc: 0.5011400580406189
auc: 0.5011354684829712
auc: 0.5011354684829712
auc: 0.5011354684829712
auc: 0.5011354684829712
auc: 0.5011354684829712
DLL 2023-07-07 02:41:58.089163 - () mean_inference_latency_batch_1 : 0.00010312290091789205 s mean_inference_throughput_batch_1 : 9697.167080241608 samples/s mean_inference_latency_batch_64 : 0.00014477624943119068 s mean_inference_throughput_batch_64 : 442061.45864013385 samples/s mean_inference_latency_batch_4096 : 0.00029346580904815835 s mean_inference_throughput_batch_4096 : 13957332.92844291 samples/s mean_inference_latency_batch_8192 : 0.000480277376025135 mean_inference_throughput_batch_8192 : 17056810.10377486 mean_inference_latency_batch_16384 : 0.000780940679979574 mean_inference_throughput_batch_16384 : 20979826.534876548 mean_inference_latency_batch_32768 : 0.0014718452673307889 mean_inference_throughput_batch_32768 : 22263209.813776966
@tgrel , anything you can share on the above two queries?
@tgrel
I commented out the AUC calculation in main.py. That's one way to get it to work, but not a correct solution. How can we calculate AUC using multiple GPUs? Do we need to change something in utils.py?
if is_main_process():
    y_true = torch.cat(y_true)
    y_score = torch.sigmoid(torch.cat(y_score)).float()
    auc = None
    # auc = utils.roc_auc_score(y_true, y_score)
    loss = loss_fn(y_score, y_true).item()
    print(f'test loss: {loss:.8f}')
Test: [27800/32000] step_time: 0.0029
Test: [28000/32000] step_time: 0.0032
Test: [28200/32000] step_time: 0.0028
Test: [28400/32000] step_time: 0.0029
Test: [28600/32000] step_time: 0.0028
Test: [28800/32000] step_time: 0.0028
Test: [29000/32000] step_time: 0.0030
Test: [29200/32000] step_time: 0.0028
Test: [29400/32000] step_time: 0.0030
Test: [29600/32000] step_time: 0.0028
Test: [29800/32000] step_time: 0.0029
Test: [30000/32000] step_time: 0.0028
Test: [30200/32000] step_time: 0.0028
Test: [30400/32000] step_time: 0.0028
Test: [30600/32000] step_time: 0.0028
Test: [30800/32000] step_time: 0.0029
Test: [31000/32000] step_time: 0.0030
Test: [31200/32000] step_time: 0.0029
Test: [31400/32000] step_time: 0.0028
Test: [31600/32000] step_time: 0.0028
Test: [31800/32000] step_time: 0.0028
test loss: 2.65082216
Finished epoch 0 in 0:07:48.
DLL 2023-07-11 07:29:00.994531 - () best_auc : 0.00000 None best_validation_loss : 1000000.00000 training_loss : 2.98322 best_epoch : 0.00 average_train_throughput : 1.25e+07 samples/s
Hi @karanveersingh5623 , please find the answers to your questions below.
1) Regarding the out-of-memory error. It looks like you're trying to compute the AUC score of an extremely long test dataset (32k batches). This is currently not supported. I don't think this is a big problem: it doesn't make sense to compute AUC on a synthetic dataset, so my understanding is that this is irrelevant to your benchmarking efforts. As a side note, your train batch size is very large. I have not tested such large values and cannot guarantee they work correctly.
2) To increase the number of samples of the synthetic dataset, just change this flag to a desired value. This will let you benchmark I/O with a filled cache. Please bear in mind that it'll take a while to generate such a large dataset.
3) I'm not sure why you are trying to compute AUC for a synthetic dataset. I think having it commented out just for your benchmarking is a useful workaround. If you'd still like to fix the OOM error I've mentioned in point 1), you'd need to write a new procedure that computes AUC iteratively and combines the results.
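One possible shape for such an iterative AUC (a hypothetical helper, not part of the repo and not a drop-in replacement for utils.roc_auc_score) is to accumulate fixed-size score histograms batch by batch, so peak memory no longer depends on test-set length; accuracy is then limited by bin resolution, and scores are assumed to be in [0, 1], e.g. after sigmoid:

```python
import torch


def update_hist(pos_hist, neg_hist, y_true, y_score, n_bins):
    """Accumulate fixed-size score histograms; call once per test batch."""
    idx = (y_score.clamp(0, 1) * (n_bins - 1)).long()
    pos_hist += torch.bincount(idx[y_true > 0.5], minlength=n_bins).float()
    neg_hist += torch.bincount(idx[y_true <= 0.5], minlength=n_bins).float()


def auc_from_hist(pos_hist, neg_hist):
    """Trapezoidal AUC from the histograms (exact up to bin resolution)."""
    pos = pos_hist.flip(0)  # sweep the threshold from high score to low
    neg = neg_hist.flip(0)
    tpr = torch.cumsum(pos, 0) / pos.sum().clamp(min=1)
    fpr = torch.cumsum(neg, 0) / neg.sum().clamp(min=1)
    zero = torch.zeros(1, device=tpr.device)
    return torch.trapz(torch.cat([zero, tpr]), torch.cat([zero, fpr])).item()
```

Because the two histograms are plain tensors, they can also be summed across ranks (e.g. with torch.distributed.all_reduce) before the final computation, which is one way to approach the multi-GPU AUC question raised above.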
@tgrel , thanks for coming back. Right, I have commented out the AUC since it's a synthetic dataset. I have another query regarding model training.
Below are my top/bottom MLP sizes, batch sizes, and the nvidia-smi output from model training. The question is: the 4 x A100 (80 GB) GPUs use at most 14 GB of HBM memory per GPU with the parameters below, which generates 1.7 GB/s of I/O. How can we double that without changing batch_size? Because after increasing batch_size beyond 32K, it fails with the error below. Working params:
flags.DEFINE_integer("batch_size", 3407872, "Batch size used for training")
flags.DEFINE_integer("embedding_dim", 8, "Dimensionality of embedding space for categorical features")
flags.DEFINE_list("top_mlp_sizes", [1024, 512, 256, 1], "Linear layer sizes for the top MLP")
flags.DEFINE_list("bottom_mlp_sizes", [32, 16, 8], "Linear layer sizes for the bottom MLP")
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/workspace/dlrm/dlrm/scripts/main.py", line 927, in <module>
app.run(main)
File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/workspace/dlrm/dlrm/scripts/main.py", line 645, in main
loss = trainer.train_step(numerical_features, categorical_features, click)
File "/workspace/dlrm/dlrm/scripts/main.py", line 267, in train_step
return self._warmup_step(*train_step_args)
File "/workspace/dlrm/dlrm/scripts/main.py", line 252, in _warmup_step
self.loss = self._train_step(self.model, *self.static_args)
File "/workspace/dlrm/dlrm/scripts/main.py", line 595, in forward_backward
scaler.scale(loss).backward()
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7fff36f556e0> returned NULL without setting an error
GEMM wgrad failed with 13
The error message indicates an issue during the backward pass of the training step in the DLRM script: a SystemError in the GEMM (General Matrix Multiply) operation that computes the weight gradients.
It could be a memory overflow, but the GPU memory is underutilized. The error could be due to insufficient memory for the backward pass when training with large embedding tables.
- To increase the number of samples of the synthetic dataset, just change this flag to a desired value. This will let you benchmark I/O with a filled cache. Please bear in mind that it'll take a while to generate such a large dataset.
flags.DEFINE_integer("synthetic_dataset_num_entries",
                     default=int(500 * 1024 * 1024 * 32 / 8),  # 1024 batches for single-GPU training by default
                     help="Number of samples per epoch for the synthetic dataset")
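For choosing a value that overflows the 250 GB DRAM cache mentioned earlier, a back-of-envelope sketch; the 160 bytes per sample is an assumption (1 label + 13 numerical + 26 categorical fields at 4 bytes each), not the repo's actual on-disk record size:

```python
# Hypothetical sizing arithmetic; BYTES_PER_SAMPLE is an assumption about
# the binary format, not a measured value. Substitute the real record size.
BYTES_PER_SAMPLE = (1 + 13 + 26) * 4   # 160 bytes per sample (assumed)
TARGET_BYTES = 500 * 1024**3           # ~500 GiB, comfortably above a 250 GB cache
num_entries = TARGET_BYTES // BYTES_PER_SAMPLE
print(num_entries)
```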
@tgrel , is there anything you can share on the above?
Hi @karanveersingh5623,
Regarding 1) – there's a command-line argument here that controls the number of steps for the inference benchmark. Increasing this value appropriately should solve the issue.
Regarding 2) – inference on a model that fits on a single GPU is a trivially parallelizable workload. You can just run 4 scripts in parallel, one for each GPU.
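A hedged sketch of that layout; the loop only prints one command per GPU (pipe to bash, or drop the echo, on the actual machine), and the flags mirror the single-process run earlier in the thread:

```shell
# Print one inference-benchmark command per GPU; CUDA_VISIBLE_DEVICES pins
# each process to its own device.
for gpu in 0 1 2 3; do
  echo "CUDA_VISIBLE_DEVICES=$gpu python -m dlrm.scripts.main --mode inference_benchmark --dataset /mnt/dlrm_synthetic_data/ --cuda_graphs &"
done
echo "wait"
```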
Closing the issue.
Related to Model/Framework(s) (e.g. GNMT/PyTorch or FasterTransformer/All)
Describe the bug
The Criteo dataset for DLRM is in the form of day_0.gz to day_23.gz. When using the pre-processing docker image, below is the error.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
prepare_dataset.sh should be able to take the *.gz files and then process the dataset, but it's failing. Please let me know what I am doing wrong.
Environment
Please provide at least:
docker build -t nvidia_dlrm_preprocessing -f Dockerfile_preprocessing .
docker run --runtime=nvidia -it --rm --ipc=host -v /mnt/dlrm/criterio_dataset:/data/dlrm nvidia_dlrm_preprocessing bash
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+