NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[DLRM/PyTorch] Criteo data preprocessing is failing #1305

karanveersingh5623 closed this issue 1 year ago

karanveersingh5623 commented 1 year ago

Related to Model/Framework(s): DLRM/PyTorch

Describe the bug: The Criteo dataset for DLRM is in the form of day_0.gz to day_23.gz. When using the pre-processing docker image, the run fails with the error below:

(rapids) root@00cae536f649:/workspace/dlrm/preproc# ./prepare_dataset.sh 2 GPU Spark
+ ls -ltrash
total 112K
4.0K -rw-r--r--. 1 root root 1.6K Apr 20 11:06 NVT_shuffle_spark.py
4.0K -rw-r--r--. 1 root root 1.3K Apr 20 11:06 DGX-A100_config.sh
4.0K -rw-r--r--. 1 root root 1.3K Apr 20 11:06 DGX-2_config.sh
8.0K -rw-r--r--. 1 root root 5.2K Apr 20 11:06 split_dataset.py
4.0K -rwxr-xr-x. 1 root root 1.1K Apr 20 11:06 run_spark.sh
8.0K -rw-r--r--. 1 root root 7.6K Apr 20 11:06 run_spark_gpu_DGX-A100.sh
8.0K -rwxr-xr-x. 1 root root 7.6K Apr 20 11:06 run_spark_gpu_DGX-2.sh
8.0K -rwxr-xr-x. 1 root root 5.7K Apr 20 11:06 run_spark_cpu.sh
4.0K -rwxr-xr-x. 1 root root 3.1K Apr 20 11:06 run_NVTabular.sh
 12K -rw-r--r--. 1 root root  11K Apr 20 11:06 preproc_NVTabular.py
4.0K -rw-r--r--. 1 root root 3.4K Apr 20 11:06 parquet_to_binary.py
   0 drwxr-xr-x  1 root root   21 Jun 13 05:26 ..
4.0K -rwxrwxrwx. 1 root root 3.0K Jun 13 09:55 prepare_dataset.sh
4.0K -rwxr-xr-x  1 root root 1.1K Jun 14 01:06 verify_criteo_downloaded.sh
8.0K -rw-r--r--  1 root root 6.6K Jun 14 01:08 submit_dict_log.txt
8.0K -rw-r--r--  1 root root 6.6K Jun 14 01:08 submit_train_log.txt
 20K -rw-r--r--  1 root root  20K Jun 14 01:14 spark_data_utils.py
   0 drwxr-xr-x. 1 root root  149 Jun 14 01:14 .
+ rm -rf /data/dlrm/spark
+ rm -rf /data/dlrm/intermediate_binary
+ rm -rf /data/dlrm/output
+ rm -rf /data/dlrm/criteo_parquet
+ rm -rf /data/dlrm/binary_dataset
+ download_dir=/data/dlrm
+ ./verify_criteo_downloaded.sh /data/dlrm
++ download_dir=/data/dlrm
++ cd /data/dlrm
+++ seq 0 23
++ for i in $(seq 0 23)
++ filename=day_0.gz
++ '[' -f day_0.gz ']'
++ echo 'day_0.gz exists, OK'
day_0.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_1.gz
++ '[' -f day_1.gz ']'
++ echo 'day_1.gz exists, OK'
day_1.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_2.gz
++ '[' -f day_2.gz ']'
++ echo 'day_2.gz exists, OK'
day_2.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_3.gz
++ '[' -f day_3.gz ']'
++ echo 'day_3.gz exists, OK'
day_3.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_4.gz
++ '[' -f day_4.gz ']'
++ echo 'day_4.gz exists, OK'
day_4.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_5.gz
++ '[' -f day_5.gz ']'
++ echo 'day_5.gz exists, OK'
day_5.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_6.gz
++ '[' -f day_6.gz ']'
++ echo 'day_6.gz exists, OK'
day_6.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_7.gz
++ '[' -f day_7.gz ']'
++ echo 'day_7.gz exists, OK'
day_7.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_8.gz
++ '[' -f day_8.gz ']'
++ echo 'day_8.gz exists, OK'
day_8.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_9.gz
++ '[' -f day_9.gz ']'
++ echo 'day_9.gz exists, OK'
day_9.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_10.gz
++ '[' -f day_10.gz ']'
++ echo 'day_10.gz exists, OK'
day_10.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_11.gz
++ '[' -f day_11.gz ']'
++ echo 'day_11.gz exists, OK'
day_11.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_12.gz
++ '[' -f day_12.gz ']'
++ echo 'day_12.gz exists, OK'
day_12.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_13.gz
++ '[' -f day_13.gz ']'
++ echo 'day_13.gz exists, OK'
day_13.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_14.gz
++ '[' -f day_14.gz ']'
++ echo 'day_14.gz exists, OK'
day_14.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_15.gz
++ '[' -f day_15.gz ']'
++ echo 'day_15.gz exists, OK'
day_15.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_16.gz
++ '[' -f day_16.gz ']'
++ echo 'day_16.gz exists, OK'
day_16.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_17.gz
++ '[' -f day_17.gz ']'
++ echo 'day_17.gz exists, OK'
day_17.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_18.gz
++ '[' -f day_18.gz ']'
++ echo 'day_18.gz exists, OK'
day_18.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_19.gz
++ '[' -f day_19.gz ']'
++ echo 'day_19.gz exists, OK'
day_19.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_20.gz
++ '[' -f day_20.gz ']'
++ echo 'day_20.gz exists, OK'
day_20.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_21.gz
++ '[' -f day_21.gz ']'
++ echo 'day_21.gz exists, OK'
day_21.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_22.gz
++ '[' -f day_22.gz ']'
++ echo 'day_22.gz exists, OK'
day_22.gz exists, OK
++ for i in $(seq 0 23)
++ filename=day_23.gz
++ '[' -f day_23.gz ']'
++ echo 'day_23.gz exists, OK'
day_23.gz exists, OK
++ cd -
/workspace/dlrm/preproc
++ echo 'Criteo data verified'
Criteo data verified
+ output_path=/data/dlrm/output
+ '[' Spark = NVTabular ']'
+ '[' -f /data/dlrm/output/train/_SUCCESS ']'
+ echo 'Performing spark preprocessing'
Performing spark preprocessing
+ ./run_spark.sh GPU /data/dlrm /data/dlrm/output 2
Input mode option: GPU
Run with GPU.
Starting spark standalone
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-00cae536f649.out
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-00cae536f649.out
Generating the dictionary...
23/06/14 01:15:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/06/14 01:15:15 INFO SparkContext: Running Spark version 3.0.1
23/06/14 01:15:15 INFO ResourceUtils: ==============================================================
23/06/14 01:15:15 INFO ResourceUtils: Resources for spark.driver:

23/06/14 01:15:15 INFO ResourceUtils: ==============================================================
23/06/14 01:15:15 INFO SparkContext: Submitted application: spark_data_utils.py
23/06/14 01:15:15 INFO SecurityManager: Changing view acls to: root
23/06/14 01:15:15 INFO SecurityManager: Changing modify acls to: root
23/06/14 01:15:15 INFO SecurityManager: Changing view acls groups to:
23/06/14 01:15:15 INFO SecurityManager: Changing modify acls groups to:
23/06/14 01:15:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
23/06/14 01:15:15 INFO Utils: Successfully started service 'sparkDriver' on port 40960.
23/06/14 01:15:15 INFO SparkEnv: Registering MapOutputTracker
23/06/14 01:15:15 INFO SparkEnv: Registering BlockManagerMaster
23/06/14 01:15:15 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/06/14 01:15:15 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/06/14 01:15:15 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/06/14 01:15:15 INFO DiskBlockManager: Created local directory at /data/dlrm/spark/tmp/blockmgr-8be2148e-a8f8-4622-a61e-fe8bb4eabc86
23/06/14 01:15:15 INFO MemoryStore: MemoryStore started with capacity 16.9 GiB
23/06/14 01:15:15 INFO SparkEnv: Registering OutputCommitCoordinator
23/06/14 01:15:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/06/14 01:15:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://00cae536f649:4040
23/06/14 01:15:16 INFO DriverPluginContainer: Initialized driver component for plugin com.nvidia.spark.SQLPlugin.
23/06/14 01:15:16 WARN SparkContext: The configuration of resource: gpu (exec = 1, task = 1/100, runnable tasks = 100) will result in wasted resources due to resource CPU limiting the number of runnable tasks per executor to: 5. Please adjust your configuration.
23/06/14 01:15:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://00cae536f649:7077...
23/06/14 01:15:16 INFO TransportClientFactory: Successfully created connection to 00cae536f649/172.17.0.2:7077 after 42 ms (0 ms spent in bootstraps)
23/06/14 01:15:16 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20230614011516-0000
23/06/14 01:15:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43916.
23/06/14 01:15:16 INFO NettyBlockTransferService: Server created on 00cae536f649:43916
23/06/14 01:15:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/06/14 01:15:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 00cae536f649, 43916, None)
23/06/14 01:15:16 INFO BlockManagerMasterEndpoint: Registering block manager 00cae536f649:43916 with 16.9 GiB RAM, BlockManagerId(driver, 00cae536f649, 43916, None)
23/06/14 01:15:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 00cae536f649, 43916, None)
23/06/14 01:15:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 00cae536f649, 43916, None)
23/06/14 01:15:16 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
23/06/14 01:15:17 WARN SQLExecPlugin: Installing extensions to enable rapids GPU SQL support. To disable GPU support set `spark.rapids.sql.enabled` to false
23/06/14 01:15:17 INFO ShimLoader: Loading shim for Spark version: 3.0.1
23/06/14 01:15:17 INFO ShimLoader: Found shims: List(com.nvidia.spark.rapids.shims.spark301.SparkShimServiceProvider@36baec65)
23/06/14 01:15:17 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true
23/06/14 01:15:17 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/workspace/dlrm/preproc/spark-warehouse').
23/06/14 01:15:17 INFO SharedState: Warehouse path is 'file:/workspace/dlrm/preproc/spark-warehouse'.
23/06/14 01:15:18 INFO InMemoryFileIndex: It took 48 ms to list leaf files for 24 paths.
23/06/14 01:15:20 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/06/14 01:15:20 INFO FileSourceStrategy: Pruning directories with:
23/06/14 01:15:20 INFO FileSourceStrategy: Pushed Filters:
23/06/14 01:15:20 INFO FileSourceStrategy: Post-Scan Filters:
23/06/14 01:15:20 INFO FileSourceStrategy: Output Data Schema: struct<_c14: string, _c15: string, _c16: string, _c17: string, _c18: string ... 24 more fields>
23/06/14 01:15:20 INFO HiveConf: Found configuration file null
23/06/14 01:15:20 WARN GpuOverrides:
*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHadoopFsRelationCommand> will run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> (monotonically_increasing_id() - cast(shiftleft(part_id#100, 33) as bigint)) AS mono_id#105L will run on GPU
      *Expression <Subtract> (monotonically_increasing_id() - cast(shiftleft(part_id#100, 33) as bigint)) will run on GPU
        *Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
        *Expression <Cast> cast(shiftleft(part_id#100, 33) as bigint) will run on GPU
          *Expression <ShiftLeft> shiftleft(part_id#100, 33) will run on GPU
    *Exec <ProjectExec> will run on GPU
      *Expression <Alias> SPARK_PARTITION_ID() AS part_id#100 will run on GPU
        *Expression <SparkPartitionID> SPARK_PARTITION_ID() will run on GPU
      *Exec <SortExec> will run on GPU
        *Expression <SortOrder> column_id#84 ASC NULLS FIRST will run on GPU
        *Expression <SortOrder> count#93L DESC NULLS LAST will run on GPU
        *Exec <ShuffleExchangeExec> will run on GPU
          *Partitioning <RangePartitioning> will run on GPU
            *Expression <SortOrder> column_id#84 ASC NULLS FIRST will run on GPU
            *Expression <SortOrder> count#93L DESC NULLS LAST will run on GPU
          *Exec <FilterExec> will run on GPU
            *Expression <Or> (NOT column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) OR (count#93L >= 2)) will run on GPU
              *Expression <Not> NOT column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) will run on GPU
                *Expression <InSet> column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) will run on GPU
              *Expression <GreaterThanOrEqual> (count#93L >= 2) will run on GPU
            *Exec <HashAggregateExec> will run on GPU
              *Expression <AggregateExpression> count(1) will run on GPU
                *Expression <Count> count(1) will run on GPU
              *Expression <Alias> count(1)#92L AS count#93L will run on GPU
              *Exec <ShuffleExchangeExec> will run on GPU
                *Partitioning <HashPartitioning> will run on GPU
                *Exec <HashAggregateExec> will run on GPU
                  *Expression <AggregateExpression> partial_count(1) will run on GPU
                    *Expression <Count> count(1) will run on GPU
                  *Exec <ProjectExec> will run on GPU
                    *Expression <Alias> pos#80 AS column_id#84 will run on GPU
                    *Expression <Alias> col#81 AS data#87 will run on GPU
                    *Exec <FilterExec> will run on GPU
                      *Expression <IsNotNull> isnotnull(col#81) will run on GPU
                      *Exec <GenerateExec> will run on GPU
                        *Exec <FileSourceScanExec> will run on GPU

23/06/14 01:15:21 INFO GpuParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 01:15:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/14 01:15:21 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 01:15:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/14 01:15:21 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 01:15:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 401.6 KiB, free 16.9 GiB)
23/06/14 01:15:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KiB, free 16.9 GiB)
23/06/14 01:15:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 00cae536f649:43916 (size: 24.9 KiB, free: 16.9 GiB)
23/06/14 01:15:21 INFO SparkContext: Created broadcast 0 from broadcast at GpuReadCSVFileFormat.scala:46
23/06/14 01:15:21 INFO GpuFileSourceScanExec: Planning scan with bin packing, max size: 1073741824 bytes, open cost is considered as scanning 4194304 bytes.
23/06/14 01:15:21 INFO GpuFileSourceScanExec: Using the original per file parquet reader
23/06/14 01:15:21 INFO CodeGenerator: Code generated in 182.160886 ms
23/06/14 01:15:22 INFO SparkContext: Starting job: collect at GpuRangePartitioner.scala:46
23/06/14 01:15:22 INFO DAGScheduler: Registering RDD 7 (executeColumnar at GpuShuffleCoalesceExec.scala:67) as input to shuffle 0
23/06/14 01:15:22 INFO DAGScheduler: Got job 0 (collect at GpuRangePartitioner.scala:46) with 48 output partitions
23/06/14 01:15:22 INFO DAGScheduler: Final stage: ResultStage 1 (collect at GpuRangePartitioner.scala:46)
23/06/14 01:15:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
23/06/14 01:15:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
23/06/14 01:15:22 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at executeColumnar at GpuShuffleCoalesceExec.scala:67), which has no missing parents
23/06/14 01:15:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 29.9 KiB, free 16.9 GiB)
23/06/14 01:15:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 13.0 KiB, free 16.9 GiB)
23/06/14 01:15:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 00cae536f649:43916 (size: 13.0 KiB, free: 16.9 GiB)
23/06/14 01:15:22 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1223
23/06/14 01:15:22 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at executeColumnar at GpuShuffleCoalesceExec.scala:67) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
23/06/14 01:15:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 24 tasks
23/06/14 01:15:37 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:15:52 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:16:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:16:22 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:16:37 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:16:52 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:17:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:17:22 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:17:37 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 01:17:52 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

To Reproduce Steps to reproduce the behavior:

  1. Created the docker image and ran the pre-processing container
  2. cd /workspace/dlrm/preproc/
  3. ./prepare_dataset.sh 2 GPU Spark
  4. chmod 777 prepare_dataset.sh
  5. ./prepare_dataset.sh 2 GPU Spark
  6. vim prepare_dataset.sh --> change line 40: download_dir=${download_dir:-'/data/dlrm/criteo'} --> download_dir=${download_dir:-'/data/dlrm'}
  7. vim verify_criteo_downloaded.sh --> change line 20: download_dir=${1:-'/data/dlrm/criteo'} --> download_dir=${1:-'/data/dlrm'}, and change line 24: filename=day_${i} --> filename=day_${i}.gz
  8. vim spark_data_utils.py --> change line 242: paths = [os.path.join(folder, 'day_%d' % i) for i in day_range] --> paths = [os.path.join(folder, 'day_%d.gz' % i) for i in day_range]
  9. Changed the above files because prepare_dataset.sh does not pick up the day_*.gz files

Expected behavior: prepare_dataset.sh should be able to take the *.gz files and then process the dataset, but it is failing. Please let me know what I am doing wrong.
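A workaround I am considering instead of editing the scripts is to decompress the archives first. A minimal sketch, assuming the unmodified scripts look for uncompressed day_0 ... day_23 files (the later run below checks for names without the .gz suffix) and that there is enough free disk space under /data/dlrm:

cd /data/dlrm
for i in $(seq 0 23); do
    # -k keeps the original day_${i}.gz next to the decompressed day_${i}
    gunzip -k "day_${i}.gz"
done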

Environment (nvidia-smi output):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

karanveersingh5623 commented 1 year ago

@tgrel, please let me know if any further info is required.

karanveersingh5623 commented 1 year ago

@tgrel, even without changing any of the scripts and running with the day_* files, it is still failing. Do I have to build the docker image with T4 GPUs? The log is below:

(rapids) root@00cae536f649:/workspace/dlrm/preproc# ./prepare_dataset.sh 2 GPU Spark
+ ls -ltrash
total 124K
4.0K -rw-r--r--. 1 root root 1.6K Apr 20 11:06 NVT_shuffle_spark.py
4.0K -rw-r--r--. 1 root root 1.3K Apr 20 11:06 DGX-A100_config.sh
4.0K -rw-r--r--. 1 root root 1.3K Apr 20 11:06 DGX-2_config.sh
8.0K -rw-r--r--. 1 root root 5.2K Apr 20 11:06 split_dataset.py
4.0K -rwxr-xr-x. 1 root root 1.1K Apr 20 11:06 run_spark.sh
8.0K -rw-r--r--. 1 root root 7.6K Apr 20 11:06 run_spark_gpu_DGX-A100.sh
8.0K -rwxr-xr-x. 1 root root 7.6K Apr 20 11:06 run_spark_gpu_DGX-2.sh
8.0K -rwxr-xr-x. 1 root root 5.7K Apr 20 11:06 run_spark_cpu.sh
4.0K -rwxr-xr-x. 1 root root 3.1K Apr 20 11:06 run_NVTabular.sh
 12K -rw-r--r--. 1 root root  11K Apr 20 11:06 preproc_NVTabular.py
4.0K -rw-r--r--. 1 root root 3.4K Apr 20 11:06 parquet_to_binary.py
   0 drwxr-xr-x  1 root root   21 Jun 13 05:26 ..
4.0K -rwxrwxrwx. 1 root root 3.0K Jun 13 09:55 prepare_dataset.sh
8.0K -rw-r--r--  1 root root 6.6K Jun 14 01:08 submit_train_log.txt
 20K -rw-r--r--  1 root root  18K Jun 14 01:24 submit_dict_log.txt
 20K -rw-r--r--  1 root root  20K Jun 14 01:48 spark_data_utils.py
4.0K -rwxr-xr-x  1 root root 1.1K Jun 14 01:48 verify_criteo_downloaded.sh
   0 drwxr-xr-x. 1 root root  149 Jun 14 01:48 .
+ rm -rf /data/dlrm/spark
+ rm -rf /data/dlrm/intermediate_binary
+ rm -rf /data/dlrm/output
+ rm -rf /data/dlrm/criteo_parquet
+ rm -rf /data/dlrm/binary_dataset
+ download_dir=/data/dlrm
+ ./verify_criteo_downloaded.sh /data/dlrm
++ download_dir=/data/dlrm
++ cd /data/dlrm
+++ seq 0 23
++ for i in $(seq 0 23)
++ filename=day_0
++ '[' -f day_0 ']'
++ echo 'day_0 exists, OK'
day_0 exists, OK
++ for i in $(seq 0 23)
++ filename=day_1
++ '[' -f day_1 ']'
++ echo 'day_1 exists, OK'
day_1 exists, OK
++ for i in $(seq 0 23)
++ filename=day_2
++ '[' -f day_2 ']'
++ echo 'day_2 exists, OK'
day_2 exists, OK
++ for i in $(seq 0 23)
++ filename=day_3
++ '[' -f day_3 ']'
++ echo 'day_3 exists, OK'
day_3 exists, OK
++ for i in $(seq 0 23)
++ filename=day_4
++ '[' -f day_4 ']'
++ echo 'day_4 exists, OK'
day_4 exists, OK
++ for i in $(seq 0 23)
++ filename=day_5
++ '[' -f day_5 ']'
++ echo 'day_5 exists, OK'
day_5 exists, OK
++ for i in $(seq 0 23)
++ filename=day_6
++ '[' -f day_6 ']'
++ echo 'day_6 exists, OK'
day_6 exists, OK
++ for i in $(seq 0 23)
++ filename=day_7
++ '[' -f day_7 ']'
++ echo 'day_7 exists, OK'
day_7 exists, OK
++ for i in $(seq 0 23)
++ filename=day_8
++ '[' -f day_8 ']'
++ echo 'day_8 exists, OK'
day_8 exists, OK
++ for i in $(seq 0 23)
++ filename=day_9
++ '[' -f day_9 ']'
++ echo 'day_9 exists, OK'
day_9 exists, OK
++ for i in $(seq 0 23)
++ filename=day_10
++ '[' -f day_10 ']'
++ echo 'day_10 exists, OK'
day_10 exists, OK
++ for i in $(seq 0 23)
++ filename=day_11
++ '[' -f day_11 ']'
++ echo 'day_11 exists, OK'
day_11 exists, OK
++ for i in $(seq 0 23)
++ filename=day_12
++ '[' -f day_12 ']'
++ echo 'day_12 exists, OK'
day_12 exists, OK
++ for i in $(seq 0 23)
++ filename=day_13
++ '[' -f day_13 ']'
++ echo 'day_13 exists, OK'
day_13 exists, OK
++ for i in $(seq 0 23)
++ filename=day_14
++ '[' -f day_14 ']'
++ echo 'day_14 exists, OK'
day_14 exists, OK
++ for i in $(seq 0 23)
++ filename=day_15
++ '[' -f day_15 ']'
++ echo 'day_15 exists, OK'
day_15 exists, OK
++ for i in $(seq 0 23)
++ filename=day_16
++ '[' -f day_16 ']'
++ echo 'day_16 exists, OK'
day_16 exists, OK
++ for i in $(seq 0 23)
++ filename=day_17
++ '[' -f day_17 ']'
++ echo 'day_17 exists, OK'
day_17 exists, OK
++ for i in $(seq 0 23)
++ filename=day_18
++ '[' -f day_18 ']'
++ echo 'day_18 exists, OK'
day_18 exists, OK
++ for i in $(seq 0 23)
++ filename=day_19
++ '[' -f day_19 ']'
++ echo 'day_19 exists, OK'
day_19 exists, OK
++ for i in $(seq 0 23)
++ filename=day_20
++ '[' -f day_20 ']'
++ echo 'day_20 exists, OK'
day_20 exists, OK
++ for i in $(seq 0 23)
++ filename=day_21
++ '[' -f day_21 ']'
++ echo 'day_21 exists, OK'
day_21 exists, OK
++ for i in $(seq 0 23)
++ filename=day_22
++ '[' -f day_22 ']'
++ echo 'day_22 exists, OK'
day_22 exists, OK
++ for i in $(seq 0 23)
++ filename=day_23
++ '[' -f day_23 ']'
++ echo 'day_23 exists, OK'
day_23 exists, OK
++ cd -
/workspace/dlrm/preproc
++ echo 'Criteo data verified'
Criteo data verified
+ output_path=/data/dlrm/output
+ '[' Spark = NVTabular ']'
+ '[' -f /data/dlrm/output/train/_SUCCESS ']'
+ echo 'Performing spark preprocessing'
Performing spark preprocessing
+ ./run_spark.sh GPU /data/dlrm /data/dlrm/output 2
Input mode option: GPU
Run with GPU.
Starting spark standalone
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-00cae536f649.out
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-00cae536f649.out
Generating the dictionary...
23/06/14 08:16:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/06/14 08:16:08 INFO SparkContext: Running Spark version 3.0.1
23/06/14 08:16:08 INFO ResourceUtils: ==============================================================
23/06/14 08:16:08 INFO ResourceUtils: Resources for spark.driver:

23/06/14 08:16:08 INFO ResourceUtils: ==============================================================
23/06/14 08:16:08 INFO SparkContext: Submitted application: spark_data_utils.py
23/06/14 08:16:08 INFO SecurityManager: Changing view acls to: root
23/06/14 08:16:08 INFO SecurityManager: Changing modify acls to: root
23/06/14 08:16:08 INFO SecurityManager: Changing view acls groups to:
23/06/14 08:16:08 INFO SecurityManager: Changing modify acls groups to:
23/06/14 08:16:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
23/06/14 08:16:09 INFO Utils: Successfully started service 'sparkDriver' on port 34076.
23/06/14 08:16:09 INFO SparkEnv: Registering MapOutputTracker
23/06/14 08:16:09 INFO SparkEnv: Registering BlockManagerMaster
23/06/14 08:16:09 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/06/14 08:16:09 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/06/14 08:16:09 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/06/14 08:16:09 INFO DiskBlockManager: Created local directory at /data/dlrm/spark/tmp/blockmgr-a83c50b0-6810-4489-9ac2-844b63b09261
23/06/14 08:16:09 INFO MemoryStore: MemoryStore started with capacity 16.9 GiB
23/06/14 08:16:09 INFO SparkEnv: Registering OutputCommitCoordinator
23/06/14 08:16:09 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/06/14 08:16:09 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://00cae536f649:4040
23/06/14 08:16:09 INFO DriverPluginContainer: Initialized driver component for plugin com.nvidia.spark.SQLPlugin.
23/06/14 08:16:09 WARN SparkContext: The configuration of resource: gpu (exec = 1, task = 1/100, runnable tasks = 100) will result in wasted resources due to resource CPU limiting the number of runnable tasks per executor to: 5. Please adjust your configuration.
23/06/14 08:16:09 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://00cae536f649:7077...
23/06/14 08:16:09 INFO TransportClientFactory: Successfully created connection to 00cae536f649/172.17.0.2:7077 after 47 ms (0 ms spent in bootstraps)
23/06/14 08:16:10 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20230614081610-0000
23/06/14 08:16:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38388.
23/06/14 08:16:10 INFO NettyBlockTransferService: Server created on 00cae536f649:38388
23/06/14 08:16:10 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/06/14 08:16:10 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO BlockManagerMasterEndpoint: Registering block manager 00cae536f649:38388 with 16.9 GiB RAM, BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 00cae536f649, 38388, None)
23/06/14 08:16:10 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
23/06/14 08:16:10 WARN SQLExecPlugin: Installing extensions to enable rapids GPU SQL support. To disable GPU support set `spark.rapids.sql.enabled` to false
23/06/14 08:16:10 INFO ShimLoader: Loading shim for Spark version: 3.0.1
23/06/14 08:16:10 INFO ShimLoader: Found shims: List(com.nvidia.spark.rapids.shims.spark301.SparkShimServiceProvider@abcbf79)
23/06/14 08:16:10 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true
23/06/14 08:16:10 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/workspace/dlrm/preproc/spark-warehouse').
23/06/14 08:16:10 INFO SharedState: Warehouse path is 'file:/workspace/dlrm/preproc/spark-warehouse'.
23/06/14 08:16:11 INFO InMemoryFileIndex: It took 54 ms to list leaf files for 24 paths.
23/06/14 08:16:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/06/14 08:16:14 INFO FileSourceStrategy: Pruning directories with:
23/06/14 08:16:14 INFO FileSourceStrategy: Pushed Filters:
23/06/14 08:16:14 INFO FileSourceStrategy: Post-Scan Filters:
23/06/14 08:16:14 INFO FileSourceStrategy: Output Data Schema: struct<_c14: string, _c15: string, _c16: string, _c17: string, _c18: string ... 24 more fields>
23/06/14 08:16:14 INFO HiveConf: Found configuration file null
23/06/14 08:16:14 WARN GpuOverrides:
*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHadoopFsRelationCommand> will run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> (monotonically_increasing_id() - cast(shiftleft(part_id#100, 33) as bigint)) AS mono_id#105L will run on GPU
      *Expression <Subtract> (monotonically_increasing_id() - cast(shiftleft(part_id#100, 33) as bigint)) will run on GPU
        *Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
        *Expression <Cast> cast(shiftleft(part_id#100, 33) as bigint) will run on GPU
          *Expression <ShiftLeft> shiftleft(part_id#100, 33) will run on GPU
    *Exec <ProjectExec> will run on GPU
      *Expression <Alias> SPARK_PARTITION_ID() AS part_id#100 will run on GPU
        *Expression <SparkPartitionID> SPARK_PARTITION_ID() will run on GPU
      *Exec <SortExec> will run on GPU
        *Expression <SortOrder> column_id#84 ASC NULLS FIRST will run on GPU
        *Expression <SortOrder> count#93L DESC NULLS LAST will run on GPU
        *Exec <ShuffleExchangeExec> will run on GPU
          *Partitioning <RangePartitioning> will run on GPU
            *Expression <SortOrder> column_id#84 ASC NULLS FIRST will run on GPU
            *Expression <SortOrder> count#93L DESC NULLS LAST will run on GPU
          *Exec <FilterExec> will run on GPU
            *Expression <Or> (NOT column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) OR (count#93L >= 2)) will run on GPU
              *Expression <Not> NOT column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) will run on GPU
                *Expression <InSet> column_id#84 INSET (0,5,10,24,25,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,11,23,8,19,4,15) will run on GPU
              *Expression <GreaterThanOrEqual> (count#93L >= 2) will run on GPU
            *Exec <HashAggregateExec> will run on GPU
              *Expression <AggregateExpression> count(1) will run on GPU
                *Expression <Count> count(1) will run on GPU
              *Expression <Alias> count(1)#92L AS count#93L will run on GPU
              *Exec <ShuffleExchangeExec> will run on GPU
                *Partitioning <HashPartitioning> will run on GPU
                *Exec <HashAggregateExec> will run on GPU
                  *Expression <AggregateExpression> partial_count(1) will run on GPU
                    *Expression <Count> count(1) will run on GPU
                  *Exec <ProjectExec> will run on GPU
                    *Expression <Alias> pos#80 AS column_id#84 will run on GPU
                    *Expression <Alias> col#81 AS data#87 will run on GPU
                    *Exec <FilterExec> will run on GPU
                      *Expression <IsNotNull> isnotnull(col#81) will run on GPU
                      *Exec <GenerateExec> will run on GPU
                        *Exec <FileSourceScanExec> will run on GPU

23/06/14 08:16:15 INFO GpuParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 08:16:15 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/14 08:16:15 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 08:16:15 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/14 08:16:15 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/06/14 08:16:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 401.6 KiB, free 16.9 GiB)
23/06/14 08:16:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KiB, free 16.9 GiB)
23/06/14 08:16:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 00cae536f649:38388 (size: 24.9 KiB, free: 16.9 GiB)
23/06/14 08:16:15 INFO SparkContext: Created broadcast 0 from broadcast at GpuReadCSVFileFormat.scala:46
23/06/14 08:16:15 INFO GpuFileSourceScanExec: Planning scan with bin packing, max size: 1073741824 bytes, open cost is considered as scanning 4194304 bytes.
23/06/14 08:16:15 INFO GpuFileSourceScanExec: Using the original per file parquet reader
23/06/14 08:16:16 INFO CodeGenerator: Code generated in 198.599907 ms
23/06/14 08:16:16 INFO SparkContext: Starting job: collect at GpuRangePartitioner.scala:46
23/06/14 08:16:16 INFO DAGScheduler: Registering RDD 7 (executeColumnar at GpuShuffleCoalesceExec.scala:67) as input to shuffle 0
23/06/14 08:16:16 INFO DAGScheduler: Got job 0 (collect at GpuRangePartitioner.scala:46) with 48 output partitions
23/06/14 08:16:16 INFO DAGScheduler: Final stage: ResultStage 1 (collect at GpuRangePartitioner.scala:46)
23/06/14 08:16:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
23/06/14 08:16:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
23/06/14 08:16:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at executeColumnar at GpuShuffleCoalesceExec.scala:67), which has no missing parents
23/06/14 08:16:16 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 29.9 KiB, free 16.9 GiB)
23/06/14 08:16:16 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 13.0 KiB, free 16.9 GiB)
23/06/14 08:16:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 00cae536f649:38388 (size: 13.0 KiB, free: 16.9 GiB)
23/06/14 08:16:16 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1223
23/06/14 08:16:16 INFO DAGScheduler: Submitting 1037 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at executeColumnar at GpuShuffleCoalesceExec.scala:67) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
23/06/14 08:16:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1037 tasks
23/06/14 08:16:31 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:16:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:17:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:17:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/06/14 08:17:31 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
karanveersingh5623 commented 1 year ago

Created the docker image again with --build-arg NUMBER_OF_GPUS=2, and now it's working:

docker build -t nvidia_dlrm_preprocessing -f Dockerfile_preprocessing . --build-arg NUMBER_OF_GPUS=2
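Roughly how the rebuilt image gets run (a sketch only; the mount point matches the paths in the logs above, while the run flags here are an example rather than the repo's documented command):

docker run --gpus all -it --rm --ipc=host \
    -v /data/dlrm:/data/dlrm \
    nvidia_dlrm_preprocessing bash
# inside the container:
#   cd /workspace/dlrm/preproc && ./prepare_dataset.sh 2 GPU Spark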

karanveersingh5623 commented 1 year ago

Don't close the issue yet, I'm just waiting for the dataset preprocessing to complete.

karanveersingh5623 commented 1 year ago

@tgrel, everything was running fine but it failed at the end with a Spark Java error: an RMM out-of-memory ("Maximum pool size exceeded") during a GPU broadcast hash join. The log is below.
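One thing worth trying before re-running (a sketch only; the keys are standard Spark / spark-rapids settings, but the values are untested guesses for 16 GB GPUs, and any explicit --conf flags already set in run_spark.sh would take precedence over these defaults) is to reduce GPU memory pressure by appending conservative defaults to the container's Spark config:

# SPARK_HOME is /opt/spark per the logs above
cat >> /opt/spark/conf/spark-defaults.conf <<'EOF'
spark.sql.autoBroadcastJoinThreshold   -1
spark.sql.shuffle.partitions           1200
spark.rapids.sql.batchSizeBytes        536870912
spark.rapids.memory.gpu.allocFraction  0.8
EOF

The failing task is deserializing a broadcast table for a GPU broadcast hash join, so the intent is to disable automatic broadcast joins and shrink the per-batch GPU allocations.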

23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece50 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece76 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece97 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece212 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece155 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece219 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece60 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece198 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece91 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece159 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece174 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece234 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:50 INFO BlockManagerInfo: Added broadcast_199_piece88 in memory on 172.17.0.2:46754 (size: 4.0 MiB, free: 12.4 GiB)
23/06/14 09:00:51 INFO TaskSetManager: Starting task 2.2 in stage 110.0 (TID 370, 172.17.0.2, executor 4, partition 2, PROCESS_LOCAL, 7748 bytes)
23/06/14 09:00:51 WARN TaskSetManager: Lost task 9.2 in stage 110.0 (TID 360, 172.17.0.2, executor 4): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

23/06/14 09:00:52 INFO TaskSetManager: Starting task 9.3 in stage 110.0 (TID 371, 172.17.0.2, executor 5, partition 9, PROCESS_LOCAL, 7748 bytes)
23/06/14 09:00:52 INFO TaskSetManager: Lost task 1.3 in stage 110.0 (TID 367) on 172.17.0.2, executor 5: java.lang.OutOfMemoryError (Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded) [duplicate 1]
23/06/14 09:00:52 ERROR TaskSetManager: Task 1 in stage 110.0 failed 4 times; aborting job
23/06/14 09:00:52 INFO TaskSchedulerImpl: Cancelling stage 110
23/06/14 09:00:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 110: Stage cancelled
23/06/14 09:00:52 INFO TaskSchedulerImpl: Stage 110 was cancelled
23/06/14 09:00:52 INFO DAGScheduler: ResultStage 110 (collect at GpuRangePartitioner.scala:46) failed in 158.971 s due to Job aborted due to stage failure: Task 1 in stage 110.0 failed 4 times, most recent failure: Lost task 1.3 in stage 110.0 (TID 367, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
23/06/14 09:00:52 INFO DAGScheduler: Job 84 failed: collect at GpuRangePartitioner.scala:46, took 158.987766 s
23/06/14 09:00:52 ERROR GpuFileFormatWriter: Aborting job 7868955a-2f77-4dd0-8a56-d6cce8f52431.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 110.0 failed 4 times, most recent failure: Lost task 1.3 in stage 110.0 (TID 367, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
        at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
        at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
        at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
        at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Traceback (most recent call last):
  File "/workspace/dlrm/preproc/spark_data_utils.py", line 506, in <module>
    _main()
  File "/workspace/dlrm/preproc/spark_data_utils.py", line 499, in _main
    partitionBy=partitionBy)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 936, in parquet
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1185.parquet.
: org.apache.spark.SparkException: Job aborted.
        at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:250)
        at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
        at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 110.0 failed 4 times, most recent failure: Lost task 1.3 in stage 110.0 (TID 367, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
        at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
        at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
        ... 39 more
Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more

23/06/14 09:00:52 INFO SparkContext: Invoking stop() from shutdown hook
23/06/14 09:00:52 INFO SparkUI: Stopped Spark web UI at http://c1ee6b3759ad:4040
23/06/14 09:00:52 INFO StandaloneSchedulerBackend: Shutting down all executors
23/06/14 09:00:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/06/14 09:00:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/06/14 09:00:52 INFO MemoryStore: MemoryStore cleared
23/06/14 09:00:52 INFO BlockManager: BlockManager stopped
23/06/14 09:00:52 INFO BlockManagerMaster: BlockManagerMaster stopped
23/06/14 09:00:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/06/14 09:00:52 INFO SparkContext: Successfully stopped SparkContext
23/06/14 09:00:52 INFO ShutdownHookManager: Shutdown hook called
23/06/14 09:00:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-fa523520-3b66-4941-912d-976228e809cb
23/06/14 09:00:52 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-8c50cf5d-7a54-43c2-8461-259668e7e0a9
23/06/14 09:00:52 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-8c50cf5d-7a54-43c2-8461-259668e7e0a9/pyspark-ff72a275-b26a-4521-9c8e-d510efd54eaf
stopping org.apache.spark.deploy.master.Master
stopping org.apache.spark.deploy.worker.Worker
+ preprocessing_version=Spark
+ conversion_intermediate_dir=/data/dlrm/intermediate_binary
+ final_output_dir=/data/dlrm/binary_dataset
+ source DGX-2_config.sh
++ export TOTAL_CORES=80
++ TOTAL_CORES=80
++ export NUM_EXECUTORS=16
++ NUM_EXECUTORS=16
++ export NUM_EXECUTOR_CORES=5
++ NUM_EXECUTOR_CORES=5
++ export TOTAL_MEMORY=800
++ TOTAL_MEMORY=800
++ export DRIVER_MEMORY=32
++ DRIVER_MEMORY=32
++ export EXECUTOR_MEMORY=32
++ EXECUTOR_MEMORY=32
+ '[' -d /data/dlrm/binary_dataset/train ']'
+ echo 'Performing final conversion to a custom data format'
Performing final conversion to a custom data format
+ python parquet_to_binary.py --parallel_jobs 80 --src_dir /data/dlrm/output --intermediate_dir /data/dlrm/intermediate_binary --dst_dir /data/dlrm/binary_dataset
Processing train files...
0it [00:00, ?it/s]
Train files conversion done
Processing test files...
0it [00:00, ?it/s]
Test files conversion done
Processing validation files...
0it [00:00, ?it/s]
Validation files conversion done
Concatenating train files
cat: '/data/dlrm/intermediate_binary/train/*.bin': No such file or directory
Concatenating test files
cat: '/data/dlrm/intermediate_binary/test/*.bin': No such file or directory
Concatenating validation files
cat: '/data/dlrm/intermediate_binary/validation/*.bin': No such file or directory
Done
+ cp /data/dlrm/output/model_size.json /data/dlrm/binary_dataset/model_size.json
+ python split_dataset.py --dataset /data/dlrm/binary_dataset --output /data/dlrm/binary_dataset/split
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
+ rm /data/dlrm/binary_dataset/train_data.bin
+ rm /data/dlrm/binary_dataset/validation_data.bin
+ rm /data/dlrm/binary_dataset/test_data.bin
+ rm /data/dlrm/binary_dataset/model_size.json
+ mv /data/dlrm/binary_dataset/split/feature_spec.yaml /data/dlrm/binary_dataset/split/test /data/dlrm/binary_dataset/split/train /data/dlrm/binary_dataset/split/validation /data/dlrm/binary_dataset
+ rm -rf /data/dlrm/binary_dataset/split
+ echo 'Done preprocessing the Criteo Kaggle Dataset'
Done preprocessing the Criteo Kaggle Dataset
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/*
24K     /mnt/dlrm/dataset_criterio/binary_dataset
47G     /mnt/dlrm/dataset_criterio/day_0
48G     /mnt/dlrm/dataset_criterio/day_1
44G     /mnt/dlrm/dataset_criterio/day_10
37G     /mnt/dlrm/dataset_criterio/day_11
40G     /mnt/dlrm/dataset_criterio/day_12
46G     /mnt/dlrm/dataset_criterio/day_13
46G     /mnt/dlrm/dataset_criterio/day_14
45G     /mnt/dlrm/dataset_criterio/day_15
43G     /mnt/dlrm/dataset_criterio/day_16
39G     /mnt/dlrm/dataset_criterio/day_17
34G     /mnt/dlrm/dataset_criterio/day_18
37G     /mnt/dlrm/dataset_criterio/day_19
47G     /mnt/dlrm/dataset_criterio/day_2
46G     /mnt/dlrm/dataset_criterio/day_20
46G     /mnt/dlrm/dataset_criterio/day_21
45G     /mnt/dlrm/dataset_criterio/day_22
43G     /mnt/dlrm/dataset_criterio/day_23
43G     /mnt/dlrm/dataset_criterio/day_3
36G     /mnt/dlrm/dataset_criterio/day_4
41G     /mnt/dlrm/dataset_criterio/day_5
49G     /mnt/dlrm/dataset_criterio/day_6
48G     /mnt/dlrm/dataset_criterio/day_7
46G     /mnt/dlrm/dataset_criterio/day_8
47G     /mnt/dlrm/dataset_criterio/day_9
16K     /mnt/dlrm/dataset_criterio/intermediate_binary
3.9G    /mnt/dlrm/dataset_criterio/output
8.0K    /mnt/dlrm/dataset_criterio/spark
[root@hpc-wifi ~]# ll /mnt/dlrm/dataset_criterio/binary_dataset/
total 20
-rw-r--r-- 1 root root 7241 Jun 14 18:01 feature_spec.yaml
drwxr-xr-x 2 root root 4096 Jun 14 18:01 test
drwxr-xr-x 2 root root 4096 Jun 14 18:01 train
drwxr-xr-x 2 root root 4096 Jun 14 18:01 validation
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/binary_dataset/*
8.0K    /mnt/dlrm/dataset_criterio/binary_dataset/feature_spec.yaml
4.0K    /mnt/dlrm/dataset_criterio/binary_dataset/test
4.0K    /mnt/dlrm/dataset_criterio/binary_dataset/train
4.0K    /mnt/dlrm/dataset_criterio/binary_dataset/validation
karanveersingh5623 commented 1 year ago

Memory issue. How can I control Spark's memory allocations? My GPUs are 2x T4 (16 GB each).
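
Would something along these lines be a reasonable direction? This is only a sketch, assuming the spark-submit flags in run_spark_gpu_DGX-2.sh can be edited and that the failure really is the RMM arena pool running out while the broadcast build table is deserialized onto the GPU (as the "Maximum pool size exceeded" trace above suggests); the exact values are not verified on a 2x T4 setup:

spark-submit \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.rapids.sql.batchSizeBytes=268435456 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.sql.shuffle.partitions=600 \
  ... (rest of the existing arguments unchanged)

The idea is that lowering concurrentGpuTasks and batchSizeBytes reduces how much device memory each executor needs at once, and setting autoBroadcastJoinThreshold to -1 avoids materializing the broadcast hash-join build table on the GPU, which is where the allocation fails in the stack trace.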

tgrel commented 1 year ago

Hi @karanveersingh5623, this looks like a CPU out-of-memory issue. Could you please post the output of the following commands so that I can verify your hardware setup?

free -mh
lscpu
nvidia-smi

Thank you, Tomasz

karanveersingh5623 commented 1 year ago

@tgrel, thanks for replying. I will start the GPU run again and send you the details above, but before that I have the output from the CPU preprocessing. I don't know why the train split ended up at 0 bytes, with no .bin files :(
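
Before re-running, I plan to check whether the Spark step actually wrote any train parquet files, since parquet_to_binary.py reported 0it for the train split while test and validation converted fine. A rough check (assuming the Spark output is laid out as per-split subdirectories under /data/dlrm/output, which may not match the real layout exactly):

ls /data/dlrm/output
find /data/dlrm/output/train -name '*.parquet' | wc -l
find /data/dlrm/output/test -name '*.parquet' | wc -l
find /data/dlrm/output/validation -name '*.parquet' | wc -l

If the train directory is empty or missing, the problem would be upstream in spark_data_utils.py rather than in parquet_to_binary.py.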

23/06/14 13:01:51 INFO TaskSetManager: Starting task 24.0 in stage 130.0 (TID 741, 172.17.0.2, executor 1, partition 24, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 10.0 in stage 130.0 (TID 727) in 20099 ms on 172.17.0.2 (executor 1) (5/30)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 25.0 in stage 130.0 (TID 742, 172.17.0.2, executor 0, partition 25, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 26.0 in stage 130.0 (TID 743, 172.17.0.2, executor 1, partition 26, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 19.0 in stage 130.0 (TID 736) in 20249 ms on 172.17.0.2 (executor 0) (6/30)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 0.0 in stage 130.0 (TID 717) in 20252 ms on 172.17.0.2 (executor 1) (7/30)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 27.0 in stage 130.0 (TID 744, 172.17.0.2, executor 1, partition 27, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 4.0 in stage 130.0 (TID 721) in 20429 ms on 172.17.0.2 (executor 1) (8/30)
23/06/14 13:01:51 INFO TaskSetManager: Starting task 28.0 in stage 130.0 (TID 745, 172.17.0.2, executor 1, partition 28, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:51 INFO TaskSetManager: Finished task 12.0 in stage 130.0 (TID 729) in 20523 ms on 172.17.0.2 (executor 1) (9/30)
23/06/14 13:01:52 INFO TaskSetManager: Starting task 29.0 in stage 130.0 (TID 746, 172.17.0.2, executor 1, partition 29, NODE_LOCAL, 7329 bytes)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 18.0 in stage 130.0 (TID 735) in 20672 ms on 172.17.0.2 (executor 1) (10/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 13.0 in stage 130.0 (TID 730) in 21198 ms on 172.17.0.2 (executor 0) (11/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 1.0 in stage 130.0 (TID 718) in 21408 ms on 172.17.0.2 (executor 0) (12/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 2.0 in stage 130.0 (TID 719) in 21459 ms on 172.17.0.2 (executor 1) (13/30)
23/06/14 13:01:52 INFO TaskSetManager: Finished task 5.0 in stage 130.0 (TID 722) in 21538 ms on 172.17.0.2 (executor 0) (14/30)
23/06/14 13:01:53 INFO TaskSetManager: Finished task 7.0 in stage 130.0 (TID 724) in 21605 ms on 172.17.0.2 (executor 0) (15/30)
23/06/14 13:01:53 INFO TaskSetManager: Finished task 11.0 in stage 130.0 (TID 728) in 21796 ms on 172.17.0.2 (executor 0) (16/30)
23/06/14 13:01:53 INFO TaskSetManager: Finished task 15.0 in stage 130.0 (TID 732) in 21931 ms on 172.17.0.2 (executor 0) (17/30)
23/06/14 13:01:54 INFO TaskSetManager: Finished task 9.0 in stage 130.0 (TID 726) in 22686 ms on 172.17.0.2 (executor 0) (18/30)
23/06/14 13:01:54 INFO TaskSetManager: Finished task 16.0 in stage 130.0 (TID 733) in 23062 ms on 172.17.0.2 (executor 1) (19/30)
23/06/14 13:01:54 INFO TaskSetManager: Finished task 3.0 in stage 130.0 (TID 720) in 23107 ms on 172.17.0.2 (executor 0) (20/30)
23/06/14 13:02:07 INFO TaskSetManager: Finished task 20.0 in stage 130.0 (TID 737) in 18266 ms on 172.17.0.2 (executor 1) (21/30)
23/06/14 13:02:08 INFO TaskSetManager: Finished task 22.0 in stage 130.0 (TID 739) in 17050 ms on 172.17.0.2 (executor 1) (22/30)
23/06/14 13:02:08 INFO TaskSetManager: Finished task 27.0 in stage 130.0 (TID 744) in 17084 ms on 172.17.0.2 (executor 1) (23/30)
23/06/14 13:02:09 INFO TaskSetManager: Finished task 23.0 in stage 130.0 (TID 740) in 18019 ms on 172.17.0.2 (executor 0) (24/30)
23/06/14 13:02:09 INFO TaskSetManager: Finished task 21.0 in stage 130.0 (TID 738) in 18957 ms on 172.17.0.2 (executor 1) (25/30)
23/06/14 13:02:09 INFO TaskSetManager: Finished task 25.0 in stage 130.0 (TID 742) in 18253 ms on 172.17.0.2 (executor 0) (26/30)
23/06/14 13:02:10 INFO TaskSetManager: Finished task 26.0 in stage 130.0 (TID 743) in 18857 ms on 172.17.0.2 (executor 1) (27/30)
23/06/14 13:02:10 INFO TaskSetManager: Finished task 28.0 in stage 130.0 (TID 745) in 18689 ms on 172.17.0.2 (executor 1) (28/30)
23/06/14 13:02:12 INFO TaskSetManager: Finished task 24.0 in stage 130.0 (TID 741) in 20546 ms on 172.17.0.2 (executor 1) (29/30)
23/06/14 13:02:12 INFO TaskSetManager: Finished task 29.0 in stage 130.0 (TID 746) in 20321 ms on 172.17.0.2 (executor 1) (30/30)
23/06/14 13:02:12 INFO TaskSchedulerImpl: Removed TaskSet 130.0, whose tasks have all completed, from pool
23/06/14 13:02:12 INFO DAGScheduler: ResultStage 130 (parquet at NativeMethodAccessorImpl.java:0) finished in 41.026 s
23/06/14 13:02:12 INFO DAGScheduler: Job 79 is finished. Cancelling potential speculative or zombie tasks for this job
23/06/14 13:02:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 130: Stage finished
23/06/14 13:02:12 INFO DAGScheduler: Job 79 finished: parquet at NativeMethodAccessorImpl.java:0, took 93.207300 s
23/06/14 13:02:12 INFO FileFormatWriter: Write Job 41f22ef5-860a-4b9a-b4d5-520892209a9d committed.
23/06/14 13:02:12 INFO FileFormatWriter: Finished processing stats for write job 41f22ef5-860a-4b9a-b4d5-520892209a9d.
====================================================================================================
{'transform': 543.4746820926666}
23/06/14 13:02:12 INFO SparkContext: Invoking stop() from shutdown hook
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_131_piece0 on c1ee6b3759ad:44033 in memory (size: 5.6 KiB, free: 16.9 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_131_piece0 on 172.17.0.2:41086 in memory (size: 5.6 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_131_piece0 on 172.17.0.2:39997 in memory (size: 5.6 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_146_piece0 on c1ee6b3759ad:44033 in memory (size: 5.6 KiB, free: 16.9 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_146_piece0 on 172.17.0.2:41086 in memory (size: 5.6 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_197_piece0 on c1ee6b3759ad:44033 in memory (size: 43.0 KiB, free: 16.9 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_197_piece0 on 172.17.0.2:41086 in memory (size: 43.0 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO BlockManagerInfo: Removed broadcast_197_piece0 on 172.17.0.2:39997 in memory (size: 43.0 KiB, free: 51.0 GiB)
23/06/14 13:02:12 INFO SparkUI: Stopped Spark web UI at http://c1ee6b3759ad:4040
23/06/14 13:02:12 INFO StandaloneSchedulerBackend: Shutting down all executors
23/06/14 13:02:12 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/06/14 13:02:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/06/14 13:02:12 INFO MemoryStore: MemoryStore cleared
23/06/14 13:02:12 INFO BlockManager: BlockManager stopped
23/06/14 13:02:12 INFO BlockManagerMaster: BlockManagerMaster stopped
23/06/14 13:02:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/06/14 13:02:12 INFO SparkContext: Successfully stopped SparkContext
23/06/14 13:02:12 INFO ShutdownHookManager: Shutdown hook called
23/06/14 13:02:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-131ba254-ca77-4d3f-9dba-9f92bc2cab71
23/06/14 13:02:12 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-679ccf26-43ae-445d-ba62-1adc004b4e3e/pyspark-2f7b78df-2566-4a6d-91bf-67b6ab5c1385
23/06/14 13:02:12 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-679ccf26-43ae-445d-ba62-1adc004b4e3e
+ preprocessing_version=Spark
+ conversion_intermediate_dir=/data/dlrm/intermediate_binary
+ final_output_dir=/data/dlrm/binary_dataset
+ source DGX-2_config.sh
++ export TOTAL_CORES=80
++ TOTAL_CORES=80
++ export NUM_EXECUTORS=16
++ NUM_EXECUTORS=16
++ export NUM_EXECUTOR_CORES=5
++ NUM_EXECUTOR_CORES=5
++ export TOTAL_MEMORY=800
++ TOTAL_MEMORY=800
++ export DRIVER_MEMORY=32
++ DRIVER_MEMORY=32
++ export EXECUTOR_MEMORY=32
++ EXECUTOR_MEMORY=32
+ '[' -d /data/dlrm/binary_dataset/train ']'
+ echo 'Performing final conversion to a custom data format'
Performing final conversion to a custom data format
+ python parquet_to_binary.py --parallel_jobs 80 --src_dir /data/dlrm/output --intermediate_dir /data/dlrm/intermediate_binary --dst_dir /data/dlrm/binary_dataset
Processing train files...
0it [00:00, ?it/s]
Train files conversion done
Processing test files...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 50963.60it/s]
Test files conversion done
Processing validation files...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 14993.94it/s]
Validation files conversion done
Concatenating train files
cat: '/data/dlrm/intermediate_binary/train/*.bin': No such file or directory
Concatenating test files
Concatenating validation files
Done
+ cp /data/dlrm/output/model_size.json /data/dlrm/binary_dataset/model_size.json
+ python split_dataset.py --dataset /data/dlrm/binary_dataset --output /data/dlrm/binary_dataset/split
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2721/2721 [00:20<00:00, 132.62it/s]
0it [00:00, ?it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2721/2721 [00:20<00:00, 130.68it/s]
+ rm /data/dlrm/binary_dataset/train_data.bin
+ rm /data/dlrm/binary_dataset/validation_data.bin
+ rm /data/dlrm/binary_dataset/test_data.bin
+ rm /data/dlrm/binary_dataset/model_size.json
+ mv /data/dlrm/binary_dataset/split/feature_spec.yaml /data/dlrm/binary_dataset/split/test /data/dlrm/binary_dataset/split/train /data/dlrm/binary_dataset/split/validation /data/dlrm/binary_dataset
+ rm -rf /data/dlrm/binary_dataset/split
+ echo 'Done preprocessing the Criteo Kaggle Dataset'
Done preprocessing the Criteo Kaggle Dataset
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/*
15G     /mnt/dlrm/dataset_criterio/binary_dataset
47G     /mnt/dlrm/dataset_criterio/day_0
48G     /mnt/dlrm/dataset_criterio/day_1
44G     /mnt/dlrm/dataset_criterio/day_10
37G     /mnt/dlrm/dataset_criterio/day_11
40G     /mnt/dlrm/dataset_criterio/day_12
46G     /mnt/dlrm/dataset_criterio/day_13
46G     /mnt/dlrm/dataset_criterio/day_14
45G     /mnt/dlrm/dataset_criterio/day_15
43G     /mnt/dlrm/dataset_criterio/day_16
39G     /mnt/dlrm/dataset_criterio/day_17
34G     /mnt/dlrm/dataset_criterio/day_18
37G     /mnt/dlrm/dataset_criterio/day_19
47G     /mnt/dlrm/dataset_criterio/day_2
46G     /mnt/dlrm/dataset_criterio/day_20
46G     /mnt/dlrm/dataset_criterio/day_21
45G     /mnt/dlrm/dataset_criterio/day_22
43G     /mnt/dlrm/dataset_criterio/day_23
43G     /mnt/dlrm/dataset_criterio/day_3
36G     /mnt/dlrm/dataset_criterio/day_4
41G     /mnt/dlrm/dataset_criterio/day_5
49G     /mnt/dlrm/dataset_criterio/day_6
48G     /mnt/dlrm/dataset_criterio/day_7
46G     /mnt/dlrm/dataset_criterio/day_8
47G     /mnt/dlrm/dataset_criterio/day_9
27G     /mnt/dlrm/dataset_criterio/intermediate_binary
20G     /mnt/dlrm/dataset_criterio/output
12K     /mnt/dlrm/dataset_criterio/spark
[root@hpc-wifi ~]# ll /mnt/dlrm/dataset_criterio/intermediate_binary/
total 12
drwxr-xr-x 2 root root 4096 Jun 14 22:02 test
drwxr-xr-x 2 root root 4096 Jun 14 22:02 train
drwxr-xr-x 2 root root 4096 Jun 14 22:03 validation
[root@hpc-wifi ~]# ll /mnt/dlrm/dataset_criterio/intermediate_binary/train/
total 0
[root@hpc-wifi ~]# du -sh /mnt/dlrm/dataset_criterio/intermediate_binary/*
14G     /mnt/dlrm/dataset_criterio/intermediate_binary/test
4.0K    /mnt/dlrm/dataset_criterio/intermediate_binary/train
14G     /mnt/dlrm/dataset_criterio/intermediate_binary/validation
karanveersingh5623 commented 1 year ago

@tgrel, here are the details you requested.

[root@hpc-wifi ~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:           251G         12G        139G         32G        100G        206G
Swap:          4.0G        1.3G        2.7G
[root@hpc-wifi ~]#
[root@hpc-wifi ~]#
[root@hpc-wifi ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                88
On-line CPU(s) list:   0-87
Thread(s) per core:    2
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2100.000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              30976K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
[root@hpc-wifi ~]#
[root@hpc-wifi ~]#
[root@hpc-wifi ~]# nvidia-smi
Sat Jun 17 18:54:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   45C    P0    62W /  70W |  14117MiB / 15360MiB |     78%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   47C    P0    35W /  70W |  14117MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     60506      C   ...-8-openjdk-amd64/bin/java    14081MiB |
|    1   N/A  N/A     60507      C   ...-8-openjdk-amd64/bin/java    14081MiB |
+-----------------------------------------------------------------------------+
karanveersingh5623 commented 1 year ago

@tgrel ..... it failed at the same point:

23/06/17 10:19:03 INFO BlockManagerInfo: Added broadcast_199_piece179 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:03 INFO BlockManagerInfo: Added broadcast_199_piece29 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:03 INFO BlockManagerInfo: Added broadcast_199_piece92 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece5 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece117 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece175 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece112 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece101 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece141 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece71 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece189 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece163 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO BlockManagerInfo: Added broadcast_199_piece103 in memory on 172.17.0.2:35676 (size: 4.0 MiB, free: 12.4 GiB)
23/06/17 10:19:04 INFO TaskSetManager: Starting task 1.2 in stage 110.0 (TID 370, 172.17.0.2, executor 5, partition 1, PROCESS_LOCAL, 7748 bytes)
23/06/17 10:19:04 WARN TaskSetManager: Lost task 7.2 in stage 110.0 (TID 360, 172.17.0.2, executor 5): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Rmm.allocInternal(Native Method)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:325)
        at ai.rapids.cudf.Rmm.alloc(Rmm.java:314)
        at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:125)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1748)
        at ai.rapids.cudf.JCudfSerialization.readTableFrom(JCudfSerialization.java:1793)
        at org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch.readObject(GpuBroadcastExchangeExec.scala:89)
        at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:142)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:140)
        at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:156)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

23/06/17 10:19:05 INFO TaskSetManager: Starting task 7.3 in stage 110.0 (TID 371, 172.17.0.2, executor 4, partition 7, PROCESS_LOCAL, 7748 bytes)
23/06/17 10:19:05 INFO TaskSetManager: Lost task 4.2 in stage 110.0 (TID 369) on 172.17.0.2, executor 4: java.lang.OutOfMemoryError (Could not allocate native memory: std::bad_alloc: RMM failure at:/usr/local/rapids/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded) [duplicate 1]
23/06/17 10:19:08 ERROR TaskSchedulerImpl: Lost executor 5 on 172.17.0.2: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 9.2 in stage 110.0 (TID 364, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 1.2 in stage 110.0 (TID 370, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 3.2 in stage 110.0 (TID 361, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 10.2 in stage 110.0 (TID 363, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 WARN TaskSetManager: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
23/06/17 10:19:08 ERROR TaskSetManager: Task 2 in stage 110.0 failed 4 times; aborting job
23/06/17 10:19:08 INFO TaskSchedulerImpl: Cancelling stage 110
23/06/17 10:19:08 INFO TaskSchedulerImpl: Killing all running tasks in stage 110: Stage cancelled
23/06/17 10:19:08 INFO TaskSchedulerImpl: Stage 110 was cancelled
23/06/17 10:19:08 INFO DAGScheduler: ResultStage 110 (collect at GpuRangePartitioner.scala:46) failed in 161.305 s due to Job aborted due to stage failure: Task 2 in stage 110.0 failed 4 times, most recent failure: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
23/06/17 10:19:08 INFO DAGScheduler: Job 84 failed: collect at GpuRangePartitioner.scala:46, took 161.321465 s
23/06/17 10:19:08 INFO DAGScheduler: Executor lost: 5 (epoch 30)
23/06/17 10:19:08 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
23/06/17 10:19:08 ERROR GpuFileFormatWriter: Aborting job 752a47d1-0f88-4475-9599-71fe0c84f8f5.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 110.0 failed 4 times, most recent failure: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
        at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
        at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
        at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
        at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)
23/06/17 10:19:08 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, 172.17.0.2, 40371, None)
23/06/17 10:19:08 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
23/06/17 10:19:08 INFO DAGScheduler: Shuffle files lost for executor: 5 (epoch 30)
Traceback (most recent call last):
  File "/workspace/dlrm/preproc/spark_data_utils.py", line 506, in <module>
    _main()
  File "/workspace/dlrm/preproc/spark_data_utils.py", line 499, in _main
    partitionBy=partitionBy)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 936, in parquet
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1186.parquet.
: org.apache.spark.SparkException: Job aborted.
        at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:250)
        at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:166)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:61)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:60)
        at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:84)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuColumnarToRowExecParent.doExecute(GpuColumnarToRowExec.scala:286)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
        at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 110.0 failed 4 times, most recent failure: Lost task 2.3 in stage 110.0 (TID 362, 172.17.0.2, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
        at com.nvidia.spark.rapids.GpuRangePartitioner.sketch(GpuRangePartitioner.scala:46)
        at com.nvidia.spark.rapids.GpuRangePartitioner.createRangeBounds(GpuRangePartitioner.scala:120)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.getPartitioner(GpuShuffleExchangeExec.scala:270)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.prepareBatchShuffleDependency(GpuShuffleExchangeExec.scala:188)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExec.scala:134)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExec.scala:125)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$doExecuteColumnar$1(GpuShuffleExchangeExec.scala:148)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
        at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExec.scala:145)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:67)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuCoalesceBatches.doExecuteColumnar(GpuCoalesceBatches.scala:620)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:94)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at com.nvidia.spark.rapids.GpuProjectExec.doExecuteColumnar(basicPhysicalOperators.scala:82)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
        at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:194)
        ... 39 more

23/06/17 10:19:08 INFO SparkContext: Invoking stop() from shutdown hook
23/06/17 10:19:08 INFO SparkUI: Stopped Spark web UI at http://c1ee6b3759ad:4040
23/06/17 10:19:08 INFO StandaloneSchedulerBackend: Shutting down all executors
23/06/17 10:19:08 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/06/17 10:19:08 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/06/17 10:19:08 INFO MemoryStore: MemoryStore cleared
23/06/17 10:19:08 INFO BlockManager: BlockManager stopped
23/06/17 10:19:08 INFO BlockManagerMaster: BlockManagerMaster stopped
23/06/17 10:19:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/06/17 10:19:08 INFO SparkContext: Successfully stopped SparkContext
23/06/17 10:19:08 INFO ShutdownHookManager: Shutdown hook called
23/06/17 10:19:08 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-530d41b5-b5e3-4cc4-8de0-94d4f44acb29
23/06/17 10:19:08 INFO ShutdownHookManager: Deleting directory /data/dlrm/spark/tmp/spark-530d41b5-b5e3-4cc4-8de0-94d4f44acb29/pyspark-757ec719-3fdf-48e3-993c-ee7630fbc06c
23/06/17 10:19:08 INFO ShutdownHookManager: Deleting directory /tmp/spark-765d0a7a-b253-4a6c-bde2-094058b42b4b
stopping org.apache.spark.deploy.master.Master
stopping org.apache.spark.deploy.worker.Worker
+ preprocessing_version=Spark
+ conversion_intermediate_dir=/data/dlrm/intermediate_binary
+ final_output_dir=/data/dlrm/binary_dataset
+ source DGX-2_config.sh
++ export TOTAL_CORES=80
++ TOTAL_CORES=80
++ export NUM_EXECUTORS=16
++ NUM_EXECUTORS=16
++ export NUM_EXECUTOR_CORES=5
++ NUM_EXECUTOR_CORES=5
++ export TOTAL_MEMORY=800
++ TOTAL_MEMORY=800
++ export DRIVER_MEMORY=32
++ DRIVER_MEMORY=32
++ export EXECUTOR_MEMORY=32
++ EXECUTOR_MEMORY=32
+ '[' -d /data/dlrm/binary_dataset/train ']'
+ echo 'Performing final conversion to a custom data format'
Performing final conversion to a custom data format
+ python parquet_to_binary.py --parallel_jobs 80 --src_dir /data/dlrm/output --intermediate_dir /data/dlrm/intermediate_binary --dst_dir /data/dlrm/binary_dataset
Processing train files...
0it [00:00, ?it/s]
Train files conversion done
Processing test files...
0it [00:00, ?it/s]
Test files conversion done
Processing validation files...
0it [00:00, ?it/s]
Validation files conversion done
Concatenating train files
cat: '/data/dlrm/intermediate_binary/train/*.bin': No such file or directory
Concatenating test files
cat: '/data/dlrm/intermediate_binary/test/*.bin': No such file or directory
Concatenating validation files
cat: '/data/dlrm/intermediate_binary/validation/*.bin': No such file or directory
Done
+ cp /data/dlrm/output/model_size.json /data/dlrm/binary_dataset/model_size.json
+ python split_dataset.py --dataset /data/dlrm/binary_dataset --output /data/dlrm/binary_dataset/split
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
+ rm /data/dlrm/binary_dataset/train_data.bin
+ rm /data/dlrm/binary_dataset/validation_data.bin
+ rm /data/dlrm/binary_dataset/test_data.bin
+ rm /data/dlrm/binary_dataset/model_size.json
+ mv /data/dlrm/binary_dataset/split/feature_spec.yaml /data/dlrm/binary_dataset/split/test /data/dlrm/binary_dataset/split/train /data/dlrm/binary_dataset/split/validation /data/dlrm/binary_dataset
+ rm -rf /data/dlrm/binary_dataset/split
+ echo 'Done preprocessing the Criteo Kaggle Dataset'
Done preprocessing the Criteo Kaggle Dataset
karanveersingh5623 commented 1 year ago

@tgrel , anything on the above?

tgrel commented 1 year ago

Hi @karanveersingh5623,

I think the problem is caused by too little memory on your system. I've never tested the Criteo 1TB preprocessing on T4 GPUs, nor on machines with less than 1TB of CPU memory. Please note that this dataset is very large and, unfortunately, the hardware requirements are quite stringent. I am going to clearly spell out those requirements in a future release.

If you're unable to get a machine with more memory and don't need to test for convergence, then I suggest using synthetic data. This should be achievable by passing --dataset_type synthetic_gpu to the main training script.
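For example, a minimal invocation might look like the following (a sketch only; the launch wrapper, GPU count and the optional --amp/--cuda_graphs flags need to match your own setup):

python -m dlrm.scripts.main --dataset_type synthetic_gpu --amp --cuda_graphs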

psharpe99 commented 1 year ago

Hi @tgrel I found this report/thread because I'm having the same Spark resource problem when trying to preprocess the DLRM TF2 example. However, as a proof of process, I've reduced my criteo/day_0...day_23 files so that they only have 1000 lines each, vastly reducing the dataset and making it pseudo-synthetic data, yet I still get the same warning: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. I was just wondering if the "stringent" hardware requirements that you mentioned above are effectively hard-coded and not based on the size of the input files? I noticed the README.md mentions 3 TB and 4 TB disk space requirements, for example, and I was hoping that reducing the file sizes would reduce the disk and memory requirements to where they would sit comfortably in my environment, but at the moment it seems that this is not the case. Apologies if I'm commenting out of place. Thanks!

tgrel commented 1 year ago

Hi @psharpe99,

You are correct. The hardware requirements are indeed specifically tuned to preprocess the full Criteo 1TB dataset. This example doesn't set out to support all possible datasets and hardware platforms automatically. Unfortunately, hand-tuning is required.

If you'd like to run with a different dataset or different hardware, you'll need to modify the hardcoded values to suit your setup. A good place to start would be to add a new config file like this one: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/preproc/DGX-A100_config.sh
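For reference, such a config only needs to export the handful of Spark sizing variables that the DGX configs set (the same ones visible in the sourced DGX-2_config.sh in the trace above). A hypothetical sketch for a smaller machine, with purely illustrative values:

#!/bin/bash
# hypothetical my_machine_config.sh -- tune these to your actual core count,
# host RAM and number of GPUs; the values below are illustrative only
export TOTAL_CORES=80
export NUM_EXECUTORS=2                                           # e.g. one executor per GPU
export NUM_EXECUTOR_CORES=$((TOTAL_CORES / NUM_EXECUTORS))
export TOTAL_MEMORY=180                                          # GB handed to Spark, leaving headroom for the OS
export DRIVER_MEMORY=32                                          # GB for the driver
export EXECUTOR_MEMORY=$(( (TOTAL_MEMORY - DRIVER_MEMORY) / NUM_EXECUTORS ))   # GB per executor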

and then to rebuild the preprocessing Docker image, passing the correct config name and other arguments: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/Dockerfile_preprocessing#L18C1-L19C1

Some more detailed hardware configuration can also be found here: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/preproc/run_spark_gpu_DGX-A100.sh#L53 or here: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/preproc/run_NVTabular.sh#L58, depending on whether you want to run with the Spark GPU plugin or with NVTabular. That said, changing those last two files might not be necessary if you don't need peak performance.

I'll try to update the READMEs to make this information more visible.

Separately, I must admit the error message you got is very misleading. I will see if inconsistencies in hardware config can be detected at runtime and reported more clearly. Thank you for bringing this to my attention.

I hope this helps.

karanveersingh5623 commented 1 year ago

Hi @tgrel , thanks :) I tried running with synthetic data and it was fine. Just need some inputs.

I have 4 x A100 (80 GB) GPUs on a single node with 256 GB of CPU memory. Can I reach around 2-3 GB/s of read I/O?

Just let me know which params to tune apart from batch_size :)

tgrel commented 1 year ago

The --dataset_type synthetic_gpu method uses synthetic data generated on the fly. It doesn't read anything from disk. If you want to stress the I/O, there's a method to generate the synthetic data and store it on disk. You can use this script for it: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/dlrm/scripts/prepare_synthetic_dataset.py

Once you've saved the data, you should remove the --dataset_type synthetic_gpu option from the training command-line and instead pass the path to the synthetic data you've saved on disk. You can achieve this with: dlrm.scripts.main --dataset <path_to_synthetic_data>
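A rough sketch of the two steps (note: the generator's output-directory flag name is an assumption on my part, so please check the script's --help for the exact spelling; multi-GPU runs would additionally go through the torch.distributed.launch / bind.sh wrapper used elsewhere in this thread):

python -m dlrm.scripts.prepare_synthetic_dataset --synthetic_dataset_dir /data/dlrm/synthetic
python -m dlrm.scripts.main --dataset /data/dlrm/synthetic --amp --cuda_graphs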

I haven't tested this example on a 4xA100-80GB node. My quick estimate is that the default settings in this script go through 1.7GB of compressed data per second with a full DGX A100-80GB. However, you could increase this by making the neural network faster, e.g., by making the top MLP smaller (see this parameter: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/dlrm/scripts/main.py#L70) or by decreasing the embedding dimension (see this parameter: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Recommendation/DLRM/dlrm/scripts/main.py#L69)
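For instance, flag overrides along these lines (a sketch only; the values are arbitrary and just meant to make the network cheaper, and the flag names follow the DEFINE_* declarations in dlrm/scripts/main.py):

python -m dlrm.scripts.main --dataset <path_to_synthetic_data> \
    --top_mlp_sizes=512,256,1 --embedding_dim=8 --amp --cuda_graphs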

Alternatively, you could write a script that only performs dataloading and nothing else.

karanveersingh5623 commented 1 year ago

@tgrel ... wow :) let me try all of these ... Thanks again

karanveersingh5623 commented 1 year ago

@tgrel , I tried your options and was able to get around 1.7 GB/s. Training was fine, but test/validation failed. I haven't changed test_batch_size. Where should I make changes to fit the test dataset in CUDA memory: should I lower test_batch_size, or change other params?

root@6fd2f5fc8fc4:/workspace/dlrm# python -m torch.distributed.launch --no_python --use_env --nproc_per_node 4           bash  -c './bind.sh python -m dlrm.scripts.main \
          --dataset /mnt/dlrm_synthetic_data/ --seed 0 --epochs 1 --amp --cuda_graphs --batch_size 3407872'
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
I0706 02:35:26.907477 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 1
I0706 02:35:26.907486 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 0
I0706 02:35:26.907539 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 2
I0706 02:35:26.907540 140737350231872 distributed_c10d.py:218] Added key: store_based_barrier_key:1 to store for rank: 3
I0706 02:35:26.907639 140737350231872 distributed_c10d.py:252] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
I0706 02:35:26.907645 140737350231872 distributed_c10d.py:252] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
I0706 02:35:26.907689 140737350231872 distributed_c10d.py:252] Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
I0706 02:35:26.907697 140737350231872 distributed_c10d.py:252] Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
DLL 2023-07-06 02:35:26.939505 - PARAMETER logtostderr : False  alsologtostderr : False  log_dir :   v : 0  verbosity : 0  logger_levels : {}  stderrthreshold : fatal  showprefixforinfo : True  run_with_pdb : False  pdb_post_mortem : False  pdb : False  run_with_profiling : False  profile_file : None  use_cprofile_for_profiling : True  only_check_args : False  mode : train  seed : 0  batch_size : 3407872  test_batch_size : 65536  lr : 24.0  epochs : 1  max_steps : None  warmup_factor : 0  warmup_steps : 8000  decay_steps : 24000  decay_start_step : 48000  decay_power : 2  decay_end_lr : 0.0  embedding_type : joint_sparse  embedding_dim : 16  top_mlp_sizes : [1024, 512, 256, 1]  bottom_mlp_sizes : [64, 32, 16]  interaction_op : cuda_dot  dataset : /mnt/dlrm_synthetic_data/  feature_spec : feature_spec.yaml  dataset_type : parametric  shuffle : False  shuffle_batch_order : False  max_table_size : None  hash_indices : False  synthetic_dataset_num_entries : 33554432  synthetic_dataset_table_sizes : ['100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000']  synthetic_dataset_numerical_features : 13  synthetic_dataset_use_feature_spec : False  load_checkpoint_path : None  save_checkpoint_path : None  log_path : ./log.json  test_freq : None  test_after : 0.0  print_freq : 200  benchmark_warmup_steps : 0  base_device : cuda  amp : True  cuda_graphs : True  inference_benchmark_batch_sizes : [1, 64, 4096]  inference_benchmark_steps : 200  auc_threshold : None  optimized_mlp : True  auc_device : GPU  backend : nccl  bottom_features_ordered : False  freeze_mlps : False  freeze_embeddings : False  Adam_embedding_optimizer : False  Adam_MLP_optimizer : False  ? : False  help : False  helpshort : False  helpfull : False  helpxml : False
/workspace/dlrm/dlrm/data/datasets.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:181.)
  return torch.from_numpy(array).to(torch.float32)
Epoch:[0/1] [200/615]  eta: 0:02:01  loss: 0.69666475  step_time: 0.291955  lr: 0.6030
Epoch:[0/1] [400/615]  eta: 0:01:04  loss: 0.69310302  step_time: 0.304464  lr: 1.2030
Epoch:[0/1] [600/615]  eta: 0:00:04  loss: 0.69303346  step_time: 0.305558  lr: 1.8030
Test: [200/32000]  step_time: 0.0029
Test: [400/32000]  step_time: 0.0031
Test: [600/32000]  step_time: 0.0025
Test: [800/32000]  step_time: 0.0028
Test: [1000/32000]  step_time: 0.0027
Test: [1200/32000]  step_time: 0.0027
Test: [1400/32000]  step_time: 0.0031
Test: [1600/32000]  step_time: 0.0029
Test: [1800/32000]  step_time: 0.0029
Test: [2000/32000]  step_time: 0.0030
Test: [2200/32000]  step_time: 0.0027
Test: [2400/32000]  step_time: 0.0027
Test: [2600/32000]  step_time: 0.0028
Test: [2800/32000]  step_time: 0.0029
Test: [3000/32000]  step_time: 0.0027
Test: [3200/32000]  step_time: 0.0029
Test: [3400/32000]  step_time: 0.0027
Test: [3600/32000]  step_time: 0.0026
.
.
.
Test: [31200/32000]  step_time: 0.0034
Test: [31400/32000]  step_time: 0.0028
Test: [31600/32000]  step_time: 0.0028
Test: [31800/32000]  step_time: 0.0029
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/dlrm/dlrm/scripts/main.py", line 842, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/workspace/dlrm/dlrm/scripts/main.py", line 683, in main
    auc, validation_loss = dist_evaluate(trainer.model, data_loader_test)
  File "/workspace/dlrm/dlrm/scripts/main.py", line 826, in dist_evaluate
    auc = utils.roc_auc_score(y_true, y_score)
  File "/workspace/dlrm/dlrm/scripts/utils.py", line 302, in roc_auc_score
    desc_score_indices = torch.argsort(y_score, descending=True)
RuntimeError: CUDA out of memory. Tried to allocate 15.62 GiB (GPU 0; 79.35 GiB total capacity; 57.42 GiB already allocated; 8.57 GiB free; 69.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1060 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1061 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1071 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1059) of binary: bash
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
              bash FAILED
=======================================
Root Cause:
[0]:
  time: 2023-07-06_02:43:46
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1059)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************
karanveersingh5623 commented 1 year ago

@tgrel , I tried the inference path; it's running fine, getting around 3 GB/s with an inference_batch_size of 32768. Check the output below.

One question: how can I increase the duration and the amount of data accessed during inference benchmarking? At least a few minutes with a 500 GB synthetic dataset, so that my DRAM cache (250 GB) is fully filled and I can see more activity from the NVMe SSDs.

root@6fd2f5fc8fc4:/workspace/dlrm# python -m dlrm.scripts.main --mode inference_benchmark --dataset /mnt/dlrm_synthetic_data/ --cuda_graphs
Not using distributed mode
DLL 2023-07-07 02:41:29.795327 - PARAMETER logtostderr : False  alsologtostderr : False  log_dir :   v : 0  verbosity : 0  logger_levels : {}  stderrthreshold : fatal  showprefixforinfo : True  run_with_pdb : False  pdb_post_mortem : False  pdb : False  run_with_profiling : False  profile_file : None  use_cprofile_for_profiling : True  only_check_args : False  mode : inference_benchmark  seed : 12345  batch_size : 65536  test_batch_size : 65536  lr : 24.0  epochs : 1  max_steps : None  warmup_factor : 0  warmup_steps : 8000  decay_steps : 24000  decay_start_step : 48000  decay_power : 2  decay_end_lr : 0.0  embedding_type : joint_sparse  embedding_dim : 16  top_mlp_sizes : [1024, 512, 256, 1]  bottom_mlp_sizes : [64, 32, 16]  interaction_op : cuda_dot  dataset : /mnt/dlrm_synthetic_data/  feature_spec : feature_spec.yaml  dataset_type : parametric  shuffle : False  shuffle_batch_order : False  max_table_size : None  hash_indices : False  synthetic_dataset_num_entries : 33554432  synthetic_dataset_table_sizes : ['100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000', '100000']  synthetic_dataset_numerical_features : 13  synthetic_dataset_use_feature_spec : False  load_checkpoint_path : None  save_checkpoint_path : None  log_path : ./log.json  test_freq : None  test_after : 0.0  print_freq : 200  benchmark_warmup_steps : 0  base_device : cuda  amp : False  cuda_graphs : True  inference_benchmark_batch_sizes : [1, 64, 4096, 8192, 16384, 32768, 32768, 32768, 32768, 32768]  inference_benchmark_steps : 200  auc_threshold : None  optimized_mlp : True  auc_device : GPU  backend : nccl  bottom_features_ordered : False  freeze_mlps : False  freeze_embeddings : False  Adam_embedding_optimizer : False  Adam_MLP_optimizer : False  ? : False  help : False  helpshort : False  helpfull : False  helpxml : False
/workspace/dlrm/dlrm/data/datasets.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:181.)
  return torch.from_numpy(array).to(torch.float32)
auc:  0.5738749504089355
auc:  0.5036291480064392
auc:  0.5011399984359741
auc:  0.5010834336280823
auc:  0.5011400580406189
auc:  0.5011354684829712
auc:  0.5011354684829712
auc:  0.5011354684829712
auc:  0.5011354684829712
auc:  0.5011354684829712
DLL 2023-07-07 02:41:58.089163 - () mean_inference_latency_batch_1 : 0.00010312290091789205 s mean_inference_throughput_batch_1 : 9697.167080241608 samples/s mean_inference_latency_batch_64 : 0.00014477624943119068 s mean_inference_throughput_batch_64 : 442061.45864013385 samples/s mean_inference_latency_batch_4096 : 0.00029346580904815835 s mean_inference_throughput_batch_4096 : 13957332.92844291 samples/s mean_inference_latency_batch_8192 : 0.000480277376025135  mean_inference_throughput_batch_8192 : 17056810.10377486  mean_inference_latency_batch_16384 : 0.000780940679979574  mean_inference_throughput_batch_16384 : 20979826.534876548  mean_inference_latency_batch_32768 : 0.0014718452673307889  mean_inference_throughput_batch_32768 : 22263209.813776966
karanveersingh5623 commented 1 year ago

@tgrel , anything you can share on the above 2 queries?

karanveersingh5623 commented 1 year ago

@tgrel

I commented out the AUC calculation in main.py; it's a way to get it to work, but not a correct solution. How can we calculate AUC using multiple GPUs? Do we need to change something in utils.py?

if is_main_process():
    y_true = torch.cat(y_true)
    y_score = torch.sigmoid(torch.cat(y_score)).float()
    auc = None
    # auc = utils.roc_auc_score(y_true, y_score)
    loss = loss_fn(y_score, y_true).item()
    print(f'test loss: {loss:.8f}', )
Test: [27800/32000]  step_time: 0.0029
Test: [28000/32000]  step_time: 0.0032
Test: [28200/32000]  step_time: 0.0028
Test: [28400/32000]  step_time: 0.0029
Test: [28600/32000]  step_time: 0.0028
Test: [28800/32000]  step_time: 0.0028
Test: [29000/32000]  step_time: 0.0030
Test: [29200/32000]  step_time: 0.0028
Test: [29400/32000]  step_time: 0.0030
Test: [29600/32000]  step_time: 0.0028
Test: [29800/32000]  step_time: 0.0029
Test: [30000/32000]  step_time: 0.0028
Test: [30200/32000]  step_time: 0.0028
Test: [30400/32000]  step_time: 0.0028
Test: [30600/32000]  step_time: 0.0028
Test: [30800/32000]  step_time: 0.0029
Test: [31000/32000]  step_time: 0.0030
Test: [31200/32000]  step_time: 0.0029
Test: [31400/32000]  step_time: 0.0028
Test: [31600/32000]  step_time: 0.0028
Test: [31800/32000]  step_time: 0.0028
test loss: 2.65082216
Finished epoch 0 in 0:07:48.
DLL 2023-07-11 07:29:00.994531 - () best_auc : 0.00000 None best_validation_loss : 1000000.00000  training_loss : 2.98322  best_epoch : 0.00  average_train_throughput : 1.25e+07 samples/s
tgrel commented 1 year ago

Hi @karanveersingh5623 , please find the answers to your questions below.

1) Regarding the out-of-memory error. It looks like you're trying to compute AUC score of an extremely long test dataset (32k batches). This is currently not supported. I don't think this is a large problem. It doesn't make sense to compute AUC on a synthetic dataset. My understanding is that this is irrelevant to your benchmarking efforts. As a side note, your train batch size is very large. I have not tested such large values and cannot guarantee they work correctly.

2) To increase the number of samples in the synthetic dataset, just change the synthetic_dataset_num_entries flag to the desired value. This will let you benchmark I/O with a filled cache. Please bear in mind that it'll take a while to generate such a large dataset.

3) I'm not sure why you are trying to compute AUC for a synthetic dataset. I think having it commented out just for your benchmarking is a useful workaround. If you'd still like to fix the OOM error I've mentioned in point 1), you'd need to write a new procedure that computes AUC iteratively and combines the results.
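For what it's worth, a minimal sketch of such an iterative procedure (an assumption on my side, not code that exists in the repo) could bucket each test batch's scores into fixed histograms and derive the ROC curve from the accumulated counts, so the whole test set is never concatenated at once:

import torch

def streaming_auc(batches, num_bins=10000, device="cuda"):
    # batches yields (y_true, y_score) pairs; scores are assumed to be in [0, 1].
    edges = torch.linspace(0.0, 1.0, num_bins + 1, device=device)
    pos_hist = torch.zeros(num_bins, device=device)
    neg_hist = torch.zeros(num_bins, device=device)
    for y_true, y_score in batches:
        y_true = y_true.to(device)
        idx = torch.bucketize(y_score.to(device).clamp(0, 1), edges[1:-1])
        pos_hist += torch.bincount(idx[y_true > 0.5], minlength=num_bins).float()
        neg_hist += torch.bincount(idx[y_true <= 0.5], minlength=num_bins).float()
    # Sweep thresholds from high scores to low and integrate TPR over FPR.
    tp = pos_hist.flip(0).cumsum(0)
    fp = neg_hist.flip(0).cumsum(0)
    tpr = torch.cat([torch.zeros(1, device=device), tp / tp[-1].clamp(min=1)])
    fpr = torch.cat([torch.zeros(1, device=device), fp / fp[-1].clamp(min=1)])
    return torch.trapz(tpr, fpr).item()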

karanveersingh5623 commented 1 year ago

@tgrel, thanks for getting back. Right, I have commented out the AUC calculation since it's a synthetic dataset. I have another query regarding model training.

Below are my top/bottom MLP sizes, batch size, and the nvidia-smi output during model training. The question: 4 x A100 (80GB) GPUs are only using about 14GB of HBM each with the parameters below, which generates about 1.7 GB/s of I/O. How can we double that without changing batch_size? Increasing the batch size beyond 32K fails with the error below. Working params:

flags.DEFINE_integer("batch_size", 3407872, "Batch size used for training")

flags.DEFINE_integer("embedding_dim", 8, "Dimensionality of embedding space for categorical features")
flags.DEFINE_list("top_mlp_sizes", [1024, 512, 256, 1], "Linear layer sizes for the top MLP")
flags.DEFINE_list("bottom_mlp_sizes", [32, 16, 8], "Linear layer sizes for the bottom MLP")
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/dlrm/dlrm/scripts/main.py", line 927, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/workspace/dlrm/dlrm/scripts/main.py", line 645, in main
    loss = trainer.train_step(numerical_features, categorical_features, click)
  File "/workspace/dlrm/dlrm/scripts/main.py", line 267, in train_step
    return self._warmup_step(*train_step_args)
  File "/workspace/dlrm/dlrm/scripts/main.py", line 252, in _warmup_step
    self.loss = self._train_step(self.model, *self.static_args)
  File "/workspace/dlrm/dlrm/scripts/main.py", line 595, in forward_backward
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7fff36f556e0> returned NULL without setting an error
GEMM wgrad failed with 13

The error message indicates an issue during the backward pass of the training step in the DLRM script. It looks like a SystemError with the GEMM (General Matrix Multiply) operation for weight gradient computation.

It could be a memory overflow, but the GPUs' memory is under-utilized. The error could be due to insufficient memory to perform the backward pass when training with large embedding tables.

[nvidia-smi screenshot showing per-GPU memory usage during training]

karanveersingh5623 commented 1 year ago

Hi @karanveersingh5623 , please find the answers to your questions below.

  1. Regarding the out-of-memory error. It looks like you're trying to compute AUC score of an extremely long test dataset (32k batches). This is currently not supported. I don't think this is a large problem. It doesn't make sense to compute AUC on a synthetic dataset. My understanding is that this is irrelevant to your benchmarking efforts. As a side note, your train batch size is very large. I have not tested such large values and cannot guarantee they work correctly.
  2. To increase the number of samples in the synthetic dataset, just change the synthetic_dataset_num_entries flag to the desired value. This will let you benchmark I/O with a filled cache. Please bear in mind that it'll take a while to generate such a large dataset.
  3. I'm not sure why you are trying to compute AUC for a synthetic dataset. I think having it commented out just for your benchmarking is a useful workaround. If you'd still like to fix the OOM error I've mentioned in point 1), you'd need to write a new procedure that computes AUC iteratively and combines the results.
  1. Regarding point No. 2: I wanted to increase the size and duration of the inference benchmark. The previously posted inference benchmark output is for a 512GB synthetic dataset, where inference finishes within a few seconds. I used the flags below for data generation, both for model training and for inference benchmarking:
flags.DEFINE_integer("synthetic_dataset_num_entries",
                     default=int(500 * 1024 * 1024 * 32 / 8),  # 1024 batches for single-GPU training by default
                     help="Number of samples per epoch for the synthetic dataset")
  2. The inference benchmark is using a single GPU. How can def parallelize(model) be applied inside def inference_benchmark_nongraphed for multi-GPU inference?
karanveersingh5623 commented 1 year ago

@tgrel, anything you can share on the above?

tgrel commented 1 year ago

Hi @karanveersingh5623,

Regarding 1) – there's a command-line argument, inference_benchmark_steps, that controls the number of steps for the inference benchmark. Increasing this value appropriately should solve the issue.

Regarding 2) – inference on a model that fits on a single GPU is a trivially parallelizable workload. You can just run 4 copies of the script in parallel, one for each GPU.
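If a single launcher is preferred over four manual invocations, a rough sketch of one could look like the following (the flag names are taken from the parameter dump earlier in this thread; the module path and working directory are assumptions on my side):

import os
import subprocess

# Assumes the current directory is /workspace/dlrm so the dlrm package is importable.
procs = []
for gpu in range(4):
    # Pin each benchmark process to one GPU via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["python", "-m", "dlrm.scripts.main",
         "--mode", "inference_benchmark",
         "--dataset", "/mnt/dlrm_synthetic_data/",
         "--inference_benchmark_steps", "200"],
        env=env))
for p in procs:
    p.wait()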

Closing the issue.