NVIDIA / spark-rapids-benchmarks

Spark RAPIDS Benchmarks – benchmark sets and utilities for the RAPIDS Accelerator for Apache Spark
Apache License 2.0

[QST] Cannot run on GPU because GpuCSVScan only supports UTF8 encoded data #185

Closed · chasen918 closed this issue 4 months ago

chasen918 commented 4 months ago

What is your question?

After generating 100 GB of data with nds_gen_data.py as the README describes, I cannot use my GPU (Tesla T4) to convert it to Parquet. I get the warning below and then the job gets stuck.

(base) [root@localhost 13:41 /home/gpu/spark-rapids-benchmarks/nds]# ./test_transcode.sh

24/05/11 13:42:38 INFO GpuOverrides: Plan conversion to the GPU took 59.87 ms
24/05/11 13:42:38 INFO GpuOverrides: GPU plan transition optimization took 15.15 ms
24/05/11 13:42:38 INFO GpuParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
24/05/11 13:42:38 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
24/05/11 13:42:38 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
24/05/11 13:42:38 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
24/05/11 13:42:38 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
24/05/11 13:42:38 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
24/05/11 13:42:38 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
24/05/11 13:42:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 367.9 KiB, free 911.9 MiB)
24/05/11 13:42:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 37.1 KiB, free 911.9 MiB)
24/05/11 13:42:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.110.13:37311 (size: 37.1 KiB, free: 912.3 MiB)
24/05/11 13:42:38 INFO SparkContext: Created broadcast 0 from execute at GpuRowToColumnarExec.scala:896
24/05/11 13:42:38 INFO FileSourceScanExec: Planning scan with bin packing, max size: 536870912 bytes, open cost is considered as scanning 4194304 bytes.
24/05/11 13:42:38 INFO SparkContext: Starting job: runColumnar at GpuDataWritingCommandExec.scala:117
24/05/11 13:42:38 INFO DAGScheduler: Got job 0 (runColumnar at GpuDataWritingCommandExec.scala:117) with 1 output partitions
24/05/11 13:42:38 INFO DAGScheduler: Final stage: ResultStage 0 (runColumnar at GpuDataWritingCommandExec.scala:117)
24/05/11 13:42:38 INFO DAGScheduler: Parents of final stage: List()
24/05/11 13:42:38 INFO DAGScheduler: Missing parents: List()
24/05/11 13:42:38 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at runColumnar at GpuDataWritingCommandExec.scala:117), which has no missing parents
24/05/11 13:42:38 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 231.5 KiB, free 911.7 MiB)
24/05/11 13:42:38 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 85.1 KiB, free 911.6 MiB)
24/05/11 13:42:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.110.13:37311 (size: 85.1 KiB, free: 912.2 MiB)
24/05/11 13:42:38 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1535
24/05/11 13:42:38 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at runColumnar at GpuDataWritingCommandExec.scala:117) (first 15 tasks are for partitions Vector(0))
24/05/11 13:42:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0


The task keeps getting stuck here, and I don't know why there is this warning: !Exec cannot run on GPU because GpuCSVScan only supports UTF8 encoded data. I made sure the system encoding is UTF-8 with this command:

(base) [root@localhost 14:04 /home/gpu/spark-rapids-benchmarks/nds]# locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"

Can anybody help me? Thanks so much.

GaryShen2008 commented 4 months ago

Hi @chasen918, thanks for reporting this issue. The warning just means the CSV scan won't be run on the GPU; it will fall back to the CPU CSVScan plan, so it shouldn't block the Spark job from running. From the command, you used local mode to run the job but set --conf spark.executor.instances=4 for 4 executors, which means requesting 4 GPUs. How many GPUs do you have on your node?

It's probably worth removing the spark.executor.instances config and then checking whether the job runs successfully.
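Purely as an illustration, a quick way to confirm the GPU count before sizing that config (a hedged sketch; it assumes nvidia-smi from the NVIDIA driver is on PATH):

# Count the GPUs visible on this node. With one GPU requested per
# executor, spark.executor.instances must not exceed this number.
nvidia-smi --list-gpus | wc -l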

GaryShen2008 commented 4 months ago

About the encoding problem: currently we only support UTF-8 for the CSV scan on the GPU. From the code, the charset set in the reader options is probably not UTF-8.

Is it possible to check the Environment tab on the Spark event log page and search for any encoding or charset value set in your Spark configuration?
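For illustration, a rough way to search the raw event log for such settings without opening the UI (a sketch; the actual location depends on spark.eventLog.dir, and /tmp/spark-events is just the common default):

# Spark event logs are JSON text, so grep can surface any
# encoding/charset entries directly.
grep -ioE '"[^"]*(encoding|charset)[^"]*"' /tmp/spark-events/* | sort -u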

chasen918 commented 4 months ago

> About the encoding problem: currently we only support UTF-8 for the CSV scan on the GPU. From the code, the charset set in the reader options is probably not UTF-8.
>
> Is it possible to check the Environment tab on the Spark event log page and search for any encoding or charset value set in your Spark configuration?

@GaryShen2008 Thanks so much for your help 😊. Right now I haven't had a chance to dig into the code yet; I just want to run some tests with my GPU first. I tried the changes listed below to work around this issue, but they don't seem to help. Please see:

There are no encoding errors on the Spark event log page; I think the Spark log may simply not show charset/encoding information.

Then I used the command file -i to get the encoding of each data file and got these formats: iso-8859-1, us-ascii, utf-8. I haven't made any encoding settings in my CentOS 7 system; this data was generated directly with the command python nds_gen_data.py hdfs 100 100 /data/raw_sf100 --overwrite_output. Based on your description, it seems the data source must be entirely UTF-8 for the GPU scan to work, but I can't be sure 🙃.

I used the 'iconv' tool to transcode the ISO-8859-1 files to UTF-8. The US-ASCII files don't need this treatment, since US-ASCII is a subset of UTF-8. Additionally, I couldn't find any encoding-related settings in the 'nds_gen_data.py' script.
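For illustration, a hedged sketch of that detect-and-convert pass (the .dat suffix and directory layout are assumptions about the dsdgen output, so adjust the glob to the actual files):

# Detect each data file's charset and convert any ISO-8859-1 file to
# UTF-8 in place. US-ASCII and UTF-8 files are left untouched, since
# US-ASCII is already valid UTF-8.
for f in /data/raw_sf100/*/*.dat; do
  enc=$(file -b --mime-encoding "$f")
  if [ "$enc" = "iso-8859-1" ]; then
    iconv -f ISO-8859-1 -t UTF-8 "$f" -o "${f}.utf8" && mv "${f}.utf8" "$f"
  fi
done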

So, currently, I don't have a way to convert all of the datasets to UTF-8 to pass the GpuScan check. Could you please give me some assistance?

Furthermore, I only have one GPU, so I set spark.executor.instances=1, but the job still got stuck at the submission stage. Since the GPU scan wasn't supported, it should have fallen back to the CPU, so I couldn't understand why 😓. Then, by excluding the GPU settings using the configuration in 'convert_submit_cpu.template', the job ran smoothly 🤗.

Thanks Chasen

jlowe commented 4 months ago

> Then, by excluding the GPU settings using the configuration in 'convert_submit_cpu.template', the job ran smoothly 🤗.

You're running in local mode, and Spark will hang in local mode when trying to ask Spark for resources. Remove the GPU resource requests (i.e., spark.executor.resource.* and spark.task.resource.*) from the configs. The RAPIDS Accelerator should locate the one GPU you have and use it, even if Spark didn't explicitly schedule for it.
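For illustration, a trimmed local-mode SPARK_CONF along those lines (a sketch patterned on the templates shown later in this thread, not a tested configuration):

# Local-mode sketch: no executor instances and no GPU resource requests;
# the RAPIDS Accelerator discovers and uses the single GPU on its own.
export SPARK_CONF=("--master" "local[*]"
                   "--conf" "spark.driver.memory=${DRIVER_MEMORY}"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=${CONCURRENT_GPU_TASKS}"
                   "--jars" "$SPARK_RAPIDS_PLUGIN_JAR,$NDS_LISTENER_JAR")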

As for the GPU CSV scan fallback: as @GaryShen2008 mentioned, the fallback occurs because the CSV parser was passed a charset or encoding option that was not UTF-8. This happens explicitly in nds_transcode.py, where it sets the encoding to ISO-8859-1. I'm guessing that's what the TPC-DS dsdgen tool emits when it creates the CSV data. The GPU does not support ISO-8859-1, which is almost, but not quite, a subset of UTF-8. As you mentioned above, if you really want the transcode to be performed on the GPU, you can use iconv to convert from ISO-8859-1 to UTF-8 and then modify nds_transcode.py to remove the .option("encoding", "ISO-8859-1") setting when running on the converted data.
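For illustration, once the files themselves are UTF-8 that code change is a one-liner; a hedged sed sketch (the option string is quoted from the paragraph above, and whether it matches the source verbatim is an assumption, so a manual edit works just as well):

# Drop the explicit ISO-8859-1 encoding option from the CSV reader in
# nds_transcode.py so the now-UTF-8 input can be scanned on the GPU.
sed -i 's/\.option("encoding", "ISO-8859-1")//' nds_transcode.py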

As mentioned above, it's not critical that the CSV read occurs on the GPU during the data transcoding from CSV to Parquet. Once it's in Parquet, strings will be UTF-8 and the GPU will read the Parquet files just fine for all the queries. So if you're just concerned about running NDS queries on the GPU, don't worry about the CSV fallback during the one-time transcode from CSV to Parquet.

chasen918 commented 4 months ago

> > Then, by excluding the GPU settings using the configuration in 'convert_submit_cpu.template', the job ran smoothly 🤗.
>
> You're running in local mode, and Spark will hang in local mode when trying to ask Spark for resources. Remove the GPU resource requests (i.e., spark.executor.resource.* and spark.task.resource.*) from the configs. The RAPIDS Accelerator should locate the one GPU you have and use it, even if Spark didn't explicitly schedule for it.
>
> As for the GPU CSV scan fallback: as @GaryShen2008 mentioned, the fallback occurs because the CSV parser was passed a charset or encoding option that was not UTF-8. This happens explicitly in nds_transcode.py, where it sets the encoding to ISO-8859-1. I'm guessing that's what the TPC-DS dsdgen tool emits when it creates the CSV data. The GPU does not support ISO-8859-1, which is almost, but not quite, a subset of UTF-8. As you mentioned above, if you really want the transcode to be performed on the GPU, you can use iconv to convert from ISO-8859-1 to UTF-8 and then modify nds_transcode.py to remove the .option("encoding", "ISO-8859-1") setting when running on the converted data.
>
> As mentioned above, it's not critical that the CSV read occurs on the GPU during the data transcoding from CSV to Parquet. Once it's in Parquet, strings will be UTF-8 and the GPU will read the Parquet files just fine for all the queries. So if you're just concerned about running NDS queries on the GPU, don't worry about the CSV fallback during the one-time transcode from CSV to Parquet.

Thanks @jlowe,

Based on your advice, I successfully converted the data using the CPU and have been using it in subsequent tests. As you said, there shouldn't be any issues with that. However, when I tried running the power_run command on my GPU, I hit an internal error in Spark 😰. Here's the error log:

====== Creating TempView for table customer_address ======
Time taken: 13961 millis for table customer_address
====== Creating TempView for table customer_demographics ======
Time taken: 89 millis for table customer_demographics
====== Creating TempView for table date_dim ======
Time taken: 100 millis for table date_dim
====== Creating TempView for table warehouse ======
Time taken: 74 millis for table warehouse
====== Creating TempView for table ship_mode ======
Time taken: 68 millis for table ship_mode
====== Creating TempView for table time_dim ======
Time taken: 83 millis for table time_dim
====== Creating TempView for table reason ======
Time taken: 59 millis for table reason
====== Creating TempView for table income_band ======
Time taken: 60 millis for table income_band
====== Creating TempView for table item ======
Time taken: 61 millis for table item
====== Creating TempView for table store ======
Time taken: 65 millis for table store
====== Creating TempView for table call_center ======
Time taken: 63 millis for table call_center
====== Creating TempView for table customer ======
Time taken: 58 millis for table customer
====== Creating TempView for table web_site ======
Time taken: 66 millis for table web_site
====== Creating TempView for table store_returns ======
Time taken: 10495 millis for table store_returns
====== Creating TempView for table household_demographics ======
Time taken: 49 millis for table household_demographics
====== Creating TempView for table web_page ======
Time taken: 48 millis for table web_page
====== Creating TempView for table promotion ======
Time taken: 54 millis for table promotion
====== Creating TempView for table catalog_page ======
Time taken: 43 millis for table catalog_page
====== Creating TempView for table inventory ======
Time taken: 1007 millis for table inventory
====== Creating TempView for table catalog_returns ======
Time taken: 6872 millis for table catalog_returns
====== Creating TempView for table web_returns ======
Time taken: 6632 millis for table web_returns
====== Creating TempView for table web_sales ======
Time taken: 5108 millis for table web_sales
====== Creating TempView for table catalog_sales ======
Time taken: 5007 millis for table catalog_sales
====== Creating TempView for table store_sales ======
Time taken: 5015 millis for table store_sales
====== Run query96 ======
TaskFailureListener is registered.
ERROR BEGIN
An error occurred while calling o208.collectToPython.
: java.lang.OutOfMemoryError: Java heap space

File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 88, in report_on
fn(*args)
File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 132, in run_one_query
df.collect()
File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1216, in collect
sock_info = self._jdf.collectToPython()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in call
return_value = get_return_value(
^^^^^^^^^^^^^^^^^
File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
ERROR END
Time taken: [18910784] millis for query96
====== Run query7 ======
TaskFailureListener is registered.
ERROR BEGIN
An error occurred while calling o271.collectToPython.
: org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase planning failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
at org.apache.spark.SparkException$.internalError(SparkException.scala:88)
at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:516)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:528)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:175)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:168)
at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:221)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:266)
at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:235)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:112)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4204)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:4033)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.rapids.GpuShuffleEnv$.isRapidsShuffleAvailable(GpuShuffleEnv.scala:118)
at org.apache.spark.sql.rapids.GpuShuffleEnv$.useMultiThreadedShuffle(GpuShuffleEnv.scala:141)
at com.nvidia.spark.rapids.GpuPartitioning.$init$(GpuPartitioning.scala:42)
at com.nvidia.spark.rapids.GpuHashPartitioningBase.<init>(GpuHashPartitioningBase.scala:29)
at com.nvidia.spark.rapids.shims.GpuHashPartitioning.<init>(GpuHashPartitioning.scala:42)
at com.nvidia.spark.rapids.GpuOverrides$$anon$201.convertToGpu(GpuOverrides.scala:3915)
at com.nvidia.spark.rapids.GpuOverrides$$anon$201.convertToGpu(GpuOverrides.scala:3897)
at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertShuffleToGpu(GpuShuffleExchangeExecBase.scala:130)
at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:125)
at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:44)
at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1205)
at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1023)
at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
at com.nvidia.spark.rapids.shims.Spark340PlusNonDBShims$$anon$2.convertToGpu(Spark340PlusNonDBShims.scala:93)
at com.nvidia.spark.rapids.shims.Spark340PlusNonDBShims$$anon$2.convertToGpu(Spark340PlusNonDBShims.scala:61)
at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
at com.nvidia.spark.rapids.GpuOverrides$.com$nvidia$spark$rapids$GpuOverrides$$doConvertPlan(GpuOverrides.scala:4415)
at com.nvidia.spark.rapids.GpuOverrides.applyOverrides(GpuOverrides.scala:4760)
at com.nvidia.spark.rapids.GpuOverrides.$anonfun$applyWithContext$3(GpuOverrides.scala:4620)
at com.nvidia.spark.rapids.GpuOverrides$.logDuration(GpuOverrides.scala:454)
at com.nvidia.spark.rapids.GpuOverrides.$anonfun$applyWithContext$1(GpuOverrides.scala:4617)
at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4583)
at com.nvidia.spark.rapids.GpuOverrides.applyWithContext(GpuOverrides.scala:4637)
at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.$anonfun$apply$1(GpuOverrides.scala:4600)
at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4583)
at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4603)
at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4596)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:798)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:797)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$initialPlan$1(AdaptiveSparkPlanExec.scala:192)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.<init>(AdaptiveSparkPlanExec.scala:191)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:66)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:44)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:41)
at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:457)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:456)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:175)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
... 27 more

File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 88, in report_on
fn(*args)
File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 132, in run_one_query
df.collect()
File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1216, in collect
sock_info = self._jdf.collectToPython()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in call
return_value = get_return_value(
^^^^^^^^^^^^^^^^^
File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
ERROR END
Time taken: [218] millis for query7

Starting from the query where this error first occurred, every subsequent query hit the same Spark internal error and failed. I don't know how to avoid this problem 😰. Here is my task configuration:

base.template

export SPARK_MASTER=${SPARK_MASTER:-spark://192.168.110.13:7077}
export DRIVER_MEMORY=${DRIVER_MEMORY:-2G}
export EXECUTOR_CORES=${EXECUTOR_CORES:-4}
export NUM_EXECUTORS=${NUM_EXECUTORS:-1}
export EXECUTOR_MEMORY=${EXECUTOR_MEMORY:-12G}

# The NDS listener jar which is built in jvm_listener directory.
export NDS_LISTENER_JAR=${NDS_LISTENER_JAR:-./jvm_listener/target/nds-benchmark-listener-1.0-SNAPSHOT.jar}
# The spark-rapids jar which is required when running on GPU
export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_PLUGIN_JAR:-rapids-4-spark_2.12-24.02.0.jar}
export PYTHONPATH=$SPARK_HOME/python:`echo $SPARK_HOME/python/lib/py4j-*.zip`

power_run_gpu

source base.template
export CONCURRENT_GPU_TASKS=${CONCURRENT_GPU_TASKS:-2}
export SHUFFLE_PARTITIONS=${SHUFFLE_PARTITIONS:-200}

export SPARK_CONF=("--master" "${SPARK_MASTER}"
                   "--deploy-mode" "client"
                   "--conf" "spark.driver.maxResultSize=2GB"
                   "--conf" "spark.driver.memory=${DRIVER_MEMORY}"
                   "--conf" "spark.executor.cores=${EXECUTOR_CORES}"
                   "--conf" "spark.executor.instances=${NUM_EXECUTORS}"
                   "--conf" "spark.executor.memory=${EXECUTOR_MEMORY}"
                   "--conf" "spark.worker.resource.gpu.amount=1"
                   "--conf" "spark.sql.shuffle.partitions=${SHUFFLE_PARTITIONS}"
                   "--conf" "spark.sql.files.maxPartitionBytes=512b"
                   "--conf" "spark.sql.adaptive.enabled=true"
                   "--conf" "spark.executor.resource.gpu.amount=1"
                   "--conf" "spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"
                   "--conf" "spark.task.resource.gpu.amount=1"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.memory.host.spillStorageSize=1G"
                   "--conf" "spark.rapids.memory.pinnedPool.size=1g"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=${CONCURRENT_GPU_TASKS}"
                   "--conf" "spark.dynamicAllocation.enabled=false"
                   "--files" "$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh"
                   "--jars" "$SPARK_RAPIDS_PLUGIN_JAR,$NDS_LISTENER_JAR")`

command

./spark-submit-template power_run_gpu nds_power.py /mnt/parquet_sf100 query_streams/query_0.sql result/power_run/power_run_time_gpu.csv --property_file properties/aqe-on.properties

My master and worker are both on the same machine (standalone mode). I'm not sure whether that's related to the issue. Can you give me some suggestions?

GaryShen2008 commented 4 months ago

From the log, you're using Spark 3.4.2, which the 24.02.0 release doesn't support yet.

  • If you want to test on 3.4.2, you can try the latest 24.04.1 release.
  • Or you can change to 3.4.1, which is supported by 24.02.0.
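For illustration, a quick hedged check of which versions are actually in play (paths per the templates above):

# The Spark banner and the plugin jar filename both carry their versions;
# compare them against the plugin's supported-Spark list before running.
$SPARK_HOME/bin/spark-submit --version 2>&1 | head -n 8
ls rapids-4-spark_2.12-*.jar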

chasen918 commented 4 months ago

> From the log, you're using Spark 3.4.2, which the 24.02.0 release doesn't support yet.
>
>   • If you want to test on 3.4.2, you can try the latest 24.04.1 release.
>   • Or you can change to 3.4.1, which is supported by 24.02.0.

I tried switching between different versions of Spark and RAPIDS, but the tasks still couldn't run to completion and hit different errors. Here are the error messages.

Group 1: Spark 3.4.1 + RAPIDS 24.02.0

====== Creating TempView for table customer_address ======
Time taken: 9287 millis for table customer_address
====== Creating TempView for table customer_demographics ======
Time taken: 102 millis for table customer_demographics
====== Creating TempView for table date_dim ======
Time taken: 93 millis for table date_dim
====== Creating TempView for table warehouse ======
Time taken: 67 millis for table warehouse
====== Creating TempView for table ship_mode ======
Time taken: 65 millis for table ship_mode
====== Creating TempView for table time_dim ======
Time taken: 67 millis for table time_dim
====== Creating TempView for table reason ======
Time taken: 59 millis for table reason
====== Creating TempView for table income_band ======
Time taken: 58 millis for table income_band
====== Creating TempView for table item ======
Time taken: 58 millis for table item
====== Creating TempView for table store ======
Time taken: 60 millis for table store
====== Creating TempView for table call_center ======
Time taken: 64 millis for table call_center
====== Creating TempView for table customer ======
Time taken: 55 millis for table customer
====== Creating TempView for table web_site ======
Time taken: 63 millis for table web_site
====== Creating TempView for table store_returns ======
Time taken: 8517 millis for table store_returns
====== Creating TempView for table household_demographics ======
Time taken: 46 millis for table household_demographics
====== Creating TempView for table web_page ======
Time taken: 45 millis for table web_page
====== Creating TempView for table promotion ======
Time taken: 44 millis for table promotion
====== Creating TempView for table catalog_page ======
Time taken: 42 millis for table catalog_page
====== Creating TempView for table inventory ======
Time taken: 972 millis for table inventory
====== Creating TempView for table catalog_returns ======
Time taken: 6552 millis for table catalog_returns
====== Creating TempView for table web_returns ======
Time taken: 6440 millis for table web_returns
====== Creating TempView for table web_sales ======
Time taken: 4907 millis for table web_sales
====== Creating TempView for table catalog_sales ======
Time taken: 4806 millis for table catalog_sales
====== Creating TempView for table store_sales ======
Time taken: 4447 millis for table store_sales
====== Run query96 ======
Traceback (most recent call last):
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 396, in <module>
    run_query_stream(args.input_prefix,
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 260, in run_query_stream
    summary = q_report.report_on(run_one_query,spark_session,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 80, in report_on
    listener.register()
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/python_listener/PythonListener.py", line 43, in register
    self.uuid = manager.register(self)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.nvidia.spark.rapids.listener.Manager.register.
: java.lang.NoSuchMethodError: scala.collection.immutable.Map$.apply(Lscala/collection/Seq;)Lscala/collection/GenMap;
    at com.nvidia.spark.rapids.listener.Manager$.<init>(Manager.scala:25)
    at com.nvidia.spark.rapids.listener.Manager$.<clinit>(Manager.scala)
    at com.nvidia.spark.rapids.listener.Manager.register(Manager.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)

Group 2: Spark 3.4.2 + RAPIDS v24.04

====== Creating TempView for table customer_address ======
Time taken: 9514 millis for table customer_address
====== Creating TempView for table customer_demographics ======
Time taken: 90 millis for table customer_demographics
====== Creating TempView for table date_dim ======
Time taken: 85 millis for table date_dim
====== Creating TempView for table warehouse ======
Time taken: 78 millis for table warehouse
====== Creating TempView for table ship_mode ======
Time taken: 89 millis for table ship_mode
====== Creating TempView for table time_dim ======
Time taken: 64 millis for table time_dim
====== Creating TempView for table reason ======
Time taken: 59 millis for table reason
====== Creating TempView for table income_band ======
Time taken: 64 millis for table income_band
====== Creating TempView for table item ======
Time taken: 64 millis for table item
====== Creating TempView for table store ======
Time taken: 61 millis for table store
====== Creating TempView for table call_center ======
Time taken: 61 millis for table call_center
====== Creating TempView for table customer ======
Time taken: 56 millis for table customer
====== Creating TempView for table web_site ======
Time taken: 60 millis for table web_site
====== Creating TempView for table store_returns ======
Time taken: 8827 millis for table store_returns
====== Creating TempView for table household_demographics ======
Time taken: 45 millis for table household_demographics
====== Creating TempView for table web_page ======
Time taken: 48 millis for table web_page
====== Creating TempView for table promotion ======
Time taken: 46 millis for table promotion
====== Creating TempView for table catalog_page ======
Time taken: 43 millis for table catalog_page
====== Creating TempView for table inventory ======
Time taken: 1009 millis for table inventory
====== Creating TempView for table catalog_returns ======
Time taken: 6840 millis for table catalog_returns
====== Creating TempView for table web_returns ======
Time taken: 6821 millis for table web_returns
====== Creating TempView for table web_sales ======
Time taken: 5070 millis for table web_sales
====== Creating TempView for table catalog_sales ======
Time taken: 4982 millis for table catalog_sales
====== Creating TempView for table store_sales ======
Time taken: 4947 millis for table store_sales
====== Run query96 ======
TaskFailureListener is registered.
ERROR BEGIN
An error occurred while calling o208.collectToPython.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.reflect.Array.newInstance(Array.java:75)
    at scala.reflect.ClassTag$GenericClassTag.newArray(ClassTag.scala:171)
    at scala.collection.mutable.ArrayBuilder$ofRef.mkArray(ArrayBuilder.scala:74)
    at scala.collection.mutable.ArrayBuilder$ofRef.resize(ArrayBuilder.scala:79)
    at scala.collection.mutable.ArrayBuilder$ofRef.sizeHint(ArrayBuilder.scala:84)
    at scala.collection.mutable.Builder.sizeHint(Builder.scala:82)
    at scala.collection.mutable.Builder.sizeHint$(Builder.scala:79)
    at scala.collection.mutable.ArrayBuilder.sizeHint(ArrayBuilder.scala:26)
    at scala.collection.TraversableLike.builder$1(TraversableLike.scala:282)
    at scala.collection.TraversableLike.map(TraversableLike.scala:285)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.getBlockHosts(PartitionedFileUtil.scala:69)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.$anonfun$splitFiles$1(PartitionedFileUtil.scala:39)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.$anonfun$splitFiles$1$adapted(PartitionedFileUtil.scala:36)
    at org.apache.spark.sql.execution.PartitionedFileUtil$$$Lambda$3919/38916316.apply(Unknown Source)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.TraversableLike$$Lambda$64/93199773.apply(Unknown Source)
    at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:75)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.splitFiles(PartitionedFileUtil.scala:36)
    at org.apache.spark.sql.execution.rapids.shims.FilePartitionShims$.$anonfun$splitFiles$5(FilePartitionShims.scala:107)
    at org.apache.spark.sql.execution.rapids.shims.FilePartitionShims$$$Lambda$3915/154373703.apply(Unknown Source)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.TraversableLike$$Lambda$224/1676605578.apply(Unknown Source)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)

  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 88, in report_on
    fn(*args)
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 132, in run_one_query
    df.collect()
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1216, in collect
    sock_info = self._jdf.collectToPython()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
ERROR END
Time taken: [243133] millis for query96
====== Run query7 ======
TaskFailureListener is registered.
ERROR BEGIN
An error occurred while calling o271.collectToPython.
: java.lang.OutOfMemoryError: GC overhead limit exceeded

  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 88, in report_on
    fn(*args)
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 132, in run_one_query
    df.collect()
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1216, in collect
    sock_info = self._jdf.collectToPython()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
ERROR END
Time taken: [8756560] millis for query7
====== Run query75 ======
TaskFailureListener is registered.
ERROR BEGIN
An error occurred while calling o334.collectToPython.
: org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase planning failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
    at org.apache.spark.SparkException$.internalError(SparkException.scala:88)
    at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:516)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:528)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:175)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:168)
    at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:221)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:266)
    at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:235)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:112)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4204)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:4033)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.rapids.GpuShuffleEnv$.isRapidsShuffleAvailable(GpuShuffleEnv.scala:118)
    at org.apache.spark.sql.rapids.GpuShuffleEnv$.useMultiThreadedShuffle(GpuShuffleEnv.scala:141)
    at com.nvidia.spark.rapids.GpuPartitioning.$init$(GpuPartitioning.scala:42)
    at com.nvidia.spark.rapids.GpuHashPartitioningBase.<init>(GpuHashPartitioningBase.scala:29)
    at com.nvidia.spark.rapids.shims.GpuHashPartitioning.<init>(GpuHashPartitioning.scala:42)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$200.convertToGpu(GpuOverrides.scala:3929)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$200.convertToGpu(GpuOverrides.scala:3911)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertShuffleToGpu(GpuShuffleExchangeExecBase.scala:130)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:125)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:44)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:835)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.$anonfun$convertToGpu$3(GpuSortMergeJoinMeta.scala:84)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.convertToGpu(GpuSortMergeJoinMeta.scala:84)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.convertToGpu(GpuSortMergeJoinMeta.scala:24)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:54)
    at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:45)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:123)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:44)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:835)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.$anonfun$convertToGpu$3(GpuSortMergeJoinMeta.scala:84)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.convertToGpu(GpuSortMergeJoinMeta.scala:84)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.convertToGpu(GpuSortMergeJoinMeta.scala:24)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:54)
    at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:45)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$211.$anonfun$convertToGpu$30(GpuOverrides.scala:4187)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$211.convertToGpu(GpuOverrides.scala:4187)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$211.convertToGpu(GpuOverrides.scala:4185)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1205)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1023)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:123)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:44)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1205)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1023)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1205)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1023)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:123)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:44)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1205)
    at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(GpuAggregateExec.scala:1023)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$210.convertToGpu(GpuOverrides.scala:4162)
    at com.nvidia.spark.rapids.GpuOverrides$$anon$210.convertToGpu(GpuOverrides.scala:4159)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:123)
    at org.apache.spark.sql.rapids.execution.GpuShuffleMetaBase.convertToGpu(GpuShuffleExchangeExecBase.scala:44)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:835)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.$anonfun$convertToGpu$3(GpuSortMergeJoinMeta.scala:84)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.convertToGpu(GpuSortMergeJoinMeta.scala:84)
    at com.nvidia.spark.rapids.GpuSortMergeJoinMeta.convertToGpu(GpuSortMergeJoinMeta.scala:24)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:54)
    at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:45)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.shims.Spark340PlusNonDBShims$$anon$2.convertToGpu(Spark340PlusNonDBShims.scala:93)
    at com.nvidia.spark.rapids.shims.Spark340PlusNonDBShims$$anon$2.convertToGpu(Spark340PlusNonDBShims.scala:61)
    at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:838)
    at com.nvidia.spark.rapids.GpuOverrides$.com$nvidia$spark$rapids$GpuOverrides$$doConvertPlan(GpuOverrides.scala:4429)
    at com.nvidia.spark.rapids.GpuOverrides.applyOverrides(GpuOverrides.scala:4774)
    at com.nvidia.spark.rapids.GpuOverrides.$anonfun$applyWithContext$3(GpuOverrides.scala:4634)
    at com.nvidia.spark.rapids.GpuOverrides$.logDuration(GpuOverrides.scala:455)
    at com.nvidia.spark.rapids.GpuOverrides.$anonfun$applyWithContext$1(GpuOverrides.scala:4631)
    at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4597)
    at com.nvidia.spark.rapids.GpuOverrides.applyWithContext(GpuOverrides.scala:4651)
    at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.$anonfun$apply$1(GpuOverrides.scala:4614)
    at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4597)
    at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4617)
    at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4610)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:798)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:91)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:797)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$initialPlan$1(AdaptiveSparkPlanExec.scala:192)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.<init>(AdaptiveSparkPlanExec.scala:191)
    at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:66)
    at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:44)
    at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:41)
    at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:457)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:91)
    at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:456)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:175)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
    ... 27 more

  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 88, in report_on
    fn(*args)
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_power.py", line 132, in run_one_query
    df.collect()
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1216, in collect
    sock_info = self._jdf.collectToPython()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
ERROR END

In this group, the tasks ran for over two hours before erroring out and quitting. I noticed two types of errors: the java.lang.OutOfMemoryError (GC overhead limit exceeded) and the Spark [INTERNAL_ERROR] caused by a NullPointerException in GpuShuffleEnv.

I also tried the other version combinations (Spark 3.4.1 and Spark 3.4.2 with RAPIDS 24.02 and 24.04), and they all hit the errors from Group 1 or Group 2 😓. Does this relate to my parameter configuration, or is it another issue?

chasen918 commented 4 months ago

Also, I'm not sure if this belongs in this ticket, but there's a minor issue: running the example command for Data Maintenance directly fails because nds_maintenance.py doesn't support --data_format orc. And when I run the data_maintenance test, I hit this issue:

Traceback (most recent call last):
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_maintenance.py", line 313, in <module>
    register_temp_views(spark_session, args.refresh_data_path)
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_maintenance.py", line 274, in register_temp_views
    "header", "false").csv(refresh_data_path + '/' + table, schema=schema).createOrReplaceTempView(table)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 727, in csv
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/data_maintenance/s_purchase_lineitem.

I knew this folder only contains some SQL files, so this error confused me.

GaryShen2008 commented 4 months ago

I just ran a quick NDS with similar configurations locally with Spark 3.4.2 and Spark-Rapids v24.04.0.

One little problem, noted below, though I don't think it's the blocker: "--conf" "spark.sql.files.maxPartitionBytes=512b". 512 bytes? I changed it to 512m.

source base.template
export CONCURRENT_GPU_TASKS=${CONCURRENT_GPU_TASKS:-2}
export SHUFFLE_PARTITIONS=${SHUFFLE_PARTITIONS:-200}

export SPARK_CONF=("--master" "${SPARK_MASTER}"
                   "--deploy-mode" "client"
                   "--conf" "spark.driver.maxResultSize=2GB"
                   "--conf" "spark.driver.memory=${DRIVER_MEMORY}"
                   "--conf" "spark.executor.cores=${EXECUTOR_CORES}"
                   "--conf" "spark.executor.instances=${NUM_EXECUTORS}"
                   "--conf" "spark.executor.memory=${EXECUTOR_MEMORY}"
                   "--conf" "spark.sql.shuffle.partitions=${SHUFFLE_PARTITIONS}"
                   "--conf" "spark.sql.files.maxPartitionBytes=512m"
                   "--conf" "spark.sql.adaptive.enabled=true"
                   "--conf" "spark.executor.resource.gpu.amount=1"
                   "--conf" "spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh"
                   "--conf" "spark.task.resource.gpu.amount=1"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.memory.host.spillStorageSize=1G"
                   "--conf" "spark.rapids.memory.pinnedPool.size=1g"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=${CONCURRENT_GPU_TASKS}"
                   "--conf" "spark.dynamicAllocation.enabled=false"
                   "--files" "$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh"
                   "--jars" "$SPARK_RAPIDS_PLUGIN_JAR,$NDS_LISTENER_JAR")

Can you share the detailed steps you ran? Just to confirm: is your Spark downloaded from the Apache Spark page, not a customized version? And is the Spark-Rapids jar downloaded from Maven Central? (Something is odd here: your 24.02.0 contains a shim layer for Spark 3.4.2, which we didn't release.)
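
If you want to double-check which Spark shims a given plugin jar actually bundles, something like the following should work (a rough sketch; it assumes the dist jar keeps one spark<version>/ directory per supported shim at the jar root):

# list the shim directories packaged in the jar (names like spark342/)
jar tf rapids-4-spark_2.12-24.02.0.jar | grep -oE '^spark[0-9]+/' | sort -u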

chasen918 commented 4 months ago

Hi @GaryShen2008, my Spark is downloaded from https://spark.apache.org/downloads.html, and the Rapids jar is from https://nvidia.github.io/spark-rapids/docs/download.html#release-v24040.

Here are some of my test steps:

  1. generate data with nds_gen_data.py: python nds_gen_data.py local 100 100 /data/raw_sf100 --overwrite_output
  2. then, convert data to parquet
    ./spark-submit-template convert_submit_gpu.template \
    nds_transcode.py  /data/raw_sf100  parquet_sf3k report.txt
  3. generate query streams and build nds-benchmark-listener-1.0-SNAPSHOT.jar
  4. Last, I use this command below to setup power_run test
    ./spark-submit-template power_run_gpu.template \
    nds_power.py \
    parquet_sf3k \
    ./nds_query_streams/query_0.sql \
    time.csv \
    --property_file properties/aqe-on.properties

    This is my configuration: base.template

    
    export SPARK_MASTER=${SPARK_MASTER:-spark://192.168.110.13:7077}
    export DRIVER_MEMORY=${DRIVER_MEMORY:-2G}
    export EXECUTOR_CORES=${EXECUTOR_CORES:-4}
    export NUM_EXECUTORS=${NUM_EXECUTORS:-1}
    export EXECUTOR_MEMORY=${EXECUTOR_MEMORY:-12G}

# The NDS listener jar which is built in the jvm_listener directory.

export NDS_LISTENER_JAR=${NDS_LISTENER_JAR:-./jvm_listener/target/nds-benchmark-listener-1.0-SNAPSHOT.jar}

# The spark-rapids jar which is required when running on GPU.

export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_PLUGIN_JAR:-rapids-4-spark_2.12-24.04.0.jar}
export PYTHONPATH=$SPARK_HOME/python:`echo $SPARK_HOME/python/lib/py4j-*.zip`

**power_run_gpu.template**

source base.template
export CONCURRENT_GPU_TASKS=${CONCURRENT_GPU_TASKS:-2}
export SHUFFLE_PARTITIONS=${SHUFFLE_PARTITIONS:-200}

export SPARK_CONF=("--master" "${SPARK_MASTER}" "--deploy-mode" "client" "--conf" "spark.driver.maxResultSize=2GB" "--conf" "spark.driver.memory=${DRIVER_MEMORY}" "--conf" "spark.executor.cores=${EXECUTOR_CORES}" "--conf" "spark.executor.instances=${NUM_EXECUTORS}" "--conf" "spark.executor.memory=${EXECUTOR_MEMORY}" "--conf" "spark.worker.resource.gpu.amount=1" "--conf" "spark.sql.shuffle.partitions=${SHUFFLE_PARTITIONS}" "--conf" "spark.sql.files.maxPartitionBytes=512b" "--conf" "spark.sql.adaptive.enabled=true" "--conf" "spark.executor.resource.gpu.amount=1" "--conf" "spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh" "--conf" "spark.task.resource.gpu.amount=1" "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin" "--conf" "spark.rapids.memory.host.spillStorageSize=1G" "--conf" "spark.rapids.memory.pinnedPool.size=1g" "--conf" "spark.rapids.sql.concurrentGpuTasks=${CONCURRENT_GPU_TASKS}" "--conf" "spark.dynamicAllocation.enabled=false" "--files" "$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh" "--jars" "$SPARK_RAPIDS_PLUGIN_JAR,$NDS_LISTENER_JAR")

**spark-env.sh**

export SPARK_EXECUTOR_CORES=4
export SPARK_DRIVER_MEMORY=2G
export SPARK_EXECUTOR_MEMORY=16G
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MERMORY=4G  # "MERMORY" (sic) is not a name Spark reads; SPARK_WORKER_MEMORY is the expected variable
export HADOOP_CONF_DIR=/home/tcn/chasen/gpu/hadoop/etc/hadoop/
export YARN_CONF_DIR=/home/tcn/chasen/gpu/hadoop/etc/hadoop/
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"



I'll adjust some parameters to verify, and try downloading the Rapids jar from Maven Central later.
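
For reference, the same release jar should be fetchable directly from Maven Central; the URL below is assumed from the standard Maven repository layout for the com.nvidia:rapids-4-spark_2.12:24.04.0 coordinates:

wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.0/rapids-4-spark_2.12-24.04.0.jar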
chasen918 commented 4 months ago

Hi @GaryShen2008 ,

I previously ran this on a workstation and hit the issue; now I run it on a server with the same Spark configuration, and both power_run and throughput_run work normally. Emm..., I guess the problem may have been caused by insufficient resources on my workstation.

So I think the issue may be solved, thanks for your help.🤗

Also, I'm not sure if this belongs in this ticket, but there's a minor issue I found: running the example command for Data Maintenance fails directly, because nds_maintenance.py doesn't support --data_format orc... And when I run the data maintenance test, I hit this issue:

Traceback (most recent call last):
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_maintenance.py", line 313, in <module>
    register_temp_views(spark_session, args.refresh_data_path)
  File "/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/nds_maintenance.py", line 274, in register_temp_views
    "header", "false").csv(refresh_data_path + '/' + table, schema=schema).createOrReplaceTempView(table)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 727, in csv
  File "/home/tcn/chasen/gpu/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/home/tcn/chasen/gpu/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/home/tcn/chasen/gpu/spark-rapids-benchmarks/nds/data_maintenance/s_purchase_lineitem.

I know there are only some SQL files in the data_maintenance folder, not an s_purchase_lineitem directory, so I have no idea about this issue😓.

wjxiz1992 commented 4 months ago

Hi @chasen918 ,

Data Maintenance adds/removes data in the existing tables. The refresh data used to do this needs to be generated in the Data Generation step by specifying --update. Can you please check whether that data is already there?
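
For example, something along these lines should produce the refresh data (only a sketch: the update count of 1 and the output path are placeholders, so please check nds_gen_data.py --help for the exact flags):

python nds_gen_data.py local 100 100 /data/raw_sf100_update --update 1 --overwrite_output

Then point the refresh_data_path argument of nds_maintenance.py at that output directory.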

GaryShen2008 commented 4 months ago

Hi @chasen918, good to know the original issue has been solved on the new server. For data maintenance, can you try again following Allen's answer above? If there's still a problem, you can file another issue. I'll close this one if that's OK with you.

chasen918 commented 4 months ago

Hi @wjxiz1992 @GaryShen2008,

I haven't generated the new update data for data maintenance yet, so I will try this step again another time. And yes, you can close this ticket.

Thanks for everyone's help🤗.