NVIDIA / spark-rapids-benchmarks

Spark RAPIDS Benchmarks – benchmark sets and utilities for the RAPIDS Accelerator for Apache Spark
Apache License 2.0
36 stars 27 forks

Need CI pipelines to validate the scripts and avoid mistakes #27

Open GaryShen2008 opened 2 years ago

wjxiz1992 commented 2 years ago

We can add a pre-merge CI job for this repo.

wjxiz1992 commented 2 years ago
Generate base data local | python3 nds_gen_data.py local 1 2 $PWD/raw_sf1 --overwrite_output
-- | --
Generate base data hdfs | python3 nds_gen_data.py hdfs 1 2 hdfs:/nds2.0_ci/raw_sf1 --overwrite_output
Generate refresh data local | python3 nds_gen_data.py local 1 2 /user/$USER/raw_refresh_sf1 --overwrite_output --update
Generate refresh data hdfs | python3 nds_gen_data.py hdfs 1 2 hdfs:/nds2.0_ci/raw_refresh_sf1 --overwrite_output --update
Convert refresh data to parquet hdfs | ./spark-submit-template convert_submits_gpu.template nds_transcode.py hdfs:/nds2.0_ci/raw_refresh_sf1 hdfs:/nds2.0_ci/parquet_refresh_sf1 report.txt --output_format parquet --output_mode overwrite --update
Convert base data to iceberg hdfs | ./spark-submit-template convert_submits_gpu.template nds_transcode.py hdfs:/nds2.0_ci/raw_sf1 hdfs:/nds2.0_ci/iceberg_sf1 report.txt --output_format iceberg --output_mode overwrite
Generate query stream | python nds_gen_query_stream.py $TPCDS_HOME/query_templates 3000 ./query_streams --streams 1
Power run | ./spark-submit-template power_run_gpu.template nds_power.py hdfs:/nds2.0_ci/iceberg_sf1 ./nds_query_streams/query_0.sql time.csv --property_file properties/aqe-on.properties --input_format iceberg --output_prefix hdfs:/nds2.0_ci/gpu_output_sf1
Data validation | python nds_validate.py hdfs:/nds2.0_ci/gpu_output_sf1 hdfs:/nds2.0_ci/cpu_output_sf1 ./nds_query_streams/query_0.sql --ignore_ordering
Data maintenance | ./spark-submit-template convert_submit_gpu_iceberg.template nds_maintenance.py hdfs:/nds2.0_ci/parquet_refresh_sf1 ./data_maintenance time.csv
Throughput run | ./nds-throughput 1,2 ./spark-submit-template power_run_gpu.template nds_power.py hdfs:/nds2.0_ci/iceberg_sf1 ./nds_query_streams/query_'{}'.sql Time_'{}'.csv --input_format iceberg --output_prefix hdfs:/nds2.0_ci/gpu_output_sf1
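
One way a pre-merge job could drive the steps listed above is a small Python runner that executes them in order and stops at the first non-zero exit code. This is only a sketch: `CI_STEPS`, `run_steps` and `main` are hypothetical names, not part of this repo. It assumes the job runs from the directory that contains the NDS scripts, and it lists just the two local data-generation steps to stay short.

```python
import os
import shlex
import subprocess

# Step names and commands copied from the list above. Only the two local
# data-generation steps are included; the remaining steps would be added
# in the same (name, command) form.
CI_STEPS = [
    ("Generate base data local",
     "python3 nds_gen_data.py local 1 2 $PWD/raw_sf1 --overwrite_output"),
    ("Generate refresh data local",
     "python3 nds_gen_data.py local 1 2 /user/$USER/raw_refresh_sf1 "
     "--overwrite_output --update"),
]


def run_steps(steps, run=subprocess.run):
    """Run each (name, command) pair in order and stop at the first failure.

    `run` is injectable so the runner itself can be tested without Spark,
    HDFS or the data generator being installed.
    Returns a list of (name, returncode) for the steps that were attempted.
    """
    results = []
    for name, command in steps:
        # Expand $PWD, $USER, ... ourselves because no shell is involved.
        argv = shlex.split(os.path.expandvars(command))
        completed = run(argv)
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            break  # later steps depend on earlier output, so do not continue
    return results


def main(steps=CI_STEPS, run=subprocess.run):
    """Print a PASS/FAIL line per attempted step; return a process exit code."""
    results = run_steps(steps, run=run)
    for name, code in results:
        print(f"{'PASS' if code == 0 else 'FAIL'}  {name}")
    all_passed = len(results) == len(steps) and all(c == 0 for _, c in results)
    return 0 if all_passed else 1
```

A CI job would call `sys.exit(main())` so that a failing step fails the build. Stopping at the first failure is a deliberate choice here, since the transcode, power run and validation steps all consume the output of earlier steps.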