Add a convenient class to generate TPC-DS data

wangyum commented 3 years ago

How to use it:

build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /root/tmp/tpcds-kit/tools -s 5 -l /root/tmp/tpcds5g -f parquet"

[root@spark-3267648 spark-sql-perf]# build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help"
[info] Running com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help
[info] Usage: Gen-TPC-DS-data [options]
[info]
[info]   -m, --master <value>     the Spark master to use, default to local[*]
[info]   -d, --dsdgenDir <value>  location of dsdgen
[info]   -s, --scaleFactor <value>
[info]                            scaleFactor defines the size of the dataset to generate (in GB)
[info]   -l, --location <value>   root directory of location to create data in
[info]   -f, --format <value>     valid spark format, Parquet, ORC ...
[info]   -i, --useDoubleForDecimal <value>
[info]                            true to replace DecimalType with DoubleType
[info]   -e, --useStringForDate <value>
[info]                            true to replace DateType with StringType
[info]   -o, --overwrite <value>  overwrite the data that is already there
[info]   -p, --partitionTables <value>
[info]                            create the partitioned fact tables
[info]   -c, --clusterByPartitionColumns <value>
[info]                            shuffle to get partitions coalesced into single files
[info]   -v, --filterOutNullPartitionValues <value>
[info]                            true to filter out the partition with NULL key value
[info]   -t, --tableFilter <value>
[info]                            "" means generate all tables
[info]   -n, --numPartitions <value>
[info]                            how many dsdgen partitions to run - number of input tasks.
[info]   --help                   prints this usage text

HyukjinKwon commented 3 years ago

@wangyum, is it ready for a review?

wangyum commented 3 years ago

Yes.

HyukjinKwon commented 3 years ago

I will leave it @npoggi for the final look.

databricks / spark-sql-perf

Add a convenient class to generate TPC-DS data #196