Open mattahrens opened 9 months ago
Hi Matt, after some discussion with @GaryShen2008, there are several things to confirm:
1. TPC-H tool version

Do we need the latest TPC-H tool version? If we want to be able to execute the whole TPC-H benchmark as soon as possible, we can leverage https://github.com/databricks/spark-sql-perf?tab=readme-ov-file#tpc-h directly to generate TPC-H data and run the TPC-H queries. But note that the TPC-H version it uses is still v2.4.0, while the latest is v3.0.1. I do see a number of patches in the TPC-H specification PDF between those versions, so I think the gap is an issue. If we want to use the latest TPC-H tool, the effort will be similar to the one for NDS.
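Regardless of toolkit version, query generation boils down to what qgen does: substituting randomly chosen parameters into query templates. A minimal Python sketch of that idea is below; the template and the `:N` placeholder style mirror qgen's conventions, but the exact parameter ranges shown (taken from the Q6 description in the spec) should be re-checked against whichever spec version we settle on.

```python
import random

# Hypothetical Q6-style template; real templates live in the TPC-H
# toolkit's dbgen/queries directory and use the same :N placeholders.
Q6_TEMPLATE = """
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date ':1'
  AND l_shipdate < date ':1' + interval '1' year
  AND l_discount BETWEEN :2 - 0.01 AND :2 + 0.01
  AND l_quantity < :3
"""

def substitute(template: str, params: dict) -> str:
    """Replace :N placeholders with concrete values, longest key first
    so that ':12' is not clobbered by a ':1' replacement."""
    out = template
    for key in sorted(params, key=len, reverse=True):
        out = out.replace(key, str(params[key]))
    return out

def q6_params(rng: random.Random) -> dict:
    # Ranges per the Q6 description: DATE is Jan 1 of 1993..1997,
    # DISCOUNT in [0.02, 0.09], QUANTITY is 24 or 25.
    return {
        ":1": f"199{rng.randint(3, 7)}-01-01",
        ":2": round(rng.uniform(0.02, 0.09), 2),
        ":3": rng.randint(24, 25),
    }

if __name__ == "__main__":
    rng = random.Random(0)
    print(substitute(Q6_TEMPLATE, q6_params(rng)))
```

This is roughly the shape the NDS query-generation wrapper takes as well, just with TPC-DS templates instead.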
2. Code structure change
There is some NDS-specific code like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_gen_data.py#L42-L68, but also a lot of general code like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_power.py#L125-L135. Right now all of it lives under the NDS folder. If we want a clean code base, a refactor will be necessary. But if we target the short-term goal of being able to run TPC-H ASAP, we can just create an NDH folder and drop in existing code like https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/TPC-multi_datagen.scala along with some simple wrapper code to make it work.
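For the wrapper-code route, the data-generation piece could be a thin Python shim around the toolkit's dbgen binary, the same way the NDS scripts wrap dsdgen. A minimal sketch, assuming the standard dbgen flags (`-s` scale factor, `-f` force overwrite, `-C`/`-S` for parallel children); the function names and layout here are hypothetical, not existing code:

```python
import os
import subprocess
from typing import List

def dbgen_cmd(dbgen_path: str, scale: int, chunks: int, chunk: int) -> List[str]:
    """Build one dbgen invocation. With chunks > 1, dbgen generates
    only chunk `-S` out of `-C` total, enabling parallel generation."""
    cmd = [dbgen_path, "-s", str(scale), "-f"]
    if chunks > 1:
        cmd += ["-C", str(chunks), "-S", str(chunk)]
    return cmd

def generate(dbgen_dir: str, scale: int, chunks: int = 1) -> None:
    """Run dbgen once per chunk; dbgen writes .tbl files into its cwd,
    which must contain the dists.dss distribution file."""
    exe = os.path.join(dbgen_dir, "dbgen")
    for chunk in range(1, chunks + 1):
        subprocess.run(dbgen_cmd(exe, scale, chunks, chunk),
                       cwd=dbgen_dir, check=True)
```

A fuller version would launch the per-chunk invocations from Spark executors and write straight to HDFS, as nds_gen_data.py does for dsdgen.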
These are the current gaps we see, based on the previous related work.
I would like to add another benchmark to the repository to support additional workloads for comparison. The TPC-H benchmark is used by different partners for comparison, so we should enable the execution of a TPC-H-like workload benchmark. The requirements are similar to what we have for NDS:
Data generation
Query generation
Power run execution
We can add additional requirements once the initial NDH scripts are set up, to more closely match how we execute NDS.
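Of the three requirements above, the power run is the simplest shape: execute each query once, in order, and record per-query wall-clock time. A minimal engine-agnostic sketch of that harness (the function names are hypothetical; in the real script `execute` would be something like `lambda q: spark.sql(q).collect()`, as in nds_power.py):

```python
import time
from typing import Callable, Dict, List, Tuple

def power_run(queries: List[Tuple[str, str]],
              execute: Callable[[str], object]) -> Dict[str, float]:
    """Run each (name, sql) pair once, sequentially, and record
    wall-clock seconds per query -- the shape of a TPC-H power run."""
    timings: Dict[str, float] = {}
    for name, sql in queries:
        start = time.perf_counter()
        execute(sql)
        timings[name] = time.perf_counter() - start
    return timings

def report(timings: Dict[str, float]) -> str:
    """Render per-query timings plus a total, one line each."""
    lines = [f"{name}: {secs:.3f}s" for name, secs in timings.items()]
    lines.append(f"total: {sum(timings.values()):.3f}s")
    return "\n".join(lines)
```

The extra pieces NDS has on top of this (CSV report output, Spark listener metrics, property files per run) could be layered on later, matching the "additional requirements" note above.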
Relevant links of other repos that execute TPC-H workloads:
Disclaimers for TPC-H: