[FEA] Create NDS-H benchmark for performance analysis

mattahrens commented 9 months ago

I would like to add another benchmark to the repository to support additional workloads for comparison. Different partners use the TPC-H benchmark for comparison, so we should enable execution of a TPC-H-like workload benchmark. The requirements are similar to what we have for NDS:

Data generation

[x] P0: Support generation of raw data at various scale factors
[x] P0: Support conversion of raw data to Parquet
[ ] P1: Support conversion of raw data to ORC
[ ] P1: Support conversion of raw data to CSV
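
For the CSV item, a minimal sketch of the conversion step, assuming the raw data is pipe-delimited text with a trailing delimiter on each row (the layout the TPC-H dbgen tool typically emits). The function name `tbl_to_csv` is hypothetical and not part of the existing NDS scripts; a real implementation would more likely go through Spark, as the NDS transcode step does.

```python
import csv
import io


def tbl_to_csv(tbl_lines, out):
    """Write pipe-delimited raw rows to `out` as CSV and return the row count.

    Assumption: every raw row ends with one trailing '|' that is not a field.
    """
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in tbl_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.endswith("|"):
            line = line[:-1]  # drop only the single trailing delimiter
        writer.writerow(line.split("|"))
        count += 1
    return count


if __name__ == "__main__":
    buf = io.StringIO()
    tbl_to_csv(["0|AFRICA|some, comment|\n"], buf)
    print(buf.getvalue(), end="")
```

Fields that contain commas are quoted by the `csv` module, so free-text comment columns survive the conversion.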

Query generation

[x] P0: Support generation of queries at various scale factors

Power run execution

[x] P0: Support execution of full query set given a specified input path
[x] P1: Support execution of individual query given a specific query and input path
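
The two power run items could share one loop. This is a rough sketch, not the existing `nds_power.py` logic: `run_power`, its parameters, and the `run_query` callable are all hypothetical names chosen for illustration.

```python
import time


def run_power(queries, run_query, only=None):
    """Run every query (or just `only`) and return elapsed seconds per query.

    queries:   dict mapping a query name to its SQL text (assumed layout)
    run_query: callable that executes one SQL string (assumed interface)
    only:      optional query name, to run a single query
    """
    selected = {only: queries[only]} if only else queries
    timings = {}
    for name, sql in selected.items():
        start = time.perf_counter()
        run_query(sql)
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    demo = {"query1": "select 1", "query2": "select 2"}
    print(run_power(demo, run_query=lambda sql: None, only="query2"))
```

With a Spark session available, `run_query` could be something like `lambda sql: spark.sql(sql).collect()`, with the tables registered from the specified input path beforehand.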

We can add additional requirements once the initial NH scripts are set up to more closely match how we execute NDS.

Relevant links of other repos that execute TPC-H workloads:

Disclaimers for TPC-H:

TPC-H is Copyright © 1993-2024 Transaction Processing Performance Council. The full TPC-H specification in PDF format can be found here
TPC, TPC Benchmark, and TPC-H are trademarks of the Transaction Processing Performance Council.

wjxiz1992 commented 8 months ago

Hi Matt, after some discussion with @GaryShen2008, there are several things to confirm:

Do we need the latest TPC-H tool version? If we want to be able to execute the whole TPC-H benchmark as soon as possible, we can leverage https://github.com/databricks/spark-sql-perf?tab=readme-ov-file#tpc-h directly to generate TPC-H data and run TPC-H queries. But note that the TPC-H version it uses is still v2.4.0, while the latest is v3.0.1. I do see a number of patches listed in the TPC-H specification PDF, so I think the version gap is an issue. If we want to use the latest TPC-H tool, the effort will be similar to the effort for NDS.
Code structure change: There is some NDS-specific code, like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_gen_data.py#L42-L68, but also a lot of general code, like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_power.py#L125-L135. Right now all of it is under the NDS folder. If we want clean code, a refactor will be necessary. But if we have a short-term goal, for example being able to run TPC-H ASAP, we can just create an NDH folder and put in existing code like https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/TPC-multi_datagen.scala along with some simple wrapper code to make it work.

These are the current gaps we see according to previous related work.

mattahrens commented 8 months ago

Yes, let's use the latest version of the TPC-H tool. I believe the other repo links I provided in the issue description may be using the latest version.
Let's start by just bringing up the NH benchmark, and then we can refactor to share common utilities between NDS and NH.

NVIDIA / spark-rapids-benchmarks

[FEA] Create NDS-H benchmark for performance analysis #182