NVIDIA / spark-rapids-benchmarks

Spark RAPIDS Benchmarks – benchmark sets and utilities for the RAPIDS Accelerator for Apache Spark
Apache License 2.0

[FEA] Create NDS-H benchmark for performance analysis #182

Open mattahrens opened 7 months ago

mattahrens commented 7 months ago

I would like to add another benchmark to the repository to support additional workloads for comparison. Different partners use the TPC-H benchmark for comparison, so we can enable execution of a TPC-H-like workload benchmark. The requirements are similar to what we have for NDS:

Data generation

Query generation

Power run execution

Once the initial NH scripts are set up, we can add further requirements to more closely match how we execute NDS.
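As a rough illustration of the power run requirement, here is a minimal sketch that runs each query once, in order, and records the elapsed time per query. The names `run_power_test`, `execute`, and `clock` are hypothetical and are not taken from the existing NDS scripts; in a real Spark session `execute` would presumably wrap something like `spark.sql(q).collect()`.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def run_power_test(
    queries: Iterable[Tuple[str, str]],
    execute: Callable[[str], object],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Run each (name, sql) pair once, in order, and return elapsed seconds per query."""
    timings: Dict[str, float] = {}
    for name, sql in queries:
        start = clock()
        # Hypothetical executor; in a Spark session this could be
        # lambda q: spark.sql(q).collect()
        execute(sql)
        timings[name] = clock() - start
    return timings


if __name__ == "__main__":
    # Stand-in executor so the sketch runs without Spark.
    demo_queries = [("query1", "SELECT 1"), ("query2", "SELECT 2")]
    results = run_power_test(demo_queries, execute=lambda sql: None)
    for name, seconds in results.items():
        print(f"{name}: {seconds:.6f}s")
    print(f"total: {sum(results.values()):.6f}s")
```

Passing the executor in as a callable keeps the timing loop independent of any one benchmark, which is the kind of code that could later be shared between NDS and NH.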

Relevant links of other repos that execute TPC-H workloads:

Disclaimers for TPC-H:

wjxiz1992 commented 6 months ago

Hi Matt, after some discussion with @GaryShen2008, there are several things to confirm:

  1. Do we need the latest TPC-H tool version? If we want to be able to execute the whole TPC-H benchmark as soon as possible, we can leverage https://github.com/databricks/spark-sql-perf?tab=readme-ov-file#tpc-h directly to generate TPC-H data and run TPC-H queries. But note that the TPC-H version it uses is still v2.4.0, while the latest is v3.0.1. I do see a number of patches listed in the TPC-H specification PDF, so I think the version gap is an issue. If we want to use the latest TPC-H tool, the effort will be similar to that for NDS.

  2. Code structure change. There is some NDS-specific code like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_gen_data.py#L42-L68, but also a lot of general code like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_power.py#L125-L135. Right now all of it is under the NDS folder. If we want clean code, a refactor will be necessary. But if we have a short-term goal, for example being able to run TPC-H ASAP, we can just create an NDH folder and put in existing code like https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/TPC-multi_datagen.scala along with some simple wrapper code to make it work.

These are the current gaps we see, based on previous related work.

mattahrens commented 6 months ago

  1. Yes, let's use the latest version of the TPC-H tool. I believe the other repos linked in the issue description may be using the latest version.
  2. Let's start by just bringing up the NH benchmark, and then we can refactor to share common utilities between NDS and NH.