NVIDIA / nvbench

CUDA Kernel Benchmarking Library
Apache License 2.0

Passing method with multiple template arguments to NVBENCH_BENCH macro #101

Open hibagus opened 2 years ago

hibagus commented 2 years ago

Hi there,

I would like to integrate nvbench into my C++ app. The function that runs the GPU kernel is a function template, as follows.

template<typename Gemm, typename scalePrecision, typename mulPrecision, typename accPrecision>
int gemm_cutlass_launch_int(nvbench::state& state)

Then I pass the instantiated template to NVBENCH_BENCH as follows.

NVBENCH_BENCH(gemm_cutlass_launch_int<Gemm, scalePrecision, mulPrecision, accPrecision>);
NVBENCH_MAIN_BODY(gargc_nvbench, gargv_nvbench);

It gives me the following error:

error: macro "NVBENCH_BENCH" passed 4 arguments, but takes just 1
NVBENCH_BENCH(gemm_cutlass_launch_int<Gemm, scalePrecision, mulPrecision, accPrecision>);

It seems the macro does not like the commas in the template arguments. I've read this and this, but neither of them works.

Any help would be highly appreciated.

Thanks!

jrhemstad commented 2 years ago

Hey @hibagus, thanks for your interest in NVBench and for reaching out! We'll be happy to help.

In your template:

template<typename Gemm, typename scalePrecision, typename mulPrecision, typename accPrecision>
int gemm_cutlass_launch_int(nvbench::state& state)

Are the Gemm, scalePrecision, mulPrecision, accPrecision template parameters things you hope to sweep across a variety of types using nvbench? Or will these be fixed for a particular benchmark invocation?

hibagus commented 2 years ago

Hi @jrhemstad, thanks for the reply. Currently, my implementation does not use NVBench's type sweep (i.e., it does not use NVBENCH_BENCH_TYPES), so the template parameters will be fixed for a particular benchmark invocation.

Actually, I have tried using NVBENCH_BENCH_TYPES to invoke the benchmark as follows:

template<typename Gemm, typename scalePrecision, typename mulPrecision, typename accPrecision>
int gemm_cutlass_launch_int(nvbench::state& state, nvbench::type_list<Gemm, scalePrecision, mulPrecision, accPrecision>)

Then:

using gemm_types = nvbench::type_list<Gemm>;
using scalar_types = nvbench::type_list<scalePrecision>;
using multiply_types = nvbench::type_list<mulPrecision>;
using accumulation_types = nvbench::type_list<accPrecision>;

NVBENCH_BENCH_TYPES(gemm_cutlass_launch_int, NVBENCH_TYPE_AXES(gemm_types, scalar_types, multiply_types, accumulation_types));
NVBENCH_MAIN_BODY(gargc_nvbench, gargv_nvbench);

I am not sure whether I did it correctly. Compilation fails with the following error: error: a template declaration is not allowed here

jrhemstad commented 2 years ago

Currently, my implementation does not use NVBench's type sweep (i.e., it does not use NVBENCH_BENCH_TYPES), so the template parameters will be fixed for a particular benchmark invocation.

Okay, that makes sense. Then yeah, what you originally had won't work simply because of how the C/C++ preprocessor handles macros: it treats every top-level comma as an argument separator, so the commas in your template argument list split the expression into multiple macro arguments.

The good news is that there's an easy workaround. You can just wrap your template instantiation in an extra set of parentheses:

NVBENCH_BENCH( (gemm_cutlass_launch_int<Gemm, scalePrecision, mulPrecision, accPrecision>) );

Example: https://godbolt.org/z/xzocY9Ex3
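
For what it's worth, the comma behavior isn't nvbench-specific; the preprocessor splits any macro's argument list on top-level commas, and an extra pair of parentheses hides those commas from it. A minimal standalone sketch (the TAKES_ONE macro and foo are made up purely for illustration and are not part of nvbench):

// The preprocessor sees the top-level commas in a template argument list
// as macro argument separators.
template <typename A, typename B>
int foo() { return 0; }

#define TAKES_ONE(x) int result = x

// TAKES_ONE(foo<int, float>());  // error: macro "TAKES_ONE" passed 2 arguments, but takes just 1
TAKES_ONE((foo<int, float>()));   // OK: the outer parentheses group it into a single macro argument

int main() { return result; }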

hibagus commented 2 years ago

Hi @jrhemstad

I have tried that before, enclosing the instantiation in an extra set of parentheses, but it still gives me an error, as follows:

In file included from /home/bagus/CUDA_Bench/libs/nvbench/include/nvbench/nvbench.cuh:24,
                 from /home/bagus/CUDA_Bench/include/CUDA_Bench/gemm/gemm_cutlass_launch_int.cuh:12,
                 from /home/bagus/CUDA_Bench/src/gemm/gemm_cutlass_launch_int.cu:1:
/home/bagus/CUDA_Bench/src/gemm/gemm_cutlass_launch_int.cu:324:110: error: pasting ")" and "_line_" does not give a valid preprocessing token
  324 |         NVBENCH_BENCH( (gemm_cutlass_launch_int<gemm_types, scalar_types, multiply_types, accumulation_types>) );
      |                                                                                                              ^
/home/bagus/CUDA_Bench/libs/nvbench/include/nvbench/callable.cuh:58:60: note: in definition of macro ‘NVBENCH_UNIQUE_IDENTIFIER_IMPL2’
   58 | #define NVBENCH_UNIQUE_IDENTIFIER_IMPL2(prefix, unique_id) prefix##_line_##unique_id
      |                                                            ^~~~~~
/home/bagus/CUDA_Bench/libs/nvbench/include/nvbench/callable.cuh:55:43: note: in expansion of macro ‘NVBENCH_UNIQUE_IDENTIFIER_IMPL1’
   55 | #define NVBENCH_UNIQUE_IDENTIFIER(prefix) NVBENCH_UNIQUE_IDENTIFIER_IMPL1(prefix, __LINE__)
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/bagus/CUDA_Bench/libs/nvbench/include/nvbench/callable.cuh:34:37: note: in expansion of macro ‘NVBENCH_UNIQUE_IDENTIFIER’
   34 |   NVBENCH_DEFINE_CALLABLE(function, NVBENCH_UNIQUE_IDENTIFIER(function))
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~
/home/bagus/CUDA_Bench/libs/nvbench/include/nvbench/create.cuh:31:3: note: in expansion of macro ‘NVBENCH_DEFINE_UNIQUE_CALLABLE’
   31 |   NVBENCH_DEFINE_UNIQUE_CALLABLE(KernelGenerator);                                                 \
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/bagus/CUDA_Bench/src/gemm/gemm_cutlass_launch_int.cu:324:9: note: in expansion of macro ‘NVBENCH_BENCH’
  324 |         NVBENCH_BENCH( (gemm_cutlass_launch_int<gemm_types, scalar_types, multiply_types, accumulation_types>) );

Any suggestions?

jrhemstad commented 2 years ago

Ah, this looks to be unique to some of nvbench's internal macro shenanigans underlying NVBENCH_BENCH: the macro token-pastes its argument with __LINE__ to build a unique identifier (the prefix##_line_##unique_id in callable.cuh shown in your log), and pasting the closing ")" from the extra parentheses does not produce a valid token.

I don't think using NVBENCH_TYPE_AXES when you don't intend to sweep over those parameters is going to be the right solution. That said, I don't know why the example you showed doesn't work. It seems like it should.

One approach that likely isn't very satisfying is to wrap the call to your template instantiation in another function that isn't a template:

int do_benchmark(nvbench::state& state){
   return gemm_cutlass_launch_int<Gemm, scalePrecision, mulPrecision, accPrecision>(state);
}

NVBENCH_BENCH(do_benchmark);

I'd have to defer to @allisonvacanti for a more clever solution than this.

hibagus commented 2 years ago

Thanks, @jrhemstad. I'll use that workaround for now, although it is not that convenient :)

alliepiper commented 2 years ago

Using single-element typelists should work. Can you share the full test case? It sounds like something odd is going on:

NVBENCH_BENCH_TYPES(gemm_cutlass_launch_int, NVBENCH_TYPE_AXES(gemm_types, scalar_types, multiply_types, accumulation_types));
NVBENCH_MAIN_BODY(gargc_nvbench, gargv_nvbench);

These two macros shouldn't be used from the same scope. NVBENCH_BENCH_TYPES should be used from global scope, while NVBENCH_MAIN_BODY should be used from function scope. Maybe you wanted NVBENCH_MAIN instead?
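
For reference, here is roughly what that global-scope pattern looks like with single-element type lists. This is only a sketch based on the snippets above: MyGemm and the float precisions are placeholders, the benchmark body is stubbed out, and it assumes the usual nvbench convention where the benchmark returns void and takes the type_list as its second parameter (which header provides NVBENCH_MAIN may depend on your nvbench version).

#include <nvbench/nvbench.cuh>

struct MyGemm {}; // placeholder for the real CUTLASS GEMM configuration

template <typename Gemm, typename scalePrecision, typename mulPrecision, typename accPrecision>
void gemm_cutlass_launch_int(nvbench::state& state,
                             nvbench::type_list<Gemm, scalePrecision, mulPrecision, accPrecision>)
{
  // Launch the kernel inside state.exec(); stubbed out here.
  state.exec([](nvbench::launch&) { /* run the GEMM kernel */ });
}

// Single-element type lists: no real sweep, but NVBENCH_BENCH_TYPES is satisfied.
using gemm_types         = nvbench::type_list<MyGemm>;
using scalar_types       = nvbench::type_list<float>;
using multiply_types     = nvbench::type_list<float>;
using accumulation_types = nvbench::type_list<float>;

// Both of these live at global scope, outside of any function:
NVBENCH_BENCH_TYPES(gemm_cutlass_launch_int,
                    NVBENCH_TYPE_AXES(gemm_types, scalar_types,
                                      multiply_types, accumulation_types));

// NVBENCH_MAIN generates a main() that parses the command line and runs the
// registered benchmarks.
NVBENCH_MAIN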

hibagus commented 2 years ago

Hi @allisonvacanti

Our project is accessible on GitHub. This is how we plan to integrate nvbench into our project. When I replace NVBENCH_BENCH with NVBENCH_BENCH_TYPES, I get the following error message: error: a template declaration is not allowed here

That's why I use NVBENCH_MAIN_BODY alongside NVBENCH_BENCH, since I would like to use it in function scope.

alliepiper commented 2 years ago

Ah, ok. That's not how these macros are intended to be used -- I'm honestly surprised that this pattern works with NVBENCH_BENCH :-)

Take a look through the examples. The NVBENCH_BENCH* macros should be used at global scope; defining benchmarks inside a function is not supported.

You can restrict the benchmarks that are executed at runtime by configuring argc and argv with the relevant -b and -a options and then calling NVBENCH_MAIN_BODY(argc, argv).
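
A rough sketch of what such a hand-written main() could look like, building argc/argv in code. The program name and the benchmark name passed to -b are placeholders taken from this thread, and the exact argv type NVBENCH_MAIN_BODY accepts may vary with the nvbench version:

#include <nvbench/nvbench.cuh>

#include <vector>

int main()
{
  // Only run the benchmark(s) we care about for this invocation.
  std::vector<const char*> args = {"cuda_bench",
                                   "-b", "gemm_cutlass_launch_int"};

  int argc = static_cast<int>(args.size());
  const char** argv = args.data();

  NVBENCH_MAIN_BODY(argc, argv);
  return 0;
}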