alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Fix babelstream benchmark #2420

Open mehmetyusufoglu opened 2 weeks ago

mehmetyusufoglu commented 2 weeks ago
  1. Some of the five kernels of the BabelStream benchmark were not chained to each other. With this change, if one of them changes and produces a wrong result, the error propagates to and is caught in the final result (see the host-side check sketched below). Since we do not verify after each individual kernel run, this chaining is needed to make sure all kernels are covered by the final check.
  2. Calling different kernels with the arrays in a different order might affect performance due to caching (although this was not observed), so the same change also makes each kernel use the same arrays as in the original BabelStream from the University of Bristol (UoB) within the kernel call sequence.
  3. An optional kernel, NStream, is added. It can also be run on its own.
  4. In the original UoB code, one of the five kernels, the Triad kernel, could optionally be run on its own. This option is added as well.

    This PR is an extension of the previous PR #2299.
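
Point 1 can be illustrated with a minimal host-side check (a hypothetical sketch, not the benchmark's actual code): because the kernels are chained, the expected final values of A, B and C can be recomputed with plain scalars, so a wrong result from any kernel in the chain shows up in a single final comparison.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch only: recompute the chained sequence on the host with scalars.
// Initial values and operation order follow the sequence described in this PR.
template<typename T>
bool finalResultsMatch(T a, T b, T c, int numRuns, T eps)
{
    T goldA = T(0.1), goldB = T(0.2), goldC = T(0.0);
    T const scalar = T(0.4);
    for(int run = 0; run < numRuns; ++run)
    {
        goldC = goldA;                  // copy
        goldB = scalar * goldC;         // mult
        goldC = goldA + goldB;          // add
        goldA = goldB + scalar * goldC; // triad
    }
    // A failure in any kernel of the chain makes at least one comparison fail.
    return std::abs(a - goldA) < eps && std::abs(b - goldB) < eps && std::abs(c - goldC) < eps;
}
```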

New parameters are introduced, and the kernels are called with specific arrays in the kernel call sequence to avoid differences in cache usage:

A = 0.1, B = 0.2, C = 0.0, scalar = 0.4
C = A              // copy
B = scalar * C     // mult
C = A + B          // add
A = B + scalar * C // triad

The missing optional NStream kernel is added. The Dot kernel is only run by accelerators with multiple threads per block, because the original BabelStream uses a fixed block size of 1024, which is also the shared-memory size per block (search for "#define TBSIZE 1024" in the upstream CUDA code).
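
For illustration, here is a rough sketch of what the chained launches could look like on the host side with alpaka. The variable names (queue, workDiv*, dev*, scalar, size) and the kernel argument lists are assumptions for illustration, not the benchmark's actual signatures:

```cpp
// Illustrative only: one iteration of the chained sequence, each kernel reading
// the arrays written by its predecessor so that errors propagate to the final result.
alpaka::exec<Acc>(queue, workDivCopy,  CopyKernel{},  devA, devC, size);               // C = A
alpaka::exec<Acc>(queue, workDivMult,  MultKernel{},  devB, devC, scalar, size);       // B = scalar * C
alpaka::exec<Acc>(queue, workDivAdd,   AddKernel{},   devA, devB, devC, size);         // C = A + B
alpaka::exec<Acc>(queue, workDivTriad, TriadKernel{}, devA, devB, devC, scalar, size); // A = B + scalar * C
alpaka::wait(queue); // only the final results are copied back and checked
```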

RESULTS

TEST_RUN : ./babelstream --array-size=33554432 --number-runs=100

./babelstream --array-size=33554432 --number-runs=100
Array size set to: 33554432
Number of runs provided: 100
Randomness seeded to: 3184604301
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:100
Precision:single
DataSize(items):33554432
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.223933
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       91.371          0.0044068 0.0044819 0.0044661 402.65 
 CopyKernel      90.075          0.0029801 0.0030822 0.0030193 268.44 
 DotKernel       92.759          0.0028939 0.0029579 0.0029319 268.44 
 InitKernel      92.418          0.0043569 0.0043569 0.0043569 402.65 
 MultKernel      90.276          0.0029735 0.0030676 0.003011 268.44 
 TriadKernel     90.763          0.0044363 0.0044944 0.0044705 402.65 

Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:100
Precision:double
DataSize(items):33554432
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.570856
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       90.665          0.0088822 0.0089985 0.0089376 805.31 
 CopyKernel      89.087          0.0060264 0.0061119 0.0060773 536.87 
 DotKernel       93.055          0.0057694 0.0058486 0.0058113 536.87 
 InitKernel      84.437          0.0095374 0.0095374 0.0095374 805.31 
 MultKernel      89.35           0.0060086 0.0060852 0.0060568 536.87 
 TriadKernel     90.222          0.0089258 0.0090338 0.0089565 805.31 

===============================================================================
All tests passed (8 assertions in 2 test cases)
psychocoderHPC commented 1 week ago

@mehmetyusufoglu Can you please check if the CPU is working too?

mehmetyusufoglu commented 3 days ago

> Hi, thanks for your update. I've mostly checked for compliance with upstream BabelStream, as that was apparently a major point of discussion. But I want to first applaud you for that nice auto [i] = getIdx...; trick. Very nice indeed!
>
> It's mostly small things I've found, some of which we might actively decide to do differently, e.g., not measuring some timings of the infrastructure-ish calls. There was one section that GitHub didn't allow me to comment on, so I'll put that in here: technically speaking, the dot kernel does a slightly different thing by accumulating the threadSums in registers and only storing them once into shared memory (our version vs. the upstream version). This is likely to be optimised away by the compiler, because both versions leave full flexibility concerning memory ordering here, but we can't be sure, I believe.
>
> Also, just for my information: why is tbSum a reference in that same kernel? It very much looks like it must be dangling, but if this compiles and runs correctly, it apparently isn't?

tbSum is a reference because the function return type is -> T& and the function returns a dereferenced pointer: return *data;
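
To make that concrete, here is a tiny stand-alone sketch of the pattern (hypothetical names, not the kernel's actual identifiers): the helper holds a pointer to storage that outlives the call and returns *data as T&, so binding the result to a reference is not dangling.

```cpp
#include <cassert>

// Hypothetical sketch only: a helper that mirrors the "-> T&, return *data;" shape.
// The pointer targets storage owned elsewhere (in the kernel: block-shared memory),
// so the returned reference stays valid after the call returns.
template<typename T>
struct Slot
{
    T* data;

    auto get() const -> T&
    {
        return *data; // reference to the pointed-to storage, not to a local
    }
};

int main()
{
    double storage = 0.0;     // stands in for the longer-lived (e.g. shared) memory
    Slot<double> slot{&storage};
    auto& tbSum = slot.get(); // reference bound to storage, not dangling
    tbSum = 42.0;
    assert(storage == 42.0);
    return 0;
}
```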

mehmetyusufoglu commented 3 days ago

> Hi, thanks for your update. I've mostly checked for compliance with upstream BabelStream, as that was apparently a major point of discussion. But I want to first applaud you for that nice auto [i] = getIdx...; trick. Very nice indeed!
>
> It's mostly small things I've found, some of which we might actively decide to do differently, e.g., not measuring some timings of the infrastructure-ish calls. There was one section that GitHub didn't allow me to comment on, so I'll put that in here: technically speaking, the dot kernel does a slightly different thing by accumulating the threadSums in registers and only storing them once into shared memory (our version vs. the upstream version). This is likely to be optimised away by the compiler, because both versions leave full flexibility concerning memory ordering here, but we can't be sure, I believe.

Yes, this was the choice in the first implementation in our repo; now I use it directly like in the CUDA implementation. I am checking the performance.

chillenzer commented 3 days ago

> tbSum is a reference because the function return type is -> T& and the function returns a dereferenced pointer: return *data;

Thanks for the explanation! That makes sense.

Concerning the reduce implementation, I had an offline discussion with @psychocoderHPC: The concept to benchmark here is any implementation of a reduction based on alpaka. In that sense, we are not required to follow the reference implementation precisely. Not hammering on shared memory with every thread is probably a worthwhile change.
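
For reference, here is a hedged sketch of the register-accumulation variant discussed above (illustrative only; identifiers such as blockSize, DotKernelSketch and blockSums are assumptions, not the benchmark's actual names). Each thread accumulates its partial sum in a register and writes to shared memory a single time before the block-level tree reduction.

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

constexpr std::size_t blockSize = 1024; // mirrors "#define TBSIZE 1024" in upstream BabelStream

struct DotKernelSketch
{
    template<typename TAcc, typename T>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, T const* a, T const* b, T* blockSums, std::size_t n) const
    {
        auto& tbSum = alpaka::declareSharedVar<T[blockSize], __COUNTER__>(acc);

        auto const localIdx = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[0];
        auto const globalIdx = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        auto const gridSize = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc)[0];

        // Accumulate in a register; shared memory is written only once per thread.
        T threadSum = T(0);
        for(auto i = globalIdx; i < n; i += gridSize)
            threadSum += a[i] * b[i];
        tbSum[localIdx] = threadSum;

        // Block-level tree reduction in shared memory.
        for(std::size_t offset = blockSize / 2; offset > 0; offset /= 2)
        {
            alpaka::syncBlockThreads(acc);
            if(localIdx < offset)
                tbSum[localIdx] += tbSum[localIdx + offset];
        }
        if(localIdx == 0)
            blockSums[alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0]] = tbSum[0];
    }
};
```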

mehmetyusufoglu commented 2 days ago

> @mehmetyusufoglu Can you please check if the CPU is working too?

Randomness seeded to: 2905169299
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccCpuSerial<1,unsigned int>
NumberOfRuns:2
Precision:single
DataSize(items):1048576
DeviceName:13th Gen Intel(R) Core(TM) i7-1360P
WorkDivInit :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0107734
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB)
 AddKernel       0.026882        0.46807    0.46807    0.46807    12.583
 CopyKernel      0.019559        0.42889    0.42889    0.42889    8.3886
 InitKernel      0.029203        0.43088    0.43088    0.43088    12.583
 MultKernel      0.019739        0.42498    0.42498    0.42498    8.3886
 TriadKernel     0.025445        0.49452    0.49452    0.49452    12.583

Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:single
DataSize(items):1048576
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0135214
Kernels         Bandwidths(GB/s) MinTime(s)  MaxTime(s)  AvgTime(s)  DataUsage(MB)
 AddKernel       80.998          0.00015535  0.00015535  0.00015535  12.583
 CopyKernel      69.936          0.00011995  0.00011995  0.00011995  8.3886
 DotKernel       48.621          0.00017253  0.00017253  0.00017253  8.3886
 InitKernel      51.814          0.00024285  0.00024285  0.00024285  12.583
 MultKernel      76.158          0.00011015  0.00011015  0.00011015  8.3886
 TriadKernel     81.478          0.00015443  0.00015443  0.00015443  12.583

Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccCpuSerial<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):1048576
DeviceName:13th Gen Intel(R) Core(TM) i7-1360P
WorkDivInit :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0151765
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB)
 AddKernel       0.059712        0.42146    0.42146    0.42146    25.166
 CopyKernel      0.042238        0.39721    0.39721    0.39721    16.777
 InitKernel      0.03913         0.64314    0.64314    0.64314    25.166
 MultKernel      0.04646         0.36111    0.36111    0.36111    16.777
 TriadKernel     0.062699        0.40138    0.40138    0.40138    25.166

Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):1048576
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0173797
Kernels         Bandwidths(GB/s) MinTime(s)  MaxTime(s)  AvgTime(s)  DataUsage(MB)
 AddKernel       87.21           0.00028857  0.00028857  0.00028857  25.166
 CopyKernel      80.442          0.00020856  0.00020856  0.00020856  16.777
 DotKernel       73.262          0.000229    0.000229    0.000229    16.777
 InitKernel      85.267          0.00029514  0.00029514  0.00029514  25.166
 MultKernel      85.196          0.00019693  0.00019693  0.00019693  16.777
 TriadKernel     87.512          0.00028757  0.00028757  0.00028757  25.166

===============================================================================
All tests passed (18 assertions in 4 test cases)

mehmetyusufoglu commented 2 days ago

> > tbSum is a reference because the function return type is -> T& and the function returns a dereferenced pointer: return *data;
>
> Thanks for the explanation! That makes sense.
>
> Concerning the reduce implementation, I had an offline discussion with @psychocoderHPC: The concept to benchmark here is any implementation of a reduction based on alpaka. In that sense, we are not required to follow the reference implementation precisely. Not hammering on shared memory with every thread is probably a worthwhile change.

Ok, I reverted it back. (Yes, accessing shared memory many times from each thread is not needed in such a case.)