mehmetyusufoglu opened 2 weeks ago
Hi, thanks for your update. I've mostly checked for compliance with upstream babelstream, as that was apparently a major point of discussion. But first I want to applaud you for that nice `auto [i] = getIdx...;` trick. Very nice indeed!

It's mostly small things I've found, some of which we might actively decide to do differently, e.g., not measuring some timings of the infrastructure-ish calls. There was one section that GH didn't allow me to comment on, so I'll put it here: technically speaking, the dot kernel does a slightly different thing than upstream by accumulating the `threadSum`s in registers and only storing them once into shared memory (our version vs. the upstream version). This is likely to be optimised away by the compiler, because both versions leave full flexibility concerning memory ordering here, but we can't be sure, I believe.

Also, just for my information: why is `tbSum` a reference in that same kernel? It very much looks like it must be dangling, but if this compiles and runs correctly, it apparently isn't?
`tbSum` is a reference because the function return type is `-> T&` and it returns a dereferenced pointer: `return *data;`.
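For illustration only, here is a minimal CUDA sketch (hypothetical names, not the PR's actual alpaka code, where the storage would come from the accelerator's shared-memory API) of why such a reference need not dangle: the function returns a reference to block-shared storage, which outlives the call, rather than to a function-local variable.

```cuda
// Hypothetical sketch, not the PR's code: returning T& is safe here because
// the referenced storage is block-shared memory, which lives for the whole
// lifetime of the block, not a function-local variable.
template<typename T>
__device__ auto blockSharedSum() -> T&
{
    __shared__ T data[1]; // block-shared storage; outlives this function call
    return *data;         // the "return *data;" pattern from the discussion
}

__global__ void useIt(float* out)
{
    // tbSum binds to shared memory, so the reference is not dangling.
    auto& tbSum = blockSharedSum<float>();
    if (threadIdx.x == 0)
        tbSum = 0.0f;
    __syncthreads();
    // ... all threads of the block may now read and update tbSum ...
    if (threadIdx.x == 0)
        out[blockIdx.x] = tbSum;
}
```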
> Technically speaking, the dot kernel does a slightly different thing by accumulating the `threadSum`s in registers and only storing them once into shared memory. [...] Likely to be optimised away by the compiler because both versions leave full flexibility concerning memory ordering here, but we can't be sure, I believe.
Yes, this was the choice in the first implementation in our repo; now I use shared memory directly, like in the CUDA implementation. I am checking the performance.
> `tbSum` is a reference because the function return type is `-> T&` and it returns a dereferenced pointer: `return *data;`.
Thanks for the explanation! That makes sense.
Concerning the `reduce` implementation, I had an offline discussion with @psychocoderHPC: the concept to benchmark here is any implementation of a reduction based on alpaka. In that sense, we are not required to follow the reference implementation precisely. Not hammering on shared memory with every thread is probably a worthwhile change.
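To make the trade-off concrete, here is a hedged CUDA sketch (assumed names and signatures, not the PR's actual alpaka kernel) contrasting upstream's accumulation directly in shared memory with the register-accumulation variant discussed above; both variants end in the same block-level tree reduction.

```cuda
#define TBSIZE 1024 // upstream babelstream's fixed block size

// Assumes blockDim.x == TBSIZE and that it is a power of two.
__global__ void dotKernel(float const* a, float const* b, float* blockSums, int n)
{
    __shared__ float tbSum[TBSIZE];
    int const local = threadIdx.x;

    // Upstream-style accumulation would read-modify-write shared memory in
    // every loop iteration:
    //   tbSum[local] = 0.0f;
    //   for (...) tbSum[local] += a[i] * b[i];

    // Register-accumulation variant: keep the partial sum in a register and
    // store it into shared memory exactly once.
    float threadSum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + local; i < n; i += blockDim.x * gridDim.x)
        threadSum += a[i] * b[i];
    tbSum[local] = threadSum;

    // Block-level tree reduction, identical in both variants.
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
    {
        __syncthreads();
        if (local < offset)
            tbSum[local] += tbSum[local + offset];
    }
    if (local == 0)
        blockSums[blockIdx.x] = tbSum[0];
}
```

The single store per thread avoids a shared-memory read-modify-write on every loop iteration, which is the "hammering" mentioned above.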
@mehmetyusufoglu Can you please check if the CPU is working too?
```
Randomness seeded to: 2905169299

Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccCpuSerial<1,unsigned int>
NumberOfRuns:2
Precision:single
DataSize(items):1048576
DeviceName:13th Gen Intel(R) Core(TM) i7-1360P
WorkDivInit :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivAdd :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0107734

Kernels      Bandwidths(GB/s)  MinTime(s)  MaxTime(s)  AvgTime(s)  DataUsage(MB)
AddKernel    0.026882          0.46807     0.46807     0.46807     12.583
CopyKernel   0.019559          0.42889     0.42889     0.42889     8.3886
InitKernel   0.029203          0.43088     0.43088     0.43088     12.583
MultKernel   0.019739          0.42498     0.42498     0.42498     8.3886
TriadKernel  0.025445          0.49452     0.49452     0.49452     12.583
```
```
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:single
DataSize(items):1048576
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0135214

Kernels      Bandwidths(GB/s)  MinTime(s)  MaxTime(s)  AvgTime(s)  DataUsage(MB)
AddKernel    80.998            0.00015535  0.00015535  0.00015535  12.583
CopyKernel   69.936            0.00011995  0.00011995  0.00011995  8.3886
DotKernel    48.621            0.00017253  0.00017253  0.00017253  8.3886
InitKernel   51.814            0.00024285  0.00024285  0.00024285  12.583
MultKernel   76.158            0.00011015  0.00011015  0.00011015  8.3886
TriadKernel  81.478            0.00015443  0.00015443  0.00015443  12.583
```
```
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccCpuSerial<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):1048576
DeviceName:13th Gen Intel(R) Core(TM) i7-1360P
WorkDivInit :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivAdd :{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1048576), blockThreadExtent: (1), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0151765

Kernels      Bandwidths(GB/s)  MinTime(s)  MaxTime(s)  AvgTime(s)  DataUsage(MB)
AddKernel    0.059712          0.42146     0.42146     0.42146     25.166
CopyKernel   0.042238          0.39721     0.39721     0.39721     16.777
InitKernel   0.03913           0.64314     0.64314     0.64314     25.166
MultKernel   0.04646           0.36111     0.36111     0.36111     16.777
TriadKernel  0.062699          0.40138     0.40138     0.40138     25.166
```
```
Kernels: Init, Copy, Mul, Add, Triad Kernels (and Dot kernel, if acc is multi-thread per block.)

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):1048576
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.0173797

Kernels      Bandwidths(GB/s)  MinTime(s)  MaxTime(s)  AvgTime(s)  DataUsage(MB)
AddKernel    87.21             0.00028857  0.00028857  0.00028857  25.166
CopyKernel   80.442            0.00020856  0.00020856  0.00020856  16.777
DotKernel    73.262            0.000229    0.000229    0.000229    16.777
InitKernel   85.267            0.00029514  0.00029514  0.00029514  25.166
MultKernel   85.196            0.00019693  0.00019693  0.00019693  16.777
TriadKernel  87.512            0.00028757  0.00028757  0.00028757  25.166
```
```
===============================================================================
All tests passed (18 assertions in 4 test cases)
```
> Concerning the `reduce` implementation [...] Not hammering on shared memory with every thread is probably a worthwhile change.
OK, I reverted it back. (Yes, accessing shared memory many times from each thread is not needed in such a case.)
The missing optional babelstream kernel, `NStream`, is added; it can be run separately on its own. One of the 5 kernels of babelstream, the triad kernel, could optionally be run alone in the original code by UoB; this option is also added. This PR is an extension of the previous PR: #2299.
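For context, upstream BabelStream defines the NStream operation as `a += b + scalar * c`. A minimal CUDA sketch of that operation (illustrative names, not the PR's alpaka kernel):

```cuda
// Hedged sketch of the optional NStream kernel; upstream BabelStream computes
// a[i] += b[i] + scalar * c[i]. Names are illustrative, not the PR's code.
__global__ void nstreamKernel(float* a, float const* b, float const* c, float scalar, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += b[i] + scalar * c[i];
}
```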
New parameters are used, and kernels are called with specific arrays in the kernel-call sequence below to avoid cache usage differences:

```
A = 0.1, B = 0.2, C = 0.0, scalar = 0.4
C = A              // copy
B = scalar * C     // mult
C = A + B          // add
A = B + scalar * C // triad
```
The `Dot` kernel is only run by multi-threaded accs, since the original babelstream uses a fixed block size of 1024, which is also the shared-memory array size per block (search for "#define TBSIZE 1024" in this CUDA code).

RESULTS