Open bernhardmgruber opened 1 year ago
NVIDIA often advises that 70%-80% of the theoretical bandwidth is already quite good. The benchmark here achieves 90%. I'm not sure how worthwhile this endeavour would be and how much there is left to gain without spending countless hours on micro-optimizations.
Edit: I would also suggest testing this on an HPC-grade GPU.
> NVIDIA often advises that 70%-80% of the theoretical bandwidth is already quite good.
I guess that refers to kernels in general. BabelStream is almost pure loads and stores, so I assume we can get closer here.
> Edit: I would also suggest testing this on an HPC-grade GPU.
I can try, sure!
V100:
BabelStream
Version: 4.0
Implementation: alpaka
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using alpaka device Tesla V100-SXM2-32GB
Function MBytes/sec Min (sec) Max Average
Copy 789949.358 0.00068 0.00069 0.00068
Mul 786742.779 0.00068 0.00069 0.00068
Add 810901.589 0.00099 0.00101 0.00100
Triad 812557.632 0.00099 0.00101 0.00100
Dot 822284.067 0.00065 0.00066 0.00066
The GPU's bandwidth according to the specification is 897.0 GB/s, so we are similarly close (roughly 88-92% of peak, depending on the kernel). The dot kernel is actually the fastest, and it is the only one using a grid-strided loop. This could be a hint that the other kernels should also be grid-strided.
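For concreteness, a grid-strided copy kernel would look roughly like the sketch below. The kernel name and parameter list are illustrative, not the code currently in the repository; it simply transplants the DotKernel's loop structure onto copy:

#include <alpaka/alpaka.hpp>

// Hypothetical grid-strided copy kernel, modeled on the DotKernel's loop:
// each thread starts at its global linear index and then strides by the
// total number of threads in the grid until the end of the array.
struct GridStridedCopyKernel
{
    template<typename TAcc, typename T>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, T const* a, T* c, int arraySize) const
    {
        auto [i] = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc);
        auto const [totalThreads] = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);
        for(; i < arraySize; i += totalThreads)
            c[i] = a[i];
    }
};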
Btw, this is how we compare with the CUDA implementation of BabelStream:
I tried using grid-strided loops today and they actually worsened performance. Tested on RTX 2060 and A100.
The alpaka version with the SYCL backend compares poorly against the upstream SYCL BabelStream:
All BabelStream versions ran with babelstream --float -s $((2**30)). If multiple devices are available (e.g. when running upstream BabelStream), make sure to use the oneAPI Level Zero platform and not OpenCL. The latter crashes beyond a size of 2**29.
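In case it helps with reproducing the comparison, the following tiny standalone SYCL listing (not part of BabelStream, just a sketch) prints every platform/device pair the runtime exposes, which makes it easy to see whether the Level Zero or the OpenCL platform would be picked:

#include <sycl/sycl.hpp>
#include <iostream>

// List all SYCL platforms and their devices so the Level Zero GPU can be
// identified and selected explicitly when launching the benchmark.
int main()
{
    for(auto const& platform : sycl::platform::get_platforms())
        for(auto const& device : platform.get_devices())
            std::cout << platform.get_info<sycl::info::platform::name>() << " : "
                      << device.get_info<sycl::info::device::name>() << '\n';
}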
[update from @psychocoderHPC] Tested on ARC770.
For HIP, we are fine:
For Intel SYCL it makes sense to load the data as a two-component vector, e.g. float2.
# --mibibytes is required because of the hacky division by 2 in the load data loop
./example/babelstream/babelstream --float -s $((2**29)) --mibibytes
# original dev results on ARC770
Dot 277580.082 0.01476 0.01640 0.01490
# after the tiny hacky patch
Dot 395646.154 0.01035 0.01177 0.01043
diff --git a/example/babelstream/src/AlpakaStream.cpp b/example/babelstream/src/AlpakaStream.cpp
index bca66c7cfbe..27495761727 100644
--- a/example/babelstream/src/AlpakaStream.cpp
+++ b/example/babelstream/src/AlpakaStream.cpp
@@ -185,6 +185,28 @@ void AlpakaStream<T>::nstream()
alpaka::wait(queue);
}
+template<typename Type>
+struct VecTT
+{
+ Type x;
+ Type y;
+ ALPAKA_FN_ACC auto sum() const
+ {
+ return x+y;
+ }
+};
+
+template<typename T>
+ALPAKA_FN_ACC VecTT<T> operator*(VecTT<T> const& lhs, VecTT<T> const& rhs)
+{
+ /* to avoid allocation side effects the result is always a vector
+ * with default policies*/
+ VecTT<T> result(lhs);
+ result.x *= rhs.x;
+ result.y *= rhs.y;
+ return result;
+}
+
struct DotKernel
{
template<typename TAcc, typename T>
@@ -197,9 +219,19 @@ struct DotKernel
auto const [local_i] = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc);
auto const [totalThreads] = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);
- T thread_sum = 0;
- for(; i < arraySize; i += totalThreads)
- thread_sum += a[i] * b[i];
+ //using Dim = alpaka::DimInt<2>;
+
+ using VecT = VecTT<T>; //alpaka::Vec<Dim,T>;
+
+ T thread_sum = {0};
+ for(; i < arraySize / 2; i += totalThreads)
+ {
+ auto const valA = reinterpret_cast<VecT const*>(a)[i];
+ auto const valB = reinterpret_cast<VecT const*>(b)[i];
+ auto const val = valA * valB;
+ thread_sum += val.sum();
+ }
+
tb_sum[local_i] = thread_sum;
auto const [blockDim] = alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc);
@psychocoderHPC that's a valuable insight! I did not study the HW architecture of Intel GPUs yet, so I don't know how their load/store units work and whether they benefit from having more instructions per loop iteration. The latter was often done for CUDA via loop unrolling.
In any case, this should be somewhat equivalent to your solution:
T thread_sum = 0;
for(i *= 2; i + 1 < arraySize; i += totalThreads * 2) // start at twice the global thread index so pairs don't overlap
    thread_sum += a[i] * b[i] + a[i + 1] * b[i + 1];
This also loads two adjacent values per thread per iteration instead of one.
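One detail both variants share (the float2 patch via its arraySize / 2 bound and the pair-wise loop above): an odd arraySize silently drops the last element. The benchmark sizes here are powers of two, so it does not matter in practice, but a hedged sketch of a tail fix, reusing i and totalThreads from the existing DotKernel, would be:

// After the pair-wise loop above every thread has i >= arraySize - 1.
// For odd arraySize exactly one thread ends up with i == arraySize - 1 and
// picks up the remaining element; for even arraySize no thread does.
if(i < arraySize)
    thread_sum += a[i] * b[i];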
We merged the BabelStream benchmark in #1846. I am sure there is some optimization potential left on the table. Here are some numbers from my GeForce RTX 2060:
The reported bandwidth is somewhere around 300 GB/s, but the device has a peak bandwidth of 336 GB/s, so we reach roughly 90% of peak.
Most kernels are also implemented to process a single double value per thread, which may be a waste. I (naively) tried processing a few elements per thread, without success. I also tried using different load/store intrinsics, in particular __ldcs and __stcs for streaming access, also without improvement. I would welcome it if any of our GPU experts could have a look at whether something could be improved there.
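For reference, a streaming-access copy kernel with these intrinsics looks roughly like the following plain CUDA sketch (kernel name and launch setup are illustrative, not the exact code that was benchmarked):

// Minimal CUDA sketch of a copy kernel using cache-streaming intrinsics.
// __ldcs/__stcs compile to ld.global.cs / st.global.cs, hinting that the
// data is touched only once and can be evicted from the caches first.
__global__ void copyStreaming(double const* __restrict__ a, double* __restrict__ c, int arraySize)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < arraySize)
        __stcs(&c[i], __ldcs(&a[i]));
}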