alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Optimize Babelstream for GPU #1874

Open bernhardmgruber opened 1 year ago

bernhardmgruber commented 1 year ago

We merged the Babelstream benchmark in #1846. I am sure there is some optimization potential left on the table. Here are some numbers from my GeForce RTX 2060:

BabelStream
Version: 4.0
Implementation: alpaka/LLAMA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using alpaka device NVIDIA GeForce RTX 2060
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        297715.310  0.00180     0.00256     0.00194     
Mul         297806.305  0.00180     0.00632     0.00198     
Add         303741.451  0.00265     0.00548     0.00295     
Triad       304093.569  0.00265     0.00349     0.00285     
Dot         311901.594  0.00172     0.00210     0.00177  

The reported bandwidth is around 300 GB/s, but the device has a theoretical peak bandwidth of 336 GB/s.
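For reference (assuming I read the BabelStream output right), the MBytes/sec column is the bytes moved by the kernel divided by the fastest run, e.g. for Copy (one array read, one written):

2 x 268.4 MB / 0.00180 s ≈ 298 GB/s ≈ 89% of the 336 GB/s peak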

Most kernels are also implemented to process a single double value per thread, which may waste potential. I (naively) tried processing a few elements per thread without success. I also tried different load/store intrinsics, in particular __ldcs and __stcs for streaming access, again without improvement.
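For illustration, this is roughly what the streaming-access idea looks like as a plain CUDA kernel (just a sketch with a made-up kernel name, not the actual alpaka code):

__global__ void copyStreaming(double const* __restrict__ a, double* __restrict__ c, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        __stcs(&c[i], __ldcs(&a[i])); // streaming load/store: hint that the data is not reused from cache
}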

I would welcome it if any of our GPU experts could take a look and see whether anything can be improved here.

j-stephan commented 1 year ago

NVIDIA often advises that 70%-80% of the theoretical bandwidth is already quite good. The benchmark here achieves about 90%. I'm not sure how worthwhile this endeavour would be and how much there is left to gain without spending countless hours on micro-optimizations.

Edit: I would also suggest testing this on an HPC-grade GPU.

bernhardmgruber commented 1 year ago

> NVIDIA often advises that 70%-80% of the theoretical bandwidth is already quite good.

I guess that refers to kernels in general. Babelstream is almost pure loads and stores, so I assume we can get closer here.

> Edit: I would also suggest testing this on an HPC-grade GPU.

I can try, sure!

bernhardmgruber commented 1 year ago

V100:

BabelStream
Version: 4.0
Implementation: alpaka
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using alpaka device Tesla V100-SXM2-32GB
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        789949.358  0.00068     0.00069     0.00068     
Mul         786742.779  0.00068     0.00069     0.00068     
Add         810901.589  0.00099     0.00101     0.00100     
Triad       812557.632  0.00099     0.00101     0.00100     
Dot         822284.067  0.00065     0.00066     0.00066

The GPU's bandwidth according to the specification is 897.0 GB/s, so we are similarly close (about 90% of peak). The dot kernel is actually the fastest, and it is the only one using a grid-strided loop. This could be a hint that the other kernels should also be grid-strided; a sketch of what that could look like follows below.
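For reference, a grid-strided Copy could look roughly like this (just a sketch mirroring the dot kernel's pattern, not the current benchmark code):

struct CopyKernel
{
    template<typename TAcc, typename T>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, T const* a, T* c, alpaka::Idx<TAcc> arraySize) const
    {
        // linearized global thread index and total number of threads in the grid
        auto [i] = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc);
        auto const [totalThreads] = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);
        // each thread strides over the whole array instead of handling a single element
        for(; i < arraySize; i += totalThreads)
            c[i] = a[i];
    }
};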

bernhardmgruber commented 1 year ago

Btw, this is how we compare with the CUDA version of BabelStream: [image: bandwidth comparison chart]

bernhardmgruber commented 1 year ago

I tried using grid-strided loops today and they actually worsened performance. Tested on an RTX 2060 and an A100.

bernhardmgruber commented 10 months ago

The alpaka version with the SYCL backend compares poorly against the upstream SYCL BabelStream:

[image: bandwidth comparison chart]

All BabelStream versions ran with babelstream --float -s $((2**30)). If multiple devices are available (e.g. when running upstream BabelStream), make sure to use the oneAPI Level Zero platform and not OpenCL; the latter crashes for sizes beyond 2**29.
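For completeness: with a recent oneAPI DPC++ runtime, the Level Zero platform can be pinned via the device selector environment variable (the exact selector syntax may differ between compiler versions), e.g.:

ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./babelstream --float -s $((2**30))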

[update from @psychocoderHPC] Tested on ARC770.

bernhardmgruber commented 10 months ago

For HIP, we are fine:

[image: bandwidth comparison chart]

psychocoderHPC commented 10 months ago

For Intel SYCL it makes sense to load data as a 2-component vector, e.g. float2:

#  --mibibytes is required because of the hacky division by 2 in the data-load loop
./example/babelstream/babelstream  --float -s $((2**29))  --mibibytes

# original dev results on ARC770
Dot         277580.082  0.01476     0.01640     0.01490 
# after the tiny hacky patch
Dot         395646.154  0.01035     0.01177     0.01043   
diff --git a/example/babelstream/src/AlpakaStream.cpp b/example/babelstream/src/AlpakaStream.cpp
index bca66c7cfbe..27495761727 100644
--- a/example/babelstream/src/AlpakaStream.cpp
+++ b/example/babelstream/src/AlpakaStream.cpp
@@ -185,6 +185,28 @@ void AlpakaStream<T>::nstream()
     alpaka::wait(queue);
 }

+template<typename Type>
+struct VecTT
+{
+    Type x;
+    Type y;
+    ALPAKA_FN_ACC auto sum() const
+    {
+        return x+y;
+    }
+};
+
+template<typename T>
+ALPAKA_FN_ACC VecTT<T> operator*(VecTT<T> const& lhs, VecTT<T> const& rhs)
+{
+    /* to avoid allocation side effects the result is always a vector
+     * with default policies*/
+    VecTT<T> result(lhs);
+    result.x *= rhs.x;
+    result.y *= rhs.y;
+    return result;
+}
+
 struct DotKernel
 {
     template<typename TAcc, typename T>
@@ -197,9 +219,19 @@ struct DotKernel
         auto const [local_i] = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc);
         auto const [totalThreads] = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);

-        T thread_sum = 0;
-        for(; i < arraySize; i += totalThreads)
-            thread_sum += a[i] * b[i];
+        //using Dim = alpaka::DimInt<2>;
+
+        using VecT = VecTT<T>; //alpaka::Vec<Dim,T>;
+
+        T thread_sum = {0};
+        for(; i < arraySize / 2; i += totalThreads)
+        {
+            auto const valA = reinterpret_cast<VecT const*>(a)[i];
+            auto const valB = reinterpret_cast<VecT const*>(b)[i];
+            auto const val = valA * valB;
+            thread_sum += val.sum();
+        }
+
         tb_sum[local_i] = thread_sum;

         auto const [blockDim] = alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc);

bernhardmgruber commented 10 months ago

@psychocoderHPC that's a valuable insight! I have not studied the HW architecture of Intel GPUs yet, so I don't know how their load/store units work and whether they benefit from having more instructions per loop iteration. The latter was often done for CUDA by loop unrolling; see the sketch at the end of this comment.

In any case, this should be somewhat equivalent to your solution:

T thread_sum = 0;
for(; i < arraySize / 2; i += totalThreads)
    thread_sum += a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];

Also loads two adjacent values per thread instead of one.
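For comparison, the CUDA-style unrolling mentioned above would look roughly like this in the dot kernel (a sketch only; ELEMENTS_PER_THREAD is a made-up constant and none of this has been benchmarked):

constexpr int ELEMENTS_PER_THREAD = 4;
T thread_sum = 0;
// unrolled grid-stride loop: each iteration issues several independent loads
for(; i + (ELEMENTS_PER_THREAD - 1) * totalThreads < arraySize; i += ELEMENTS_PER_THREAD * totalThreads)
{
#pragma unroll
    for(int k = 0; k < ELEMENTS_PER_THREAD; ++k)
        thread_sum += a[i + k * totalThreads] * b[i + k * totalThreads];
}
// remainder loop for the tail elements
for(; i < arraySize; i += totalThreads)
    thread_sum += a[i] * b[i];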