alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

BabelStream benchmark broken #2419

Open psychocoderHPC opened 2 weeks ago

psychocoderHPC commented 2 weeks ago

Our current BabelStream implementation in the last release (1.2.0) is broken. With #2299 the behaviour of the benchmark was changed.

Upstream BabelStream computes the following:

init:    a, b, c         // set to init value a = 0.1 ; b = 0.2 ; c = 0.0
copy:    c = a
mul:     b = scalar * c  // with scalar = 0.4
add:     c = a + b
triad:   a = b + scalar * c
nstream: a += b + scalar * c
dot:     sum = a * b     // reduce 

verification: check errSum = abs((sum - goldSum) / goldSum)
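
For orientation, here is a minimal sequential C++ sketch of that upstream kernel chain (this is not the actual upstream or alpaka source; the `runStream` name, the plain loops, and the use of `std::vector` are only illustrative):

```c++
#include <cstddef>
#include <vector>

// Upstream-style BabelStream kernel chain (sketch). Every kernel reads the
// result of an earlier one, so a wrong or skipped kernel propagates into
// the values that the verification (errSum = abs((sum - goldSum) / goldSum))
// finally checks.
double runStream(std::vector<double>& a, std::vector<double>& b,
                 std::vector<double>& c, double scalar)
{
    std::size_t const n = a.size();
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i];                  // copy
    for (std::size_t i = 0; i < n; ++i) b[i] = scalar * c[i];         // mul
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];           // add
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];  // triad
    for (std::size_t i = 0; i < n; ++i) a[i] += b[i] + scalar * c[i]; // nstream (optional)

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];           // dot (reduce)
    return sum;
}
```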

Our current implementation:

init:    a, b, c         // set to init value a = 1.0 ; b = 0.0 ; c = 0.0
copy:    b = a
mul:     b = scalar * a  // with scalar = 2.0
add:     c = a + b
triad:   c = a + scalar * b
nstream: missing
dot:     sum = a * b     // reduce 

verification: 
  check if c == initA + scalar * scalar * initA
  check if b == scalar * initA
  check if a == initA

Note that nstream is optional in the upstream code, but we had implemented it in the version before #2299.

The problem is that you cannot use the current implementation to compare performance against other implementations: we implement the corresponding stream functions, but with different access orders to the arrays, which can affect performance because of caching effects. This should not be a big issue for large arrays. Worse, the copy kernel can be removed or implemented wrongly and the verification still reports that all results are correct. The reason for this behaviour is that copy sets b and the next mul operation overwrites that result; the same issue occurs with the add operation, whose result in c is overwritten by triad. In our implementation, array a is never changed except in the init.
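
To make the masking concrete, here is a small, purely illustrative C++ program (not the alpaka kernels; the init value and scalar are taken from the listing above) in which the copy kernel is deliberately left out, yet all three checks of the current verification still pass:

```c++
#include <cassert>
#include <cstddef>
#include <vector>

int main()
{
    double const initA = 1.0;
    double const scalar = 2.0;
    std::size_t const n = 1024;
    std::vector<double> a(n, initA), b(n, 0.0), c(n, 0.0);

    // copy (b = a) is intentionally skipped / broken.
    for (std::size_t i = 0; i < n; ++i) b[i] = scalar * a[i];        // mul overwrites b anyway
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];          // add
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + scalar * b[i]; // triad overwrites add's result

    // Current verification: closed-form expectations per array, all of
    // which are still met although copy never ran.
    for (std::size_t i = 0; i < n; ++i)
    {
        assert(c[i] == initA + scalar * scalar * initA); // 1 + 2*2*1 = 5
        assert(b[i] == scalar * initA);                  // 2*1 = 2
        assert(a[i] == initA);                           // a is never written after init
    }
    return 0;
}
```

The same applies to the add kernel: its result in c is overwritten by triad before any check runs.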

The upstream code is designed so that the operations depend on each other; changing the order or making a mistake in the implementation will result in a validation error.
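
For comparison, a hedged sketch of such a dependency-chained validation, modelled on the upstream idea of replaying the operations on scalar gold values (the real upstream code may differ in details such as iteration count, tolerance, and the handling of nstream):

```c++
#include <cmath>
#include <cstddef>
#include <vector>

// Replay the same kernel chain on scalar gold values and compare the
// measured arrays against them. Because every gold value depends on the
// previous operations, any reordered, skipped, or wrong kernel shows up
// as an error in at least one of the comparisons.
bool validate(std::vector<double> const& a, std::vector<double> const& b,
              std::vector<double> const& c, double sum,
              double initA, double initB, double initC,
              double scalar, int numIterations)
{
    double goldA = initA, goldB = initB, goldC = initC;
    for (int it = 0; it < numIterations; ++it)
    {
        goldC = goldA;                   // copy
        goldB = scalar * goldC;          // mul
        goldC = goldA + goldB;           // add
        goldA = goldB + scalar * goldC;  // triad
    }
    double const goldSum = goldA * goldB * static_cast<double>(a.size()); // dot

    double errA = 0.0, errB = 0.0, errC = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        errA += std::fabs(a[i] - goldA);
        errB += std::fabs(b[i] - goldB);
        errC += std::fabs(c[i] - goldC);
    }
    double const errSum = std::fabs((sum - goldSum) / goldSum);
    double const eps = 1.0e-8; // illustrative tolerance

    return errA / static_cast<double>(a.size()) < eps
        && errB / static_cast<double>(a.size()) < eps
        && errC / static_cast<double>(a.size()) < eps
        && errSum < eps;
}
```

With this scheme, a skipped or broken copy kernel leaves the measured arrays different from the gold values, so it cannot go unnoticed.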

WARNING: please do not use the BabelStream implementation of alpaka release 1.2.0 for comparison against other BabelStream implementations!

SimeonEhrig commented 2 weeks ago

I agree with @psychocoderHPC. In theory it could be possible that the changed kernels do the same calculations as before, but it would be a huge amount of work to prove it. Therefore it is much easier to change the kernels back to how they were before, so that we can be sure our benchmarks are comparable with the reference implementation.

The init values could also be problematic. Who guarantees us that the out-of-order execution of a CPU does not skip the addition with 0 or the multiplication with 1 because it is unnecessary? It is also possible that our implementation of the multiplication kernel is actually the same as the copy kernel, because each value is multiplied by 1.

mehmetyusufoglu commented 1 week ago

@psychocoderHPC

  1. The current alpaka code sums the errors for verification, so your definition of the error above, describing the current situation, is not correct.
  2. Secondly, the init values of b and c are not important; they are overwritten.
  3. Third, the caching effect you are talking about is valid, but it was not observed in the results on different machines, therefore I did not notice it. [The order is important as far as I saw; even AMD used the Univ Bristol repository, so the kernel run order is standard.]
  4. The kernel nstream is not included in the BabelStream benchmark. It is optional and not one of the 5 standard BabelStream benchmark kernels. Nobody knows why it is there; it is not run and not displayed in the results. Thanks, I have mostly fixed this.

mehmetyusufoglu commented 1 week ago

@SimeonEhrig Since the scalar is 2, the multiplication is not equal to a copy, but I understand your point. Thanks.