WojciechMula / toys

Storage for my snippets, toy programs, etc.
BSD 2-Clause "Simplified" License

Update Haswell results for sse-sumbytes int8_t to include sadbw variant #10

mayeut closed 5 years ago

mayeut commented 5 years ago

Thanks for updating the results on other architectures as well. Here are the results on Haswell with the sadbw variant from the second commit of #8.

Just a few remarks on your updated article:

> To avoid overflow, after a fixed number of iterations (32), the local accumulator is extended to 32-bit values and added to the global, 32-bit accumulator.

The fixed number of iterations is 128 (128 * (-128 + -128) == -32768); 32 is the width of the input, in bytes, consumed at each iteration.
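The bound above can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the formula `128 * (-128 + -128)` suggests, that each 16-bit accumulator lane receives the sum of two `int8_t` inputs per iteration. The helper name `max_safe_iterations` is made up for illustration and does not come from the repository.

```c
#include <stdint.h>

/* Hypothetical helper: the largest number of iterations after which a
   16-bit lane still cannot overflow, assuming every iteration adds the
   sum of two int8_t values to that lane. */
int max_safe_iterations(void) {
    const int min_step = 2 * INT8_MIN;        /* worst negative step: -256 */
    const int max_step = 2 * INT8_MAX;        /* worst positive step: +254 */
    const int by_min = INT16_MIN / min_step;  /* -32768 / -256 = 128 */
    const int by_max = INT16_MAX / max_step;  /*  32767 /  254 = 129 */

    /* The negative side is the tighter limit. */
    return by_min < by_max ? by_min : by_max;
}
```

Under that assumption the negative extreme is the limiting case: 128 iterations land exactly on `INT16_MIN`, and one more would overflow.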

> On Haswell VPMADDUBSW method significantly outperforms other approaches. However, on Skylake and newer architectures is as fast (or as slow...) as other methods.

IMHO, that's not entirely true. I'd say that on Skylake and newer, it's comparable to the sadbw variant introduced in #8 (the one requiring only one VPSADBW instruction per iteration).

Which leads me to the last point: the sadbw variant is not (yet) documented in the article, although its results are present. I can try to explain it in a few sentences, but I guess you saw well enough what it's doing and you'll do a far better job than me of explaining it so that readers understand (thanks for your articles, I always enjoy reading them!). Anyway, I'll give it a try; feel free to include or entirely rewrite this explanation:

The sadbw variant mimics the uint8_t implementation from part 1 by adding an offset of 128 to each signed input, which maps it from [-128, 127] to the unsigned range [0, 255]. As a final step, we then need to subtract 128 times the number of input elements.
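The explanation above could be sketched roughly as follows. This is not the code from #8: it is a minimal 128-bit SSE2 illustration (the benchmarked variant targets wider vectors), the function name `sum_int8_sadbw` is hypothetical, and a scalar loop handles the tail as well as builds without SSE2. Adding 128 modulo 256 is done here by flipping the top bit of each byte with XOR, which is one possible way to apply the offset.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; sums n signed bytes using the offset trick. */
int64_t sum_int8_sadbw(const int8_t *data, size_t n) {
    uint64_t biased = 0;  /* sum of (x + 128) over all elements */
    size_t i = 0;

#if defined(__SSE2__)
    const __m128i bias = _mm_set1_epi8(-128);  /* 0x80 in every byte */
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;                        /* two 64-bit partial sums */

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        /* Flipping the top bit maps [-128, 127] onto [0, 255]. */
        __m128i u = _mm_xor_si128(v, bias);
        /* PSADBW against zero sums each group of 8 unsigned bytes. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(u, zero));
    }

    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    biased = lanes[0] + lanes[1];
#endif

    /* Scalar tail (and full fallback when SSE2 is unavailable). */
    for (; i < n; i++)
        biased += (uint8_t)((uint8_t)data[i] ^ 0x80);

    /* Undo the offset: every element contributed an extra 128. */
    return (int64_t)biased - 128 * (int64_t)n;
}
```

Because PSADBW produces 64-bit partial sums, the vector loop in this sketch needs no periodic widening step, and only one such instruction is issued per loaded vector.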

WojciechMula commented 5 years ago

Thank you for the corrections. I somehow missed the alternative SADBW variant and need to update the text. It also turned out that I invoked GCC with the wrong options (without -funroll-loops), whereas clang unrolls loops by default. Thus it seems that the timings from Skylake and newer CPUs have to be revisited again.

The algorithms for adding 16-bit values are also awaiting a separate article.

BTW, do you have a Twitter account?

mayeut commented 5 years ago

No Twitter account.