WojciechMula / toys

Storage for my snippets, toy programs, etc.
BSD 2-Clause "Simplified" License

Scalar BMI2 for decoding base64 is not being run #2

Closed by nkurz 8 years ago

nkurz commented 8 years ago

Hi Wojciech --

Thanks for publishing this. I was briefly confused that the scalar BMI2 speed was the same as the SSE BMI2 speed, and then noticed that the test was actually calling the same function twice.

--nate

diff --git a/base64/decode/sse/speed.cpp b/base64/decode/sse/speed.cpp
index b5defa1..95881db 100644
--- a/base64/decode/sse/speed.cpp
+++ b/base64/decode/sse/speed.cpp
@@ -33,7 +33,7 @@ public:

 #if defined(HAVE_BMI2_INSTRUCTIONS)
         if (cmd.empty() || cmd.has("scalar_bmi2")) {
-            measure("scalar & BMI2", base64::sse::decode_bmi2, reference);
+            measure("scalar & BMI2", base64::scalar::decode_lookup1_bmi2, reference);
         }
 #endif
nkurz commented 8 years ago

Also, it seems like the speedup for the scalar BMI2 version is due solely to issuing one 32-bit write instead of three 8-bit writes:

-                *out++ = b0 | (b1 << 6);
-                *out++ = (b1 >> 2) | (b2 << 4);
-                *out++ = (b2 >> 4) | (b3 << 2);
+                uint32_t dword = b0 | (b1 << 6) | (b2 << 12) | (b3 << 18);
+                *reinterpret_cast<uint32_t*>(out) = dword;
+                out += 3;

After that patch, this is what I see on a Skylake i7-6700 CPU @ 3.40GHz:

nate@skylake:~/git/WojciechMula-toys/base64/decode/sse$ ./speed
input size: 67108864
     improved scalar... 0.024
              scalar... 0.041 (speed up: 0.59)
       scalar & BMI2... 0.041 (speed up: 0.58)
                 SSE... 0.018 (speed up: 1.33)
          SSE & BMI2... 0.016 (speed up: 1.50)
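
To make the store change in the patch above easier to check in isolation, here is a minimal sketch that compares the two variants. The function names (`store_bytes`, `store_dword`, `stores_agree`) are hypothetical and not taken from the repository. It assumes a little-endian target, 6-bit inputs (0..63), and at least one spare byte at the end of the output buffer, because the 32-bit store writes four bytes while the pointer advances by only three. It uses `std::memcpy` rather than the `reinterpret_cast` from the patch to avoid alignment and aliasing problems; compilers typically lower it to a single store.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Three separate 8-bit stores, as in the loop before the patch.
// b0..b3 are assumed to be 6-bit values (0..63) from the lookup step.
inline uint8_t* store_bytes(uint8_t* out, uint32_t b0, uint32_t b1,
                            uint32_t b2, uint32_t b3) {
    *out++ = static_cast<uint8_t>(b0 | (b1 << 6));
    *out++ = static_cast<uint8_t>((b1 >> 2) | (b2 << 4));
    *out++ = static_cast<uint8_t>((b2 >> 4) | (b3 << 2));
    return out;
}

// One 32-bit store. Writes 4 bytes but advances by 3, so the caller
// must leave one spare byte after the last decoded triple.
// Assumes a little-endian target.
inline uint8_t* store_dword(uint8_t* out, uint32_t b0, uint32_t b1,
                            uint32_t b2, uint32_t b3) {
    const uint32_t dword = b0 | (b1 << 6) | (b2 << 12) | (b3 << 18);
    std::memcpy(out, &dword, sizeof dword);
    return out + 3;
}

// Hypothetical helper: true when both variants produce the same
// three output bytes for the given 6-bit inputs.
inline bool stores_agree(uint32_t b0, uint32_t b1, uint32_t b2, uint32_t b3) {
    uint8_t a[4] = {0, 0, 0, 0};
    uint8_t b[4] = {0, 0, 0, 0};
    store_bytes(a, b0, b1, b2, b3);
    store_dword(b, b0, b1, b2, b3);
    return std::memcmp(a, b, 3) == 0;
}
```

Calling `stores_agree` over all inputs in the range 0..63 is a cheap way to confirm that the two variants differ only in how the bytes reach memory, not in the decoded output.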

By the way, it would be helpful if you said more about the CPU you run the tests on. In particular, the microarchitecture generation (Nehalem, Sandy Bridge, Haswell, Skylake, etc.) is very useful to know.

WojciechMula commented 8 years ago

Hi, thanks a lot for the report.

I had also noticed that the single write is the reason for the speedup, but unfortunately I have no permanent access to a Core i7 (Haswell) machine to verify that.

WojciechMula commented 8 years ago

Thanks, I've finally fixed that mistake in the speed program. :)