WojciechMula / simd-byte-lookup

SIMDized check which bytes are in a set
http://0x80.pl/articles/simd-byte-lookup.html
BSD 2-Clause "Simplified" License

No mention of the SSE4.2 string operations is a bit surprising #1

Open thomcc opened 4 years ago

thomcc commented 4 years ago

(Note: This isn't an "issue" in the normal sense -- I was going to send it as an email, but having the discussion recorded on GitHub might help someone who comes across it in the future and has the same question.)

Is there a reason there's no mention of the SSE4.2 string operations in the blog post (e.g. pcmp[ei]str[im])?

Admittedly, your blog post would be much less interesting if it was just a "here's how to use this instruction" post, and that would be really unfortunate – as it is, your article is a great example of how to "think in SIMD".

Moreover, the technique you describe in it extends to wider vector sizes better, and can be easily translated to other vector ISAs (NEON, etc.). It also won't have some of the issues that exist in the SSE4.2 ops[0]. So I'm certainly not saying you should change the article to be a pcmpestri tutorial!

I'm mostly just surprised it's not mentioned at all, given that SSE4.2 has dedicated support for testing byte set membership (via the _SIDD_CMP_EQUAL_ANY option on those instructions). I'd also expect it to perform better in practice on a lot of machines, at least for sufficiently large bytesets/haystacks (but I could be very wrong!).
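For anyone reading along who hasn't used these instructions, here is a minimal sketch of the _SIDD_CMP_EQUAL_ANY mode via the `_mm_cmpestrm` intrinsic. It is not taken from the article or the repository; the helper name `bytes_in_set_mask` and its signature are made up for illustration, it assumes an x86 CPU with SSE4.2 and a GCC/Clang-style compiler (for the `target` attribute), and it handles at most 16 set bytes and 16 input bytes per call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h> /* SSE4.2 intrinsics: _mm_cmpestrm and the _SIDD_* flags */

/* Hypothetical helper (illustration only): returns a 16-bit mask where
 * bit i is set when input[i] is one of the bytes in `set`.
 * Assumes 0 <= set_len <= 16 and 0 <= input_len <= 16. */
__attribute__((target("sse4.2")))
uint16_t bytes_in_set_mask(const uint8_t *set, int set_len,
                           const uint8_t *input, int input_len)
{
    /* Copy into zero-padded 16-byte buffers so the vector loads never
     * read past the end of the caller's data. */
    uint8_t set_buf[16] = {0};
    uint8_t in_buf[16] = {0};
    memcpy(set_buf, set, (size_t)set_len);
    memcpy(in_buf, input, (size_t)input_len);

    __m128i vset = _mm_loadu_si128((const __m128i *)set_buf);
    __m128i vin  = _mm_loadu_si128((const __m128i *)in_buf);

    /* EQUAL_ANY: for every valid byte of the second operand, check whether
     * it equals any valid byte of the first operand. The explicit lengths
     * make the padding bytes invalid, so they never produce a match.
     * BIT_MASK: the result is returned as one bit per input byte. */
    __m128i mask = _mm_cmpestrm(vset, set_len, vin, input_len,
                                _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                                _SIDD_BIT_MASK);

    return (uint16_t)_mm_cvtsi128_si32(mask);
}
```

With the set "aeiou" (length 5) and the input "hello world" (length 11), the matches are at positions 1, 4 and 7, so the returned mask is 0x92. A real loop over a longer haystack would call this once per 16-byte chunk; whether that beats the article's approach is exactly the open question here.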

Anyway given how thorough your post is, it seems extremely likely you've thought about this, so I figured I'd ask.

Thanks a ton either way, the blog post was a delight to read.


[0]: In particular, the microcode they expand to has branches, and microcode branches aren't (weren't?) branch predicted, although my experience has been that this is more of a theoretical concern than a practical one. That said, I feel like people always bring it up.

sharpobject commented 2 years ago

pcmpestri seems quite expensive according to the latency numbers in the Intel Intrinsics Guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html). I would be surprised if we could find a problem where it's the right choice.