Closed HJLebbink closed 6 years ago
I didn't consider this trait. Do you think it wouldn't impact performance? Does a compiler always inline the call?
ICC18 does not automatically inline in my code. I added a __forceinline to the signature to ensure the call is indeed inlined.
template<> __forceinline __m128i ternary<0x04>(const __m128i A, const __m128i B, const __m128i C)
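Since `__forceinline` is an MSVC/ICC keyword, here is a minimal sketch of how the same specialization pattern might be written portably. `FORCE_INLINE` and `ternary_scalar` are hypothetical names that do not come from the thread, and the function works on a plain 64-bit integer instead of `__m128i` so that it compiles without SSE/AVX support.

```cpp
#include <cstdint>

// FORCE_INLINE is a hypothetical macro name (an assumption for this sketch).
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define FORCE_INLINE inline __attribute__((always_inline))
#else
  #define FORCE_INLINE inline  // only a hint; the compiler may ignore it
#endif

// Primary template: declared only, each imm8 value gets its own specialization.
template <std::uint8_t Imm>
std::uint64_t ternary_scalar(std::uint64_t a, std::uint64_t b, std::uint64_t c);

// imm8 = 0x04 has only truth-table bit 2 set, i.e. a=0, b=1, c=0.
template <>
FORCE_INLINE std::uint64_t ternary_scalar<0x04>(std::uint64_t a,
                                                std::uint64_t b,
                                                std::uint64_t c)
{
    return ~a & b & ~c;
}
```

Feeding the inputs 0xF0, 0xCC and 0xAA to a ternary function reproduces its imm8 in the low byte, which gives a quick sanity check for a hand-written specialization.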
And I call these methods from a large switch statement to get rid of the template parameter.
__forceinline __m512i ternary(const __m512i a, const __m512i b, const __m512i c, const int i)
{
    switch (i) {
    case 0: return _mm512_ternarylogic_epi64(a, b, c, 0);
    case 1: return _mm512_ternarylogic_epi64(a, b, c, 1);
    case 2: return _mm512_ternarylogic_epi64(a, b, c, 2);
    case 3: return _mm512_ternarylogic_epi64(a, b, c, 3);
    case 4: return ternary(a, c, b, 2); // function 4 is function 2 with b and c swapped
    case 5: return ternary(a, c, b, 3); // function 5 is function 3 with b and c swapped
    ...
This allows the compiler to optimize further, but those optimizations may be specific to my application (in which I brute-force search all combinations of vpternlogs).
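To check which cases can be redirected to another function with permuted arguments, a scalar model of the instruction is handy. The following is a rough sketch, not code from the fork: `ternary_model` is a hypothetical helper that emulates one 64-bit lane, using the indexing documented for `vpternlog`, where the bit index into imm8 is `(a << 2) | (b << 1) | c`.

```cpp
#include <cstdint>

// Scalar model of one 64-bit lane of vpternlog (hypothetical helper).
// For every bit position, the three input bits form an index 0..7 that
// selects one bit of the truth table imm8.
inline std::uint64_t ternary_model(std::uint64_t a, std::uint64_t b,
                                   std::uint64_t c, std::uint8_t imm8)
{
    std::uint64_t result = 0;
    for (int bit = 0; bit < 64; ++bit) {
        const unsigned index =
            static_cast<unsigned>((((a >> bit) & 1u) << 2) |
                                  (((b >> bit) & 1u) << 1) |
                                  ((c >> bit) & 1u));
        result |= static_cast<std::uint64_t>((imm8 >> index) & 1u) << bit;
    }
    return result;
}
```

With this model, `ternary_model(a, c, b, 2)` and `ternary_model(a, b, c, 4)` agree for all inputs, and likewise `ternary_model(a, c, b, 3)` and `ternary_model(a, b, c, 5)`.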
It may be relevant for your project simply because you now only need to curate 80 functions instead of 256.
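The count of 80 can be reproduced by grouping the 256 truth tables under all six orderings of the three arguments. This is a minimal sketch, assuming that the smallest imm8 in each group is chosen as its representative; the function names are hypothetical, and neither the project's scripts nor ternary_logic.cpp necessarily work this way.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>

// Truth table of g(x0, x1, x2) = f(x[perm[0]], x[perm[1]], x[perm[2]]),
// where f has truth table imm8 and x0, x1, x2 stand for a, b, c.
inline std::uint8_t permute_imm8(std::uint8_t imm8, const std::array<int, 3>& perm)
{
    std::uint8_t out = 0;
    for (unsigned index = 0; index < 8; ++index) {
        // Split the index into its input bits: a is bit 2, b is bit 1, c is bit 0.
        const unsigned x[3] = {(index >> 2) & 1u, (index >> 1) & 1u, index & 1u};
        const unsigned src = (x[perm[0]] << 2) | (x[perm[1]] << 1) | x[perm[2]];
        out |= static_cast<std::uint8_t>(((imm8 >> src) & 1u) << index);
    }
    return out;
}

// Smallest imm8 reachable by reordering the arguments (assumed representative).
inline std::uint8_t canonical_imm8(std::uint8_t imm8)
{
    std::array<int, 3> perm = {0, 1, 2};
    std::uint8_t best = imm8;
    do {
        best = std::min(best, permute_imm8(imm8, perm));
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}

// Number of distinct representatives over all 256 functions.
inline std::size_t count_classes()
{
    std::set<std::uint8_t> representatives;
    for (unsigned imm = 0; imm < 256; ++imm) {
        representatives.insert(canonical_imm8(static_cast<std::uint8_t>(imm)));
    }
    return representatives.size();
}
```

Here `count_classes()` returns 80, and `canonical_imm8(4)` yields 2, so function 4 can be served by function 2 with reordered arguments.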
I should have known it when I was manually optimizing functions. :) Now the number of functions is irrelevant as all optimizations are performed by scripts.
Could you please elaborate on your application? If you'd rather not share it here, feel free to write me an email at wojciech_mula@poczta.onet.pl.
See this question on StackOverflow for why I'm studying the vpternlog instruction. I've committed the file ternary_logic.cpp to my fork of this code. This file contains convenience methods that you may find handy or that you could add to your code.
Is there a particular reason for not reusing operations that are equivalent under a permutation of their parameters? I noticed that my compiler only very occasionally merges such equivalent operations. The current 256 operations can be reduced to 80. For example:
I'll take a look at the Python code to see if I can use the following equivalences: