[SSE2] poor performance with emulated _mm_extract_epi8

Quuxplusone commented 8 years ago


Bugzilla Link	PR27737
Status	NEW
Importance	P normal
Reported by	Ivan (rozhuk.im@gmail.com)
Reported on	2016-05-13 14:11:44 -0700
Last modified on	2019-06-01 03:24:07 -0700
Version	3.8
Hardware	PC All
CC	benny.kra@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

I have some code that makes heavy use of _mm_extract_epi8().
When I build it with clang 3.8 and -msse2 (without -msse4.1), the program runs
very slowly.

To build with GCC and clang 3.4, 3.6, and 3.7 I use this macro:

#ifndef _mm_extract_epi8 /* SSE4.1 required. */
#define _mm_extract_epi8(__xmm, __n)                    \
    ((_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 1))) & 0xff)
#endif
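
As a sanity check on the emulation itself, here is a minimal sketch that compares the macro expression against the bytes the vector was loaded from. The names EXTRACT_EPI8_SSE2 and count_mismatches are hypothetical and not part of the original code; the macro is renamed only so it cannot collide with a compiler-provided _mm_extract_epi8.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Same expression as the macro above, under a hypothetical name. */
#define EXTRACT_EPI8_SSE2(xmm, n) \
    ((_mm_extract_epi16((xmm), ((n) >> 1)) >> (8 * ((n) & 1))) & 0xff)

/* Returns how many of the four sampled lanes disagree with the source bytes. */
static int count_mismatches(void)
{
    uint8_t bytes[16];
    __m128i v;
    int i, bad = 0;

    /* Values 0xf0 down to 0xe1: all >= 0x80, so a missing mask would show. */
    for (i = 0; i < 16; i++)
        bytes[i] = (uint8_t)(0xf0 - i);
    v = _mm_loadu_si128((const __m128i *)bytes);

    /* The lane index must be a compile-time constant, so no loop here. */
    bad += (EXTRACT_EPI8_SSE2(v, 0) != bytes[0]);   /* low byte of word 0 */
    bad += (EXTRACT_EPI8_SSE2(v, 1) != bytes[1]);   /* high byte of word 0 */
    bad += (EXTRACT_EPI8_SSE2(v, 14) != bytes[14]); /* low byte of word 7 */
    bad += (EXTRACT_EPI8_SSE2(v, 15) != bytes[15]); /* high byte of word 7 */
    return bad;
}
```

Calling count_mismatches() from a small main() built with -msse2 should return 0 if the emulation is correct for those lanes.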

Test results:

AMD Athlon(tm) 5350 APU with Radeon(tm) R3      (2050.04-MHz K8-class CPU)
GCC:        20391006000 (SSE4.1) /  20116413000 (SSE2)
clang 3.8:  22329895000 (SSE4.1) / 117304135000 (SSE2) !!!
clang 3.7:  22367008000 (SSE4.1) /  25542571000 (SSE2)
clang 3.6:  22306648000 (SSE4.1) /  25914115000 (SSE2)
clang 3.4:  23684031000 (SSE4.1) /  25914115000 (SSE2)

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (2999.72-MHz K8-class CPU)
GCC:        12031595000 (SSE4.1) / 12011303000 (SSE2)
clang 3.8:  12431116000 (SSE4.1) / 73035466000 (SSE2) !!!
clang 3.7:  12458839000 (SSE4.1) / 13317058000 (SSE2)
clang 3.6:  12462181000 (SSE4.1) / 14119683000 (SSE2)
clang 3.4:  13555167000 (SSE4.1) / 13178893000 (SSE2)

Quuxplusone commented 8 years ago

Without additional info about the code that's actually being run this bug report isn't useful.

Quuxplusone commented 8 years ago

Looking at this now.

The problem is that we're currently missing pre-SSE41 custom lowering for extractelement i8, so we end up spilling the vector to the stack and loading the byte back from memory (which can be particularly costly).
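
To make the two strategies concrete, here is a rough source-level sketch in C. It only illustrates the idea and is not what the backend literally emits; the function names are hypothetical, and an optimizing compiler is free to transform either version.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Roughly the fallback described above: store all 16 bytes to a stack
 * slot, then reload the single byte that was requested. */
static inline int extract_via_spill(__m128i v, unsigned idx)
{
    uint8_t slot[16];
    _mm_storeu_si128((__m128i *)slot, v);
    return slot[idx & 15];
}

/* Register-only alternative for one fixed lane (byte 5 as an example):
 * PEXTRW fetches word 2, which holds bytes 4 and 5, then a shift and a
 * mask select the high byte. */
static inline int extract_byte5_via_pextrw(__m128i v)
{
    return (_mm_extract_epi16(v, 5 >> 1) >> (8 * (5 & 1))) & 0xff;
}
```

The second form needs only SSE2 and stays in registers, which is presumably what a custom pre-SSE41 lowering would aim for.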

Quuxplusone commented 8 years ago

(In reply to comment #1)
> Without additional info about the code that's actually being run this bug
> report isn't useful.

http://www.netlab.linkpc.net/download/software/SDK/core/include/gost3411-2012.h
plus this benchmark code:
    struct rusage ru_st, ru_end;
    size_t i, cnt = 10;
    gost3411_2012_ctx_t ctx;
    uint8_t digest[GOST3411_2012_HASH_MAX_SIZE];
    uint8_t *data;
    size_t data_size = (100 * 1024 * 1024);

    data = malloc((data_size + 32));
    memset(data, 0xaa, (data_size + 32));
    data += 1;

    getrusage(RUSAGE_SELF, &ru_st);
    for (i = 0; i < cnt; i ++) {
        gost3411_2012_init(512, &ctx);
        gost3411_2012_update(&ctx, data, data_size);
        gost3411_2012_final(&ctx, digest);
    }
    getrusage(RUSAGE_SELF, &ru_end);
...

But I made a fix in my code:
#ifndef __SSE4_1__ /* SSE4.1 required. */
#undef _mm_extract_epi8 /* Clang 3.8 performance fix. */
#define _mm_extract_epi8(__xmm, __n)                    \
    (0xff & (_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 0x01))))
#endif

Comment it out or delete it to reproduce the bug.

Quuxplusone commented 7 years ago

D29841/rL297568 improved v16i8 extraction when it's the only use of the vector; multiple extractions from the same vector still go via a stack spill.
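
For user code that needs several bytes from the same vector, one possible source-level workaround is to fetch each 16-bit lane once and split it in general-purpose registers. This is a sketch under assumptions: EXTRACT_BYTE_PAIR and sum_bytes_pairwise are hypothetical names, and whether this avoids the spill on a given clang version has not been verified here.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: read bytes 2*W and 2*W+1 with a single PEXTRW and
 * split the word afterwards. W must be a compile-time constant in 0..7. */
#define EXTRACT_BYTE_PAIR(v, W, lo, hi)                 \
    do {                                                \
        int w_ = _mm_extract_epi16((v), (W));           \
        (lo) = (uint8_t)(w_ & 0xff);                    \
        (hi) = (uint8_t)((w_ >> 8) & 0xff);             \
    } while (0)

/* Example consumer: add up all 16 bytes using 8 word extractions. */
static unsigned sum_bytes_pairwise(__m128i v)
{
    uint8_t lo, hi;
    unsigned sum = 0;

    EXTRACT_BYTE_PAIR(v, 0, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 1, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 2, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 3, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 4, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 5, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 6, lo, hi); sum += lo + hi;
    EXTRACT_BYTE_PAIR(v, 7, lo, hi); sum += lo + hi;
    return sum;
}
```

This halves the number of vector-to-scalar moves compared with sixteen independent byte extractions.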
