Open Quuxplusone opened 8 years ago
Bugzilla Link | PR27737 |
Status | NEW |
Importance | P normal |
Reported by | Ivan (rozhuk.im@gmail.com) |
Reported on | 2016-05-13 14:11:44 -0700 |
Last modified on | 2019-06-01 03:24:07 -0700 |
Version | 3.8 |
Hardware | PC All |
CC | benny.kra@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
Without additional info about the code that's actually being run this bug report isn't useful.
Looking at this now.
The problem is that we're currently missing pre-SSE41 custom lowering for extractelement i8, so we end up spilling the vector to the stack and loading the byte back (which can be particularly costly).
(In reply to comment #1)
> Without additional info about the code that's actually being run this bug
> report isn't useful.
http://www.netlab.linkpc.net/download/software/SDK/core/include/gost3411-2012.h
+
struct rusage ru_st, ru_end;
size_t i, cnt = 10;
gost3411_2012_ctx_t ctx;
uint8_t digest[GOST3411_2012_HASH_MAX_SIZE];
uint8_t *data;
size_t data_size = (100 * 1024 * 1024);
data = malloc((data_size + 32));
memset(data, 0xaa, (data_size + 32));
data += 1;
getrusage(RUSAGE_SELF, &ru_st);
for (i = 0; i < cnt; i++) {
	gost3411_2012_init(512, &ctx);
	gost3411_2012_update(&ctx, data, data_size);
	gost3411_2012_final(&ctx, digest);
}
getrusage(RUSAGE_SELF, &ru_end);
...
But I made a fix in my code:
#ifndef __SSE4_1__ /* SSE4.1 required. */
#undef _mm_extract_epi8 /* Clang 3.8 performance fix. */
#define _mm_extract_epi8(__xmm, __n) \
(0xff & (_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 0x01))))
#endif
Comment it out or delete it to reproduce the bug.
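For anyone who wants to check the workaround in isolation, here is a minimal sketch that uses SSE2 only. It compares the 16-bit extract-and-shift expression from the macro above against a plain store-to-memory-and-load-byte reference, which is roughly what the stack spill amounts to. The names EXTRACT_EPI8_SSE2, extract_via_spill and count_mismatches are made up for this example and are not part of any header; the index has to be a compile-time constant because _mm_extract_epi16 takes an immediate.

```c
#include <emmintrin.h> /* SSE2: _mm_extract_epi16, _mm_loadu_si128, _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical name. Same expression as the workaround macro:
 * take the 16-bit lane that holds byte n, shift the wanted byte down, mask it.
 * n must be an integer constant in the range 0..15. */
#define EXTRACT_EPI8_SSE2(xmm, n) \
	(0xff & (_mm_extract_epi16((xmm), ((n) >> 1)) >> (8 * ((n) & 0x01))))

/* Reference path: write the whole vector to memory and read one byte back. */
static int extract_via_spill(__m128i v, int n)
{
	uint8_t buf[16];

	_mm_storeu_si128((__m128i *)buf, v);
	return (buf[n]);
}

/* Bump the mismatch counter if the two paths disagree for byte n. */
#define CHECK_BYTE(n) \
	do { \
		if (EXTRACT_EPI8_SSE2(v, n) != extract_via_spill(v, (n))) \
			mismatches++; \
	} while (0)

/* Returns how many of the 16 byte positions differ between the two paths. */
static int count_mismatches(void)
{
	uint8_t bytes[16];
	int i, mismatches = 0;
	__m128i v;

	/* Distinct values with the high bit set, to catch sign-extension slips. */
	for (i = 0; i < 16; i++)
		bytes[i] = (uint8_t)(0x80 + i * 7);
	v = _mm_loadu_si128((const __m128i *)bytes);

	CHECK_BYTE(0);  CHECK_BYTE(1);  CHECK_BYTE(2);  CHECK_BYTE(3);
	CHECK_BYTE(4);  CHECK_BYTE(5);  CHECK_BYTE(6);  CHECK_BYTE(7);
	CHECK_BYTE(8);  CHECK_BYTE(9);  CHECK_BYTE(10); CHECK_BYTE(11);
	CHECK_BYTE(12); CHECK_BYTE(13); CHECK_BYTE(14); CHECK_BYTE(15);

	return (mismatches);
}
```

Calling count_mismatches() from a main() on an x86 target with SSE2 enabled should return 0 if the two paths agree. This only checks that the results match; it says nothing about which instructions the compiler actually emits.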
D29841/rL297568 improved v16i8 extraction if it's the only use of the vector; multiple extractions still go via a stack spill.