edelsohn opened this issue 3 years ago
Given that unaligned loads and bit extraction instructions only became available starting with power8, is it reasonable to support only power8+ platforms?
hashTags can be aligned to a 16-byte boundary (for free) by setting ZSTD_ROW_HASH_TAG_OFFSET to 16; when n == 32, they could be aligned to 32 with some easy changes.

ZSTD_memcpy with a constant size argument already does the trick, so the savings will come only from movemask.
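For reference, the mask in question amounts to the following minimal scalar sketch (illustrative only, not zstd's actual code; it ignores details such as the row's head position and assumes a 16-entry row):

static unsigned scalar_getMatchMask16(const unsigned char* tags, unsigned char tag)
{
    unsigned mask = 0;
    int i;
    /* Set bit i when tags[i] equals the searched tag; the SIMD versions
     * compute the same mask with a byte compare plus a movemask-style
     * bit gather. */
    for (i = 0; i < 16; ++i)
        mask |= (unsigned)(tags[i] == tag) << i;
    return mask;
}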
We can also discuss the possibility of optimizations for the IBM Power architecture with VSX SIMD.
This issue is specifically for IBM Z, aka s390x, aka mainframe.
Thanks for clarifying, my bad
The SIMD compiler intrinsics are very similar for both Power and z, so it's almost as easy to implement both as it is to implement one. And the x86 SSE intrinsics compatibility headers can provide some initial hints.
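For a concrete starting point, the SSE2 pattern that such a compatibility header would have to reproduce is roughly the following (a sketch of the x86 side only, not a drop-in for zstd's sources):

#include <emmintrin.h>

static unsigned sse2_getMatchMask(const unsigned char* ptr, unsigned char tag)
{
    /* Compare 16 tag bytes against the searched tag, then collapse the
     * per-byte 0x00/0xFF compare result into a 16-bit mask. */
    const __m128i chunk     = _mm_loadu_si128((const __m128i*)(const void*)ptr);
    const __m128i equalMask = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)tag));
    return (unsigned)_mm_movemask_epi8(equalMask);
}

The part that both VX and VSX have to emulate is _mm_movemask_epi8, which is exactly what the bit-permute and gfmsum tricks in the sketch below are doing.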
The scalar fallback has been greatly improved in the dev branch. Vectorization of ZSTD_row_getMatchMask() is now expected to be worth less than 5%. AFAIK, an IBMz VX SIMD implementation might look something like the following (with the z13 implementation being just a shot in the dark):
#if defined(__s390x__) && defined(__VEC__) && (__ARCH__ >= 11)
#include <vecintrin.h>
#endif

#if defined(__s390x__) && defined(__VEC__) && (__ARCH__ >= 12)
/* z14 and later: compare the 16 tag bytes, then gather one bit per byte
 * with vec_bperm_u128, giving a movemask-style result. */
unsigned z14_i8x16_getMatchMask(const unsigned char* ptr, unsigned char tag)
{
    const __vector unsigned char idx = (__vector unsigned char) {
        0x78, 0x70, 0x68, 0x60, 0x58, 0x50, 0x48, 0x40,
        0x38, 0x30, 0x28, 0x20, 0x18, 0x10, 0x08, 0x00
    };
    const __vector unsigned char mask =
        (__vector unsigned char)vec_cmpeq(vec_xl(0, ptr), vec_splats(tag));
    const __vector unsigned int bitmask =
        (__vector unsigned int)vec_bperm_u128(mask, idx);
    return vec_extract(bitmask, 1);
}
#endif

#if defined(__s390x__) && defined(__VEC__) && (__ARCH__ == 11)
/* z13: no vec_bperm_u128, so reduce the 0x00/0xFF compare bytes to 0/1 and
 * collapse them into bits with a Galois-field multiply-sum (vec_gfmsum). */
unsigned z13_i8x16_getMatchMask(const unsigned char* ptr, unsigned char tag)
{
    const __vector unsigned int extractMagic = (__vector unsigned int) {
        0x08040201, 0x80402010, 0x08040201, 0x80402010
    };
    const __vector unsigned char t0 =
        (__vector unsigned char)vec_cmpeq(vec_xl(0, ptr), vec_splats(tag));
    const __vector unsigned int t1 =
        (__vector unsigned int)vec_abs((__vector signed char)t0);
    const __vector unsigned char t2 =
        (__vector unsigned char)vec_gfmsum(t1, extractMagic);
    return (((unsigned)vec_extract(t2, 12)) << 8) | vec_extract(t2, 4);
}
#endif
I wonder if the "find best match in bucket" function could be further improved, in particular the memcmp-style match-length computation (ZSTD_count).
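In case that loop is the next target, here is a simplified sketch of the general technique (an assumption about the approach, not the actual ZSTD_count; the function name is illustrative):

#include <stddef.h>
#include <string.h>

/* Compare 8 bytes at a time; on the first mismatch, XOR the two words and
 * count zero bits to locate the first differing byte. */
static size_t countCommonBytes(const unsigned char* a, const unsigned char* b, size_t len)
{
    size_t n = 0;
    while (n + 8 <= len) {
        unsigned long long x, y;
        memcpy(&x, a + n, 8);
        memcpy(&y, b + n, 8);
        if (x != y) {
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
            return n + ((size_t)__builtin_clzll(x ^ y) >> 3);  /* big-endian: first byte is high-order */
#else
            return n + ((size_t)__builtin_ctzll(x ^ y) >> 3);  /* little-endian: first byte is low-order */
#endif
        }
        n += 8;
    }
    while (n < len && a[n] == b[n]) n++;
    return n;
}

The detail that matters for s390x is the big-endian path: it has to use a count-leading-zeros rather than count-trailing-zeros to locate the first differing byte.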
Is your feature request related to a problem? Please describe.
IBM z architecture provides SIMD capabilities that can be utilized for zstd optimization, similar to the SSE and Neon SIMD optimizations that have been contributed to zstd.

Describe the solution you'd like
Optimize zstd to utilize IBMz VX SIMD intrinsics in zstd_lazy.c and zstd_compress.c, equivalent to the optimizations for SSE and Neon.

Describe alternatives you've considered
Because zstd already implements architecture-specific optimizations for other architectures, hand-coded implementations have been shown to provide benefits beyond auto-vectorization.

Additional context
Financial bounty from IBM available for negotiation. Inquiries from interested developers welcome.