facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

IBM Linux on z SIMD optimization #2679

Open edelsohn opened 3 years ago

edelsohn commented 3 years ago

Is your feature request related to a problem? Please describe.
The IBM z architecture provides SIMD capabilities that can be utilized for zstd optimization, similar to the SSE and NEON SIMD optimizations that have already been contributed to zstd.

Describe the solution you'd like
Optimize zstd to utilize the IBM z VX SIMD intrinsics in zstd_lazy.c and zstd_compress.c, equivalent to the existing optimizations for SSE and NEON.

Describe alternatives you've considered
Because zstd already implements architecture-specific optimizations for other architectures, hand-coded implementations have been shown to provide benefits beyond what auto-vectorization achieves.

Additional context
A financial bounty from IBM is available for negotiation. Inquiries from interested developers are welcome.

danlark1 commented 3 years ago

Given that unaligned loads and bit-extraction instructions only became available with POWER8, is it reasonable to support only POWER8+ platforms?

aqrit commented 3 years ago

hashTags can be aligned to a 16-byte boundary (for free) by setting ZSTD_ROW_HASH_TAG_OFFSET to 16.

When n == 32, they could be aligned to 32 bytes with some easy changes.
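
If that alignment guarantee holds, the row could then be loaded through an explicitly aligned pointer. A minimal sketch, assuming a hypothetical loadTagRow helper and a tags pointer that the offset trick above has made 16-byte aligned (on some targets this lets the compiler emit a cheaper aligned-load sequence):

#include <vecintrin.h>

/* Hypothetical helper, not from zstd: `tags` is assumed to be 16-byte
 * aligned thanks to the ZSTD_ROW_HASH_TAG_OFFSET trick above.
 * __builtin_assume_aligned makes that guarantee visible to the compiler. */
static __vector unsigned char loadTagRow(const unsigned char* tags) {
    const unsigned char* aligned = __builtin_assume_aligned(tags, 16);
    return *(const __vector unsigned char*)aligned;
}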

danlark1 commented 3 years ago

ZSTD_memcpy with a constant size argument already does the trick perfectly; the savings would come only from the movemask step.
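
(For context: "movemask" refers to the x86 idiom the existing SSE path relies on: compare 16 bytes at once, then compress the per-byte results into a 16-bit integer. A rough sketch of that idiom, not the exact zstd code:)

#include <emmintrin.h>  /* SSE2 */

/* Rough sketch: which of the 16 bytes at ptr equal tag? */
static unsigned sse2_getMatchMask(const unsigned char* ptr, unsigned char tag) {
    __m128i chunk = _mm_loadu_si128((const __m128i*)ptr);        /* unaligned load */
    __m128i equal = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)tag));
    return (unsigned)_mm_movemask_epi8(equal);                   /* one bit per byte */
}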

edelsohn commented 3 years ago

We can also discuss the possibility of optimizations for the IBM Power architecture with VSX SIMD.

This issue is specifically for IBM z, aka s390x, aka the mainframe.

danlark1 commented 3 years ago

> We can also discuss the possibility of optimizations for the IBM Power architecture with VSX SIMD.
>
> This issue is specifically for IBM z, aka s390x, aka the mainframe.

Thanks for clarifying, my bad

edelsohn commented 3 years ago

The SIMD compiler intrinsics are very similar for both Power and z, so it's almost as easy to implement both as it is to implement one. The x86 SSE intrinsics compatibility headers can also provide some initial hints.
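
For instance (a minimal sketch, assuming GCC's AltiVec/VSX and z/Vector language extensions), the byte-compare step can be written once for both targets, because vec_splats, vec_xl, and vec_cmpeq are spelled identically in both intrinsic sets; only the movemask-style bit extraction differs:

#if defined(__powerpc__) && defined(__VSX__)
#include <altivec.h>
#elif defined(__s390x__) && defined(__VEC__)
#include <vecintrin.h>
#endif

/* Same source for POWER (VSX) and IBM z (VX): 0xFF per matching byte. */
static __vector unsigned char matchBytes(const unsigned char* ptr, unsigned char tag) {
    return (__vector unsigned char)vec_cmpeq(vec_xl(0, ptr), vec_splats(tag));
}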

aqrit commented 3 years ago

The scalar fallback has been greatly improved in the dev branch. Vectorization of ZSTD_row_getMatchMask() is now expected to be worth less than 5%.
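
For reference, the scalar idea can be sketched with plain 64-bit SWAR arithmetic (my own illustration of the technique, not the actual dev-branch code):

#include <stdint.h>
#include <string.h>

/* Illustrative only: report which of 8 bytes equal tag as a bitmask.
 * Bit i corresponds to byte i on a little-endian target; a big-endian
 * target such as s390x sees the reversed byte order. */
static unsigned scalarMatchMask8(const unsigned char* ptr, unsigned char tag) {
    const uint64_t lo7 = 0x7F7F7F7F7F7F7F7FULL;
    uint64_t x;
    memcpy(&x, ptr, sizeof(x));                  /* alignment-safe load */
    x ^= (uint64_t)tag * 0x0101010101010101ULL;  /* matching bytes become 0 */
    /* Exact per-byte zero test: MSB of a byte is set iff that byte is 0. */
    uint64_t t = ~(((x & lo7) + lo7) | x | lo7);
    /* Gather the 8 MSBs into the low 8 bits of the result. */
    return (unsigned)((t * 0x0002040810204081ULL) >> 56);
}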

AFAIK, an IBM z VX SIMD implementation might look something like the following (with the z13 implementation being just a shot in the dark):

#if defined(__s390x__) && defined(__VEC__) && (__ARCH__ >= 11)
#include <vecintrin.h>   /* z/Vector intrinsics; __ARCH__ 11 is z13, 12 is z14 */
#endif

#if defined(__s390x__) && defined(__VEC__) && (__ARCH__ >= 12)
/* z14 and later: VECTOR BIT PERMUTE (vec_bperm_u128) gathers one bit per
   byte, much like SSE2's _mm_movemask_epi8. */
unsigned z14_i8x16_getMatchMask(const unsigned char* ptr, unsigned char tag) {
    /* Bit index of each byte's MSB within the 128-bit compare result,
       ordered so that mask bit i corresponds to tag byte i. */
    const __vector unsigned char idx = (__vector unsigned char) {
        0x78, 0x70, 0x68, 0x60, 0x58, 0x50, 0x48, 0x40,
        0x38, 0x30, 0x28, 0x20, 0x18, 0x10, 0x08, 0x00
    };

    /* 0xFF where the tag byte matches, 0x00 elsewhere. */
    __vector unsigned char mask =
        (__vector unsigned char)vec_cmpeq(vec_xl(0, ptr), vec_splats(tag));
    /* VBPERM deposits the 16 selected bits into bytes 6-7 of the result,
       i.e. the low half of 32-bit element 1. */
    __vector unsigned int bitmask = (__vector unsigned int)vec_bperm_u128(mask, idx);
    return vec_extract(bitmask, 1);
}
#endif

#if defined(__s390x__) && defined(__VEC__) && (__ARCH__ == 11)
/* z13 lacks VBPERM, so emulate the bit gather with a Galois-field
   (carry-less) multiply-sum instead. */
unsigned z13_i8x16_getMatchMask(const unsigned char* ptr, unsigned char tag) {
    /* Carry-less multipliers that move each byte's low bit to a distinct
       bit position within its 32-bit lane. */
    const __vector unsigned int extractMagic = (__vector unsigned int) {
        0x08040201, 0x80402010, 0x08040201, 0x80402010
    };

    /* 0xFF where the tag byte matches, 0x00 elsewhere. */
    __vector unsigned char t0 =
        (__vector unsigned char)vec_cmpeq(vec_xl(0, ptr), vec_splats(tag));
    /* |-1| == 1: reduce each matching byte to a single low bit. */
    __vector unsigned int t1 = (__vector unsigned int)vec_abs((__vector signed char)t0);
    /* vec_gfmsum carry-less-multiplies each pair of 32-bit lanes and XORs
       the products into 64-bit sums, concentrating the match bits. */
    __vector unsigned char t2 = (__vector unsigned char)vec_gfmsum(t1, extractMagic);
    /* Stitch the two 8-bit halves into a 16-bit match mask. */
    return (((unsigned)vec_extract(t2, 12)) << 8) | vec_extract(t2, 4);
}
#endif
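
One way to sanity-check such sketches against a trivial scalar reference (a hypothetical test of my own, assuming the z14 variant above and that mask bit i corresponds to tag byte i):

#include <stdio.h>
#include <stdlib.h>

/* Scalar reference: bit i is set iff ptr[i] == tag. */
static unsigned referenceMatchMask(const unsigned char* ptr, unsigned char tag) {
    unsigned mask = 0;
    for (int i = 0; i < 16; ++i)
        if (ptr[i] == tag) mask |= 1u << i;
    return mask;
}

int main(void) {
    unsigned char buf[16];
    for (int trial = 0; trial < 100000; ++trial) {
        for (int i = 0; i < 16; ++i) buf[i] = (unsigned char)(rand() & 3);
        unsigned char tag = (unsigned char)(rand() & 3);  /* few values -> many hits */
        if (z14_i8x16_getMatchMask(buf, tag) != referenceMatchMask(buf, tag)) {
            printf("mismatch at trial %d\n", trial);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}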

I wonder if the "find best match in bucket" function could be further improved: