Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

General discussion about XXH3 #175

Closed easyaspi314 closed 3 years ago

easyaspi314 commented 5 years ago

This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h):

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs.

A full speed analysis will be published; it requires a lot more space than this comment can handle.

In general, expect XXH3 to run about 2x faster on large inputs, and >3x faster on small ones, though the exact difference depends on the platform.

The algorithm is portable: it will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

XXH3 offers 2 variants, _64bits and _128bits. The first 64-bit field of the _128bits variant is the same as the _64bits result. However, if only 64 bits are needed, prefer calling the _64bits variant: it reduces the amount of mixing, resulting in faster speed on small inputs.

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and a canonical representation. Long-term optional features may include custom secret keys, and secret key generation.

There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list complete.
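To make the one-shot API above concrete, usage currently looks roughly like this (a sketch; the exact header layout in the xxh3 branch may differ, and the result is subject to the experimental caveats above):

```c
#include <stdio.h>
#include "xxhash.h"
#include "xxh3.h"   /* xxh3 branch; one-shot hashing only for now */

int main(void)
{
    const char msg[] = "xxh3 test input";
    /* One-shot: hash the whole buffer in a single call (no streaming yet). */
    unsigned long long const h64 = XXH3_64bits(msg, sizeof(msg) - 1);
    printf("XXH3_64bits = %016llx\n", h64);
    return 0;
}
```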

Zhentar commented 5 years ago

Translating my AVX2 optimizations to SSE2 got it nice and performant for me. Unfortunately scalar continues to kick my ass. Think I'll call it good enough for now & wait for the updated algorithm to be a bit more finalized before evaluating & optimizing short keys.

| Method | ByteLength | Mean | Throughput |
| --- | --- | --- | --- |
| xxHash64_Scalar | 102400 | 8.390 us | 11,639.8 MB/s |
| xxHash3_AVX2 | 102400 | 2.444 us | 39,959.6 MB/s |
| xxHash3_SSE2 | 102400 | 4.917 us | 19,859.1 MB/s |
| xxHash3_Scalar | 102400 | 23.433 us | 4,167.5 MB/s |

(@easyaspi314 - it's effectively a completely new set of SSE2 code, so there's a fair chance whatever caused your hang has resolved itself.)

easyaspi314 commented 5 years ago

Huh. Apparently, System.Runtime.Intrinsics.X86.Bmi2.X64.MultiplyNoFlags will cause a hang on the first preview. I had to find that out the hard way because apparently the debugger hates me and won't let me pause.

I updated to preview3, and it worked fine. I honestly didn't even realize I was on the wrong preview.

// * Summary *

BenchmarkDotNet=v0.11.4, OS=macOS Mojave 10.14.2 (18C54) [Darwin 18.2.0]
Intel Core i7-2635QM CPU 2.00GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview3-010431
  [Host]     : .NET Core 3.0.0-preview3-27503-5 (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 64bit RyuJIT
  Job-RNHJPJ : .NET Core 3.0.0-preview3-27503-5 (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 64bit RyuJIT

MaxRelativeError=0.05  EnvironmentVariables=COMPlus_TieredCompilation=0
| Method | ByteLength | Mean | Error | StdDev | Throughput |
| --- | --- | --- | --- | --- | --- |
| xxHash64_Scalar | 102400 | 16.766 us | 0.1894 us | 0.1772 us | 5,824.8 MB/s |
| xxHash3_AVX2 | 102400 | 6.826 us | 0.0369 us | 0.0345 us | 14,306.4 MB/s |
| xxHash3_SSE2 | 102400 | 6.831 us | 0.0400 us | 0.0355 us | 14,296.1 MB/s |
| xxHash3_Scalar | 102400 | 34.039 us | 0.3778 us | 0.3534 us | 2,868.9 MB/s |

For reference, I made this change, which I highly suggest you adopt:

#if NETCOREAPP3_0
                        if (System.Runtime.Intrinsics.X86.Avx2.IsSupported && UseAvx2)
                        {
                                LongSequenceHash_AVX2(ref acc2, data);
                        }
                        // fall back to SSE2 if we request AVX2 on an unsupported machine.
                        else if (System.Runtime.Intrinsics.X86.Sse2.IsSupported && (UseSse2 || UseAvx2))
                        {
                                LongSequenceHash_SSE2(ref acc2, data);
                        }
                        else
#endif
                        {
                                LongSequenceHash_Scalar(ref acc2, data);
                        }
easyaspi314 commented 5 years ago

Unrolling XXH64 8 times seems to help a little bit, although it could just be fluctuations.

Here are the results compared with native.

| Method | ByteLength | Mean | Error | StdDev | Throughput |
| --- | --- | --- | --- | --- | --- |
| xxHash64_Scalar | 102400 | 15.441 us | 0.1430 us | 0.1268 us | 6,324.4 MB/s |
| xxHash3_SSE2 | 102400 | 6.912 us | 0.0885 us | 0.0785 us | 14,129.3 MB/s |
| xxHash3_Scalar | 102400 | 33.865 us | 0.1445 us | 0.1207 us | 2,883.7 MB/s |
| XXH64_Native (clang -march=sandybridge, inline asm hack) | 102400 | 12.470 us | 0.0742 us | 0.0694 us | 7,831.0 MB/s |
| XXH3_Native (clang -march=sandybridge) | 102400 | 6.185 us | 0.0356 us | 0.0315 us | 15,789.8 MB/s |
| XXH64_Native (clang -march=x86-64, no inline asm) | 102400 | 14.538 us | 0.1167 us | 0.1092 us | 6,717.1 MB/s |
| XXH3_Native (clang -march=x86-64) | 102400 | 6.323 us | 0.0391 us | 0.0347 us | 15,444.9 MB/s |

Note: The inline assembly hack was this:

static U64 XXH64_round(U64 acc, U64 input)
{
    __asm__(
        "imulq  %[PRIME64_2], %[input]\n"
        "addq   %[input], %[acc]\n"
        "shldq  $31, %[acc], %[acc]\n"
        "imulq  %[PRIME64_1], %[acc]"
        : [acc] "+r" (acc), [input] "+r" (input)
        : [PRIME64_1] "r" (PRIME64_1), [PRIME64_2] "r" (PRIME64_2)
    );
    return acc;
}

This is because, while Clang properly tunes for Sandy Bridge and uses shld (which has better throughput than rol there for some reason), it also generates 4 excess mov instructions at the bottom of the loop for no reason.

Either way, that is very nice performance, good job!

Edit: How I interfaced with native:

[DllImport("/Users/user/xxh3-fix/libxxhash.dylib", CallingConvention = CallingConvention.Cdecl)]
internal extern static ulong XXH64(byte[] data, UIntPtr length, ulong seed);

[DllImport("/Users/user/xxh3-fix/libxxhash.dylib", CallingConvention = CallingConvention.Cdecl)]
internal extern static ulong XXH3_64bits(byte[] data, UIntPtr length);

[Benchmark]
public unsafe ulong XXH64_Native()
{
    return XXH64(_bytes, (UIntPtr)_bytes.Length, 0);
}

[Benchmark]
public unsafe ulong XXH3_Native()
{
    return XXH3_64bits(_bytes, (UIntPtr)_bytes.Length);
}

internal extern :thinking:

easyaspi314 commented 5 years ago

Scalar isn't that fast, tbh.

The reason XXH3 is so fast is because it takes advantage of SSE2 and NEON which are guaranteed to be supported in all modern desktop and mobile CPUs.

The scalar version was made to be decent, and it is pretty decent in most scenarios.

But, yeah, for perspective, this is what native XXH3 looks like with no SSE2:

clang -march=x86-64 -O3 -DXXH_VECTORIZE=0 -mno-sse # had to remove for xxhsum

./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    42755 it/s ( 4175.3 MB/s)
XXH32 unaligned     :     102400 ->    42980 it/s ( 4197.3 MB/s)
XXH64               :     102400 ->    69066 it/s ( 6744.8 MB/s)
XXH64 unaligned     :     102400 ->    67172 it/s ( 6559.7 MB/s)
XXH3_64bits         :     102400 ->    47720 it/s ( 4660.1 MB/s)
XXH3_64b unaligned  :     102400 ->    45991 it/s ( 4491.3 MB/s)

clang -m32 -O3 -DXXH_VECTORIZE=0 -mno-sse

./xxhsum 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    35037 it/s ( 3421.6 MB/s)
XXH32 unaligned     :     102400 ->    34626 it/s ( 3381.4 MB/s)
XXH64               :     102400 ->    13667 it/s ( 1334.6 MB/s)
XXH64 unaligned     :     102400 ->    13618 it/s ( 1329.9 MB/s)
XXH3_64bits         :     102400 ->    24640 it/s ( 2406.2 MB/s)
XXH3_64b unaligned  :     102400 ->    24091 it/s ( 2352.7 MB/s)

Considering that it is C#, I wouldn't be too worried about it, because most people are going to be running on x86_64.

Zhentar commented 5 years ago

Thanks, that is a helpful point of comparison; glad to see the gap between my version and the native version is a fair bit smaller than I had thought. But unfortunately, since the SSE2 & AVX2 code requires .NET Core 3.0, C# executed by Mono or the plain old .NET Framework (i.e. the vast majority of it today) will be using the scalar version, so I am still going to concern myself with optimizing it as much as possible.

easyaspi314 commented 5 years ago

@Cyan4973 @42Bastian @Zhentar @sergeevabc @ifduyue

Wanna try this snippet and make sure it detects features properly on x86/x86_64? Try as many compilers as possible.

#include <stdio.h>
#include <string.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_AMD64) || defined(_M_X64))
#  include <intrin.h>
#  define CPUIDEX __cpuidex
#  define CPUID __cpuid
#elif (defined(__GNUC__) || defined(__TINYC__)) && (defined(__i386__) || defined(__x86_64__))
static void CPUIDEX(int *cpuInfo, int function_id, int subfunction_id)
{
    int eax, ecx;
    eax = function_id;
    ecx = subfunction_id;
    __asm__ __volatile__("cpuid"
      : "+a" (eax), "=b" (cpuInfo[1]), "=c" (ecx), "=d" (cpuInfo[3]));
    cpuInfo[0] = eax;
    cpuInfo[2] = ecx;
}
static void CPUID(int *cpuInfo, int function_id)
{
    CPUIDEX(cpuInfo, function_id, 0);
}
#else
#   warning "x86 or x86_64 please, add your compiler to the macros if needed and tell me"
static void CPUIDEX(int *cpuInfo, int function_id, int subfunction_id)
{
    memset(cpuInfo, 0, 4 * sizeof(int));
}
static void CPUID(int *cpuInfo, int function_id)
{
    memset(cpuInfo, 0, 4 * sizeof(int));
}
#endif
int main() {
    int max, data[4];
    CPUID(data, 0);
    max = data[0];

    if (max >= 1) {
        CPUID(data, 1);
        printf("sse2: %d\n", !!(data[3] & (1 << 26)));
        printf("avx: %d\n", !!(data[2] & (1 << 28)));
     }
    if (max >= 7) {
        CPUID(data, 7);
        printf("avx2: %d\n", !!(data[1] & (1 << 5)));
    }
    return 0;
}
Cyan4973 commented 5 years ago

The collision analyzer has been put to good use during the weekend. It's a very precise tool, which makes it possible to observe subtle differences that are otherwise impossible to notice with "standard" test tools (aka smhasher et al.).

This led to several small but key modifications to the algorithm, bringing it in line with the maximal theoretical collision rate.

As part of the changes, the function mul128_fold64 has been slightly modified, to use ^ for folding instead of +. I could update most variants except the aarch64 one: https://github.com/Cyan4973/xxHash/blob/xxh3/xxh3.h#L163

@easyaspi314, would you mind having a look at it? Thanks!
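For anyone following along, the gist of the change is roughly this (a sketch of the GCC/Clang __uint128_t path only; the branch also carries per-platform intrinsic and inline-asm variants, which is what the aarch64 question above is about):

```c
#include <stdint.h>

/* Multiply two 64-bit values and "fold" the 128-bit product down to 64 bits.
 * The fold now uses ^ instead of + (the change discussed above). */
static uint64_t mul128_fold64(uint64_t lhs, uint64_t rhs)
{
    __uint128_t const product = (__uint128_t)lhs * (__uint128_t)rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}
```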

easyaspi314 commented 5 years ago

I have inserted a WIP CPU dispatcher into xxhsum. I have binaries for macOS and Linux here for both 32-bit and 64-bit. I am too lazy to start up my Windows PC, so lmk if you actually want a build.

The Linux binaries are static, and they should work in WSL.

xxhsum-binaries-linux-darwin.zip

If you run the benchmark, it should tell you what version of xxh3 it runs:

./xxhsum-darwin-i686 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    34108 it/s ( 3330.8 MB/s)
XXH32 unaligned     :     102400 ->    33667 it/s ( 3287.8 MB/s)
XXH64               :     102400 ->    13455 it/s ( 1314.0 MB/s)
XXH64 unaligned     :     102400 ->    13527 it/s ( 1321.0 MB/s)
1-XXH3_64bits       :     102400 ->
Using HashLong version __XXH3_HASH_LONG_SSE2

XXH3_64bits         :     102400 ->   158898 it/s (15517.4 MB/s)
XXH3_64b unaligned  :     102400 ->   158204 it/s (15449.6 MB/s)

If you have a CPU with AVX2, it should say __XXH3_HASH_LONG_AVX2.

Edit: Note that these were compiled for generic i686 and generic x86-64. They all work on my system.

easyaspi314 commented 5 years ago

Also, sure, I'll take a look at the multiply code.

I don't know if it is worth it to do the inline assembly for that. Pretty much the only aarch64 compilers are GCC or LLVM-based, and therefore __uint128_t should be available.

MSVC for ARM/aarch64 is pretty much dead (was it ever alive 😛?), and besides, MSVC uses a different inline-assembly syntax.

easyaspi314 commented 5 years ago

Tested Windows binaries I cross-compiled with mingw. (Cross compiling for Windows? 🤯)

xxhsum-windows-binaries.zip reuploaded

Please test so I know whether my dispatcher is working correctly; I currently have no idea, because I only have Sandy Bridge.

easyaspi314 commented 5 years ago

I'll test when I get home if my sister will ever give up her laptop. :joy:

Actually, it works fine in Wine.

Zhentar commented 5 years ago

It's picking SSE2 rather than AVX2 for me

easyaspi314 commented 5 years ago

Oh, I think I know the problem: it isn't clearing ecx. Here's a new build with Windows, Mac, and Linux binaries.

xxhsum-binaries-fix.zip
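If it helps, my guess at the fix, relative to the CPUID snippet earlier in the thread, is to make ecx an input/output operand so the subfunction id is actually loaded before cpuid runs (an assumption on my part; I haven't seen the exact change in these binaries):

```c
/* GCC/Clang path: "+c" (instead of "=c") makes ecx an in/out operand, so
 * subfunction_id really ends up in ecx before cpuid executes.
 * Leaf 7 (extended features, AVX2 bit) requires ecx = 0. */
static void CPUIDEX(int *cpuInfo, int function_id, int subfunction_id)
{
    int eax = function_id;
    int ecx = subfunction_id;
    __asm__ __volatile__("cpuid"
        : "+a" (eax), "=b" (cpuInfo[1]), "+c" (ecx), "=d" (cpuInfo[3]));
    cpuInfo[0] = eax;
    cpuInfo[2] = ecx;
}
```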

easyaspi314 commented 5 years ago

When I get home, I might see if I can find and boot my dinosaur Pentium III laptop with Windows 2000 to see if it chooses the scalar version just because.

easyaspi314 commented 5 years ago

Heh, the virtual desktop at my college has AVX2.

xxh3-windows-avx

Cyan4973 commented 5 years ago

I can run a Windows 10 VM from a Mac laptop at work. Here are some results:

.\xxhsum-windows-i686.exe -b5
xxhsum-windows-i686.exe 0.7.0 (32-bits i386 little endian), GCC 8.3.0, by Yann Collet
Sample of 100 KB...
1-XXH3_64bits       :     102400 ->
Using HashLong version __XXH3_HASH_LONG_AVX2

XXH3_64bits         :     102400 ->   512000 it/s (50000.0 MB/s)

.\xxhsum-windows-x86_64.exe -b5
xxhsum-windows-x86_64.exe 0.7.0 (64-bits x86_64 + SSE2 little endian), GCC 8.3.0, by Yann Collet
Sample of 100 KB...
1-XXH3_64bits       :     102400 ->
Using HashLong version __XXH3_HASH_LONG_AVX2

XXH3_64bits         :     102400 ->   513542 it/s (50150.5 MB/s)

So both versions correctly detect and use the AVX2 code path.

easyaspi314 commented 5 years ago

Cool. I'll make a branch.

By the way, I also made it so we only have to write the x86 code once with the power of…

( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (M) (A) (C) (R) (O) (S) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )

#define XXH_CONCAT_2(x, y) x##y
#define XXH_CONCAT(x, y) XXH_CONCAT_2(x, y)

/* These macros and typedefs make it so we only have to write the x86 code once.
 *                           SSE2               AVX2
 * XXH_vec           =     __m128i            __m256i
 * XXH_MM(add_epi32) =  _mm_add_epi32    _mm256_add_epi32
 * XXH_MM_SI(loadu)  = _mm_loadu_si128  _mm256_loadu_si256 */

#if XXH_VECTOR == XXH_AVX2
   typedef __m256i XXH_vec;
#  define XXH_MM(x)    XXH_CONCAT(_mm256_, x)
#  define XXH_MM_SI(x) XXH_MM(XXH_CONCAT(x, _si256))
#elif XXH_VECTOR == XXH_SSE2
   typedef __m128i XXH_vec;
#  define XXH_MM(x)    XXH_CONCAT(_mm_, x)
#  define XXH_MM_SI(x) XXH_MM(XXH_CONCAT(x, _si128))
#endif
/* lame, I know */
#define VEC_SIZE sizeof(XXH_vec)

XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{

#if (XXH_VECTOR == XXH_AVX2) || (XXH_VECTOR == XXH_SSE2)
    assert((((size_t)acc) & (VEC_SIZE - 1)) == 0);
    {
        ALIGN(VEC_SIZE) XXH_vec* const xacc  =       (XXH_vec *) acc;
        const           XXH_vec* const xdata = (const XXH_vec *) data;
        const           XXH_vec* const xkey  = (const XXH_vec *) key;

        size_t i;
        for (i=0; i < STRIPE_LEN / VEC_SIZE; i++) {
            XXH_vec const d   = XXH_MM_SI(loadu) (xdata+i);
            XXH_vec const k   = XXH_MM_SI(loadu) (xkey+i);
            XXH_vec const dk  = XXH_MM_SI(xor) (d,k);                                  /* uint32 dk[8]  = {d0^k0, d1^k1, d2^k2, d3^k3, ...} */
            XXH_vec const res = XXH_MM(mul_epu32) (dk, XXH_MM(shuffle_epi32) (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
            XXH_vec const add = XXH_MM(add_epi64) (d, xacc[i]);
            xacc[i]  = XXH_MM(add_epi64) (res, add);
        }
    }
#else ...
}

XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
#if (XXH_VECTOR == XXH_AVX2) || (XXH_VECTOR == XXH_SSE2)

    assert((((size_t)acc) & (VEC_SIZE - 1)) == 0);
    {
        ALIGN(VEC_SIZE) XXH_vec* const xacc =       (XXH_vec*) acc;
        const           XXH_vec* const xkey = (const XXH_vec*) key;

        const XXH_vec k1 = XXH_MM(set1_epi32) ((int)PRIME32_1);
        const XXH_vec k2 = XXH_MM(set1_epi32) ((int)PRIME32_2);

        size_t i;
        for (i=0; i < STRIPE_LEN / VEC_SIZE; i++) {
            XXH_vec data = xacc[i];
            XXH_vec const shifted = XXH_MM(srli_epi64) (data, 47);
            data = XXH_MM_SI(xor) (data, shifted);

            {
                XXH_vec const k   = XXH_MM_SI(loadu) (xkey+i);
                XXH_vec const dk  = XXH_MM_SI(xor)   (data,k);          /* U32 dk[4]  = {d0^k0, d1^k1, d2^k2, d3^k3} */

                XXH_vec const dk1 = XXH_MM(mul_epu32) (dk,k1);

                XXH_vec const d2  = XXH_MM(shuffle_epi32) (dk, 0x31);
                XXH_vec const dk2 = XXH_MM(mul_epu32) (d2,k2);

                xacc[i] = XXH_MM_SI(xor) (dk1, dk2);
            }
        }
    }

#else // ...
}

(I know I need to realign)

As for my Makefile logic, I have it so it will compile the multiple code paths if you add MULTI_TARGET=1 to make. IDK if we should do it by default or not, so for now it is optional.

Basically, the hashLong code is now in xxh3-target.c. If XXH_MULTI_TARGET is not defined, it will just be #included in xxh3.h; otherwise, it will be compiled 3 times to make xxh3-scalar.o, xxh3-sse2.o, and xxh3-avx2.o.
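Roughly, the runtime dispatch then looks like the sketch below (a self-contained mock-up, not the actual branch code; the XXH3_hashLong_* names and signatures here are hypothetical stand-ins, while cpu_mode / XXH_CPU_MODE_* / XXH3_featureTest show up in a later comment):

```c
#include <stddef.h>
#include <stdio.h>

/* cpu_mode would be set once at startup by XXH3_featureTest() via CPUID
 * (see the detection snippet earlier in the thread). */
typedef enum { XXH_CPU_MODE_SCALAR, XXH_CPU_MODE_SSE2, XXH_CPU_MODE_AVX2 } XXH_cpu_mode;
static XXH_cpu_mode cpu_mode = XXH_CPU_MODE_SCALAR;

/* Stand-ins for the per-target entry points, one per xxh3-*.o */
static void XXH3_hashLong_scalar(const void* data, size_t len) { (void)data; printf("scalar path (%zu bytes)\n", len); }
static void XXH3_hashLong_sse2  (const void* data, size_t len) { (void)data; printf("SSE2 path (%zu bytes)\n", len); }
static void XXH3_hashLong_avx2  (const void* data, size_t len) { (void)data; printf("AVX2 path (%zu bytes)\n", len); }

static void XXH3_hashLong_dispatch(const void* data, size_t len)
{
    switch (cpu_mode) {
    case XXH_CPU_MODE_AVX2:   XXH3_hashLong_avx2(data, len);   break;
    case XXH_CPU_MODE_SSE2:   XXH3_hashLong_sse2(data, len);   break;
    default:                  XXH3_hashLong_scalar(data, len); break;
    }
}

int main(void)
{
    char buf[1024] = {0};
    XXH3_hashLong_dispatch(buf, sizeof(buf));
    return 0;
}
```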

# XXX: enable by default?
ifndef MULTI_TARGET
   TARGET_OBJS :=
else
  # Multi targeting only works for x86 and x86_64 right now.
  ifneq (,$(filter __i386__ __x86_64__ _M_IX86 _M_X64 _M_AMD64,$(shell $(CC) -E -dM -xc /dev/null)))
    TARGET_OBJS := xxh3-avx2.o xxh3-sse2.o xxh3-scalar.o
    CFLAGS += -DXXH_MULTI_TARGET
  else
    TARGET_OBJS :=
  endif
endif

xxhsum: xxhash.o xxhsum.o $(TARGET_OBJS)

xxh3-avx2.o: xxh3-target.c xxhash.h
        $(CC) -c $(FLAGS) $< -mavx2 -o $@
xxh3-sse2.o: xxh3-target.c xxhash.h
        $(CC) -c $(FLAGS) $< -msse2 -mno-sse3 -o $@
xxh3-scalar.o: xxh3-target.c xxhash.h
        $(CC) -c $(FLAGS) $< -mno-sse2 -o $@
easyaspi314 commented 5 years ago

I was checking Agner Fog's timing tables and I think we should change pshufd to psrlq.

There is little to no difference on modern processors, but pshufd tends to be slower than psrlq on older chips:

| CPU | pshufd cycles | pshufd latency | psrlq cycles | psrlq latency |
| --- | --- | --- | --- | --- |
| AMD K8 | 3 | 3 | 2 | 2 |
| AMD K10 | 1 | 3 | 1 | 3 |
| AMD Big Rigs | 1 | 2 | 1 | 2 |
| AMD Ryzen+ | 2 | 1 | 2 | 1 |
| AMD Bobcat | 3 | 2 | 2 | 1 |
| AMD Jaguar | 1 | 2 | 1 | 1 |
| Intel Pentium 4 | 1 | 5 | 1 | 5 |
| Intel Pentium M | 3 | 2 | 2 | 2 |
| Intel Merom | 3 | 1 | 1 | 1 |
| Intel Penryn+ | 1 | 1 | 1 | 1 |
easyaspi314 commented 5 years ago

Wait, scratch that. Duh. pshufd is gonna be faster because it doesn't overwrite the source operand. SSE2 would need an extra movdqa with psrlq.
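For context, the two alternatives being compared, in plain SSE2 intrinsics (a sketch; _mm_mul_epu32 only reads the low 32 bits of each 64-bit lane, so either form works as its second operand):

```c
#include <emmintrin.h>  /* SSE2 */

/* pshufd: writes a separate destination register, so the original dk stays live. */
static __m128i high_halves_pshufd(__m128i dk)
{
    return _mm_shuffle_epi32(dk, 0x31);
}

/* psrlq: shifts in place; without AVX's 3-operand encoding the compiler
 * has to emit an extra movdqa to preserve dk. */
static __m128i high_halves_psrlq(__m128i dk)
{
    return _mm_srli_epi64(dk, 32);
}
```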

easyaspi314 commented 5 years ago

@Cyan4973 would this be a correct implementation of the (updated) scalar loops? Even when compiled without SSE2 (I checked the assembly: all scalar instructions), I get about 6.6-7.0 GB/s this way on 64-bit, compared to 5.3 GB/s with the previous method. The reason is that this makes Clang use 64-bit xor instructions instead of pairs of 32-bit xors.

Since this doesn't bother 32-bit (it will be doing everything with register pairs anyway), it doesn't slow it down.

It only works with XXH_readLE64, though; two XXH_readLE32s and a shift+or don't work.
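For reference, the helpers this loop leans on look roughly like this (a sketch assuming a little-endian target; the real XXH_readLE64 in xxHash also handles big-endian and unaligned-access policies, and XXH3_readKey64 is the key-reading counterpart, not shown here):

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t U64;
typedef uint32_t U32;

/* 32x32->64 multiply: both operands are truncated to 32 bits first,
 * so compilers can emit a single mull/umull instead of a full 64x64 multiply. */
#define XXH_mult32to64(a, b) ((U64)(U32)(a) * (U64)(U32)(b))

/* Little-endian 64-bit read (sketch: the real version byte-swaps on
 * big-endian platforms and respects XXH_FORCE_MEMORY_ACCESS). */
static U64 XXH_readLE64(const void* ptr)
{
    U64 val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}
```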

XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{
          U64* const xacc  =       (U64*) acc;   /* presumed aligned */
    const U32* const xdata = (const U32*) data;
    const U32* const xkey  = (const U32*) key;
    size_t i;
    for (i=0; i < ACC_NB; i++) {
        U64 const data_val = XXH_readLE64(xdata + 2 * i);
        U64 const key_val = XXH3_readKey64(xkey + 2 * i);
        U64 const data_key  = key_val ^ data_val;
        xacc[i] += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
        xacc[i] += data_val;
    }
}

XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
          U64* const xacc =       (U64*) acc;
    const U32* const xkey = (const U32*) key;

    size_t i;
    for (i = 0; i < ACC_NB; i++) {
        U64 const acc_val   = xacc[i];
        U64 const shifted   = acc_val >> 47;
        U64 const data      = acc_val ^ shifted;

        U64 const key_val = XXH3_readKey64(xkey + 2 * i);

        U64 const data_key  = key_val ^ data;

        U64 const product1 = XXH_mult32to64(PRIME32_1, (data_key & 0xFFFFFFFF));
        U64 const product2 = XXH_mult32to64(PRIME32_2, (data_key >> 32));

        xacc[i] = product1 ^ product2;
    }
}

@Zhentar this might be useful in your quest to optimize your C# implementation.

Cyan4973 commented 5 years ago

Looks great @easyaspi314, the accumulate_512 function looks very good to me.

Sidenote: I'm modifying (again!) the scramble function, in a way which should be more friendly to scalar (though that's a side effect; the main objective is to nullify space reduction, which the new version seems to tackle completely). I'll update the xxh3 branch later today.

Note: OK, I've got 2 possibilities for the scrambler. I selected the one which is slightly more friendly to the scalar version. It adds one operation in the vectorial version, but also saves one const register. I've uploaded the new version, even though it's still undergoing quality testing. Collision rate is perfect, but I'm not yet sure about dispersion / bias.

Note 2: I tested your scalar variant of accumulate_512, and it's way better: 50% faster for gcc, and 2x faster for clang, which was already much faster than gcc to begin with. I suspect clang might autovectorize something in the new scalar path, because its resulting level of performance is better than XXH64 (though not at the level of the manual SSE2 variant).

easyaspi314 commented 5 years ago

You're killin' me, Smalls! :joy:

As for scalar, like I said, with the fixed version it gets pretty nice performance; it is actually faster than XXH64 (although that is just a ruse: Clang generates 4 erroneous movs in the XXH64 loop, bringing it down from 10.1 GB/s).

From what I saw, that full 64-bit multiply would be terrible for 32-bit. Full 64-bit multiplies were what slowed down XXH64.

U64 mult64(U64 a, U64 b)
{
    return a * b;
}
mult64:                                 # @mult64
        push    esi
        mov     ecx, dword ptr [esp + 16]
        mov     esi, dword ptr [esp + 8]
        mov     eax, ecx
        imul    ecx, dword ptr [esp + 12]
        mul     esi
        imul    esi, dword ptr [esp + 20]
        add     edx, ecx
        add     edx, esi
        pop     esi
        ret
mult64:
        push    {r11, lr}
        umull   r12, lr, r2, r0
        mla     r1, r2, r1, lr
        mla     r1, r3, r0, r1
        mov     r0, r12
        pop     {r11, pc}

(That is two 32-bit multiplies, one 32->64-bit multiply, and two 32-bit adds).

With my version, code gen doesn't seem to be that bad on any platform.

All x86_64 chips have SSE2 and all aarch64 chips have NEON, and with my fixes, XXH3 is very fast even without SSE2.

By the way, what do you think about my current version? I made a few (mostly aesthetic) changes:

Code:

```c
XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{
#if (XXH_VECTOR == XXH_AVX2) || (XXH_VECTOR == XXH_SSE2)
    assert((((size_t)acc) & (VEC_SIZE - 1)) == 0);
    {
        ALIGN(VEC_SIZE) XXH_vec* const xacc  =       (XXH_vec *) acc;
        const           XXH_vec* const xdata = (const XXH_vec *) data;
        const           XXH_vec* const xkey  = (const XXH_vec *) key;

        size_t i;
        for (i=0; i < STRIPE_LEN / VEC_SIZE; i++) {
            /* data_vec = xdata[i]; */
            XXH_vec const data_vec = XXH_MM_SI(loadu) (xdata + i);
            /* key_vec = xkey[i]; */
            XXH_vec const key_vec  = XXH_MM_SI(loadu) (xkey + i);
            /* data_key = data_vec ^ key_vec; */
            XXH_vec const data_key = XXH_MM_SI(xor) (data_vec, key_vec);
            /* shuffled = data_key[1, undef, 3, undef]; // essentially data_key >> 32; */
            XXH_vec const shuffled = XXH_MM(shuffle_epi32) (data_key, 0x31);
            /* product = (shuffled & 0xFFFFFFFF) * (data_key & 0xFFFFFFFF); */
            XXH_vec const product  = XXH_MM(mul_epu32) (shuffled, data_key);
            /* xacc[i] += data_vec; */
            xacc[i] = XXH_MM(add_epi64) (xacc[i], data_vec);
            /* xacc[i] += product; */
            xacc[i] = XXH_MM(add_epi64) (xacc[i], product);
        }
    }
#elif (XXH_VECTOR == XXH_NEON)   /* to be updated, no longer with latest sse/avx updates */
    assert((((size_t)acc) & 15) == 0);
    {
        uint64x2_t* const xacc = (uint64x2_t *) acc;
        /* We don't use a uint32x4_t pointer because it causes bus errors on ARMv7. */
        uint32_t const* const xdata = (const uint32_t *) data;
        uint32_t const* const xkey  = (const uint32_t *) key;

        size_t i;
        for (i=0; i < STRIPE_LEN / sizeof(uint64x2_t); i++) {
#if !defined(__aarch64__) && !defined(__arm64__)   /* ARM32-specific hack */
            /* vzip on ARMv7 Clang generates a lot of vmovs (technically vorrs) without this.
             * vzip on 32-bit ARM NEON will overwrite the original register, and I think that Clang
             * assumes I don't want to destroy it and tries to make a copy. This slows down the code
             * a lot.
             * aarch64 not only uses an entirely different syntax, but it requires three
             * instructions...
             *     ext  v1.16B, v0.16B, #8    // select high bits because aarch64 can't address them directly
             *     zip1 v3.2s, v0.2s, v1.2s   // first zip
             *     zip2 v2.2s, v0.2s, v1.2s   // second zip
             * ...to do what ARM does in one:
             *     vzip.32 d0, d1             // Interleave high and low bits and overwrite. */
            /* data_vec = xdata[i]; */
            uint32x4_t const data_vec = vld1q_u32(xdata + (i * 4));
            /* key_vec = xkey[i]; */
            uint32x4_t const key_vec  = vld1q_u32(xkey + (i * 4));
            /* data_key = data_vec ^ key_vec; */
            uint32x4_t data_key = veorq_u32(data_vec, key_vec);
            /* Here's the magic. We use the quirkiness of vzip to shuffle data_key in place.
             * shuffle: data_key[0, 1, 2, 3] = data_key[0, 2, 1, 3] */
            __asm__("vzip.32 %e0, %f0" : "+w" (data_key));
            /* xacc[i] += data_vec; */
            xacc[i] = vaddq_u64(xacc[i], vreinterpretq_u64_u32(data_vec));
            /* xacc[i] += (uint64x2_t) data_key[0, 1] * (uint64x2_t) data_key[2, 3]; */
            xacc[i] = vmlal_u32(xacc[i], vget_low_u32(data_key), vget_high_u32(data_key));
#else
            /* On aarch64, vshrn/vmovn seems to be equivalent to, if not faster than, the vzip method. */
            /* data_vec = xdata[i]; */
            uint32x4_t const data_vec = vld1q_u32(xdata + (i * 4));
            /* key_vec = xkey[i]; */
            uint32x4_t const key_vec  = vld1q_u32(xkey + (i * 4));
            /* data_key = data_vec ^ key_vec; */
            uint32x4_t const data_key = veorq_u32(data_vec, key_vec);
            /* data_key_lo = (uint32x2_t) (data_key & 0xFFFFFFFF); */
            uint32x2_t const data_key_lo = vmovn_u64 (vreinterpretq_u64_u32(data_key));
            /* data_key_hi = (uint32x2_t) (data_key >> 32); */
            uint32x2_t const data_key_hi = vshrn_n_u64 (vreinterpretq_u64_u32(data_key), 32);
            /* xacc[i] += data_vec; */
            xacc[i] = vaddq_u64 (xacc[i], vreinterpretq_u64_u32(data_vec));
            /* xacc[i] += (uint64x2_t) data_key_lo * (uint64x2_t) data_key_hi; */
            xacc[i] = vmlal_u32 (xacc[i], data_key_lo, data_key_hi);
#endif
        }
    }
#else   /* scalar variant - universal */
          U64* const xacc  =       (U64*) acc;   /* presumed aligned */
    const U32* const xdata = (const U32*) data;
    const U32* const xkey  = (const U32*) key;
    size_t i;
    for (i=0; i < ACC_NB; i++) {
        U64 const data_val = XXH_readLE64(xdata + 2 * i);
        U64 const key_val  = XXH3_readKey64(xkey + 2 * i);
        U64 const data_key = key_val ^ data_val;
        xacc[i] += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
        xacc[i] += data_val;
    }
#endif
}

XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
#if (XXH_VECTOR == XXH_AVX2) || (XXH_VECTOR == XXH_SSE2)
    assert((((size_t)acc) & (VEC_SIZE - 1)) == 0);
    {
        ALIGN(VEC_SIZE) XXH_vec * const xacc = (XXH_vec*) acc;
        XXH_vec const* const xkey = (const XXH_vec*) key;

        XXH_vec const prime1 = XXH_MM(set1_epi32) ((int) PRIME32_1);
        XXH_vec const prime2 = XXH_MM(set1_epi32) ((int) PRIME32_2);

        size_t i;
        for (i=0; i < STRIPE_LEN / VEC_SIZE; i++) {
            /* data_vec = xacc[i] ^ (xacc[i] >> 47); */
            XXH_vec const acc_vec  = xacc[i];
            XXH_vec const shifted  = XXH_MM(srli_epi64) (acc_vec, 47);
            XXH_vec const data_vec = XXH_MM_SI(xor) (acc_vec, shifted);
            /* key_vec = xkey[i]; */
            XXH_vec const key_vec  = XXH_MM_SI(loadu) (xkey + i);
            /* data_key = data_vec ^ key_vec; */
            XXH_vec const data_key = XXH_MM_SI(xor) (data_vec, key_vec);
            /* shuffled = data_key[1, undef, 3, undef]; // essentially data_key >> 32; */
            XXH_vec const shuffled = XXH_MM(shuffle_epi32) (data_key, 0x31);
            /* product1 = (data_key & 0xFFFFFFFF) * (uint64x2_t) PRIME32_1; */
            XXH_vec const product1 = XXH_MM(mul_epu32) (data_key, prime1);
            /* product2 = (shuffled & 0xFFFFFFFF) * (uint64x2_t) PRIME32_2; */
            XXH_vec const product2 = XXH_MM(mul_epu32) (shuffled, prime2);
            /* xacc[i] = product1 ^ product2; */
            xacc[i] = XXH_MM_SI(xor) (product1, product2);
        }
    }
#elif (XXH_VECTOR == XXH_NEON)
    assert((((size_t)acc) & 15) == 0);
    {
        uint64x2_t* const xacc = (uint64x2_t*) acc;
        uint32_t const* const xkey = (uint32_t const*) key;

        uint32x2_t const prime1 = vdup_n_u32 (PRIME32_1);
        uint32x2_t const prime2 = vdup_n_u32 (PRIME32_2);

        size_t i;
        for (i=0; i < STRIPE_LEN/sizeof(uint64x2_t); i++) {
            /* data_vec = xacc[i] ^ (xacc[i] >> 47); */
            uint64x2_t const acc_vec  = xacc[i];
            uint64x2_t const shifted  = vshrq_n_u64 (acc_vec, 47);
            uint64x2_t const data_vec = veorq_u64 (acc_vec, shifted);
            /* key_vec = xkey[i]; */
            uint32x4_t const key_vec  = vld1q_u32 (xkey + (i * 4));
            /* data_key = data_vec ^ key_vec; */
            uint32x4_t const data_key = veorq_u32 (vreinterpretq_u32_u64(data_vec), key_vec);
            /* shuffled = { data_key[0, 2], data_key[1, 3] }; */
            uint32x2x2_t const shuffled = vzip_u32 (vget_low_u32(data_key), vget_high_u32(data_key));
            /* product1 = (uint64x2_t) shuffled[0] * (uint64x2_t) PRIME32_1; */
            uint64x2_t const product1 = vmull_u32 (shuffled.val[0], prime1);
            /* product2 = (uint64x2_t) shuffled[1] * (uint64x2_t) PRIME32_2; */
            uint64x2_t const product2 = vmull_u32 (shuffled.val[1], prime2);
            /* xacc[i] = product1 ^ product2; */
            xacc[i] = veorq_u64(product1, product2);
        }
    }
#else   /* scalar variant - universal */
    U64* const xacc = (U64*) acc;
    const U32* const xkey = (const U32*) key;
    size_t i;
    for (i = 0; i < ACC_NB; i++) {
        U64 const acc_val  = xacc[i];
        U64 const shifted  = acc_val >> 47;
        U64 const data     = acc_val ^ shifted;
        U64 const key_val  = XXH3_readKey64(xkey + 2 * i);
        U64 const data_key = key_val ^ data;
        U64 const product1 = XXH_mult32to64(PRIME32_1, (data_key & 0xFFFFFFFF));
        U64 const product2 = XXH_mult32to64(PRIME32_2, (data_key >> 32));
        xacc[i] = product1 ^ product2;
    }
#endif
}
```
Cyan4973 commented 5 years ago

Yes, I like your changes @easyaspi314; they help readability, and that's a great positive.

easyaspi314 commented 5 years ago

Oh wait, I see now: you did a 64-by-32 multiply.

easyaspi314 commented 5 years ago

As for your code, the generated x86_64 and aarch64 code is much smaller, but 32-bit doesn't see as much of a difference.

However, the one I was using was faster on both i386 and x86_64-scalar:

32-bit:

My version: (clang -m32 -O3 -mno-sse2)

./xxhsum 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    34976 it/s ( 3415.6 MB/s)
XXH32 unaligned     :     102400 ->    34427 it/s ( 3362.0 MB/s)
XXH64               :     102400 ->    13591 it/s ( 1327.2 MB/s)
XXH64 unaligned     :     102400 ->    13418 it/s ( 1310.3 MB/s)
XXH3_64bits         :     102400 ->    29382 it/s ( 2869.4 MB/s)
XXH3_64b unaligned  :     102400 ->    29458 it/s ( 2876.7 MB/s)

Your new version:


./xxhsum 0.7.0 (32-bits i386 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    34746 it/s ( 3393.2 MB/s)
XXH32 unaligned     :     102400 ->    34149 it/s ( 3334.9 MB/s)
XXH64               :     102400 ->    13632 it/s ( 1331.3 MB/s)
XXH64 unaligned     :     102400 ->    13470 it/s ( 1315.4 MB/s)
XXH3_64bits         :     102400 ->    25121 it/s ( 2453.2 MB/s)
XXH3_64b unaligned  :     102400 ->    24559 it/s ( 2398.4 MB/s)

64-bit

My version: (clang -mno-sse2, it says SSE2 because clang messes up the benchmark without it)

./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    42382 it/s ( 4138.9 MB/s)
XXH32 unaligned     :     102400 ->    42508 it/s ( 4151.1 MB/s)
XXH64               :     102400 ->    68385 it/s ( 6678.3 MB/s)
XXH64 unaligned     :     102400 ->    67336 it/s ( 6575.8 MB/s)
XXH3_64bits         :     102400 ->    60443 it/s ( 5902.6 MB/s)
XXH3_64b unaligned  :     102400 ->    59623 it/s ( 5822.5 MB/s)

Your version:

./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    41963 it/s ( 4098.0 MB/s)
XXH32 unaligned     :     102400 ->    42648 it/s ( 4164.9 MB/s)
XXH64               :     102400 ->    68673 it/s ( 6706.3 MB/s)
XXH64 unaligned     :     102400 ->    67688 it/s ( 6610.1 MB/s)
XXH3_64bits         :     102400 ->    48290 it/s ( 4715.8 MB/s)
XXH3_64b unaligned  :     102400 ->    46441 it/s ( 4535.3 MB/s)

(My laptop was a little warm, so it wasn't going into Turbo Boost.)

easyaspi314 commented 5 years ago

Wait, clang was just not unrolling the loop. The result is a tiny bit faster.

32-bit:

XXH3_64bits         :     102400 ->    30095 it/s ( 2939.0 MB/s)
XXH3_64b unaligned  :     102400 ->    30123 it/s ( 2941.7 MB/s)

64-bit

XXH3_64bits         :     102400 ->    63576 it/s ( 6208.6 MB/s)
XXH3_64b unaligned  :     102400 ->    62257 it/s ( 6079.7 MB/s)
Cyan4973 commented 5 years ago

What are the differences between the versions tested? Is it limited to the scrambler? I'm not completely sure what is being measured.

I would expect most of the speed difference to come from the new accumulator, which I already acknowledged is a much faster variant. But that doesn't tell us much about the scrambler. I would expect the speed difference attributable to the scrambler to be pretty small, if measurable at all.

P.S.: I ran some tests integrating your new accumulator and comparing the new and the old scrambler. Differences are small but measurable:

| variant | old | new | diff |
| --- | --- | --- | --- |
| clang x64 | 17100 MB/s | 17700 MB/s | + |
| clang -m32 | 17100 MB/s | 17000 MB/s | - |
| clang -m32 -no-sse2 | 4900 MB/s | 5050 MB/s | + |
| gcc x64 | 9550 MB/s | 9650 MB/s | + |
| gcc -m32 | 3650 MB/s | 3600 MB/s | - |
| gcc -m32 -no-sse2 | 3600 MB/s | 3600 MB/s | = |
easyaspi314 commented 5 years ago

Yeah, the first version was my working tree vs your tree, but that last one was on my tree, differing only in the scrambler.

Clang was notably bitching about not being able to unroll the loop… Yes, the same Clang that does this (it will unroll thousands of times, it has no limit)

Also, for some reason, my laptop was warm because there were rogue dotnet benchmarks running in the background and the processes were locking my CPU into Turbo Boost.

why

Also, does GCC have SSE2 enabled? Try -msse2. I knew GCC was bad, but I didn't know it was that bad.

Cyan4973 commented 5 years ago

Well, turning on -msse2 with -m32 on gcc is actually even worse: speed goes down to 1100 MB/s.

easyaspi314 commented 5 years ago

Whaaaaat?

~/xxh3 $ gcc-8 -m32 -mno-sse3 -msse2 -DXXH_VECTOR=0 -c -O3 xxhash.c
~/xxh3 $ gcc-8 -m32 -mno-sse3 -msse2 -DXXH_VECTOR=0 -c -O3 xxhsum.c
~/xxh3 $ clang *.o -m32 -o xxhsum
ld: warning: The i386 architecture is deprecated for macOS (remove from the Xcode build setting: ARCHS)
ld: warning: ignoring file /usr/local/Cellar/llvm/8.0.0/lib/clang/8.0.0/lib/darwin/libclang_rt.osx.a, missing required architecture i386 in file /usr/local/Cellar/llvm/8.0.0/lib/clang/8.0.0/lib/darwin/libclang_rt.osx.a (2 slices)
~/xxh3 $ ./xxhsum -b
./xxhsum 0.7.0 (32-bits i386 + SSE2 little endian), GCC 8.3.0, by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    53042 it/s ( 5179.9 MB/s)
XXH32 unaligned     :     102400 ->    52267 it/s ( 5104.2 MB/s)
XXH64               :     102400 ->    16333 it/s ( 1595.0 MB/s)
XXH64 unaligned     :     102400 ->    16407 it/s ( 1602.2 MB/s)
XXH3_64bits         :     102400 ->     8526 it/s (  832.6 MB/s)
XXH3_64b unaligned  :     102400 ->     8584 it/s (  838.3 MB/s)
oh god, what the actual f***, GCC? ```asm .text .align 4,0x90 _XXH3_hashLong: LFB5372: pushl %ebp LCFI0: pushl %edi LCFI1: call ___x86.get_pc_thunk.di L1$pb: pushl %esi LCFI2: pushl %ebx LCFI3: movl %eax, %ebx movl %ecx, %eax subl $412, %esp LCFI4: shrl $10, %eax movl %edx, 392(%esp) movl %ecx, 388(%esp) movl %edi, 396(%esp) testl %eax, %eax je L7 movl %edi, %ecx movl 128+_kKey-L1$pb(%edi), %edi sall $10, %eax movl 132+_kKey-L1$pb(%ecx), %ebp movl %edi, 312(%esp) movl 136+_kKey-L1$pb(%ecx), %edi movl %ebp, 316(%esp) movl 140+_kKey-L1$pb(%ecx), %ebp movl %edi, 320(%esp) movl 144+_kKey-L1$pb(%ecx), %edi movl %ebp, 324(%esp) movl 148+_kKey-L1$pb(%ecx), %ebp movl %edi, 328(%esp) movl 152+_kKey-L1$pb(%ecx), %edi movl %ebp, 332(%esp) movl 156+_kKey-L1$pb(%ecx), %ebp movl %edi, 336(%esp) movl 160+_kKey-L1$pb(%ecx), %edi movl %ebp, 340(%esp) movl 164+_kKey-L1$pb(%ecx), %ebp movl %edi, 344(%esp) movl 168+_kKey-L1$pb(%ecx), %edi movl %ebp, 348(%esp) movl 172+_kKey-L1$pb(%ecx), %ebp movl %edi, 352(%esp) movl 176+_kKey-L1$pb(%ecx), %edi movl %ebp, 356(%esp) movl 180+_kKey-L1$pb(%ecx), %ebp movl %edi, 360(%esp) movl 184+_kKey-L1$pb(%ecx), %edi movl %ebp, 364(%esp) movl 188+_kKey-L1$pb(%ecx), %ebp movl %edi, 368(%esp) movl %ebp, 372(%esp) movl (%ebx), %edi movl 4(%ebx), %ebp movl %edi, 152(%esp) movl 8(%ebx), %edi movl %ebp, 156(%esp) movl 12(%ebx), %ebp movl %edi, 160(%esp) movl 16(%ebx), %edi movl %ebp, 164(%esp) movl 20(%ebx), %ebp movl %edi, 168(%esp) movl 24(%ebx), %edi movl %ebp, 172(%esp) movl 28(%ebx), %ebp movl %edi, 176(%esp) movl 32(%ebx), %edi movl %ebp, 180(%esp) movl 36(%ebx), %ebp movl %edi, 184(%esp) movl 40(%ebx), %edi movl %ebp, 188(%esp) movl 44(%ebx), %ebp movl %edi, 192(%esp) movl 48(%ebx), %edi movl %ebp, 196(%esp) movl 52(%ebx), %ebp movl %edi, 200(%esp) movl 56(%ebx), %edi movl %ebp, 204(%esp) movl 60(%ebx), %ebp movl %edi, (%esp) movl %ebp, 4(%esp) movl 392(%esp), %edi movl 36+_kKey-L1$pb(%ecx), %edx addl %edi, %eax movl %edi, %ebp movl %eax, 380(%esp) movl 32+_kKey-L1$pb(%ecx), %eax movl %edx, 228(%esp) movl 44+_kKey-L1$pb(%ecx), %edx movl %eax, 224(%esp) movl 40+_kKey-L1$pb(%ecx), %eax movl %edx, 212(%esp) movl 52+_kKey-L1$pb(%ecx), %edx movl %eax, 208(%esp) movl 48+_kKey-L1$pb(%ecx), %eax movl %edx, 292(%esp) movl 4+_kKey-L1$pb(%ecx), %edx movl %eax, 288(%esp) movl _kKey-L1$pb(%ecx), %eax movl %edx, 244(%esp) movl 12+_kKey-L1$pb(%ecx), %edx movl %eax, 240(%esp) movl 8+_kKey-L1$pb(%ecx), %eax movl %edx, 260(%esp) movl 20+_kKey-L1$pb(%ecx), %edx movl %eax, 256(%esp) movl 16+_kKey-L1$pb(%ecx), %eax movl %edx, 276(%esp) movl %eax, 272(%esp) leal _kKey-L1$pb(%ecx), %eax movl %eax, 384(%esp) .align 4,0x90 L6: movl 384(%esp), %eax movl %ebp, %ecx movl %ebp, 308(%esp) movl (%esp), %esi movdqa 272(%esp), %xmm4 movdqa 256(%esp), %xmm3 leal 128(%eax), %edi movl %eax, %ebp movdqa 240(%esp), %xmm1 movdqa 288(%esp), %xmm5 movl %edi, 304(%esp) movl 4(%esp), %edi movdqa 208(%esp), %xmm2 movdqa 224(%esp), %xmm0 .align 4,0x90 L5: movq (%ecx), %xmm6 pxor %xmm6, %xmm1 movd %xmm1, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm1 punpckldq %xmm7, %xmm1 paddq %xmm1, %xmm6 movd %xmm6, 80(%esp) psrlq $32, %xmm6 movl 80(%esp), %eax addl %eax, 152(%esp) movd %xmm6, 84(%esp) movdqa %xmm3, %xmm6 movl 84(%esp), %edx adcl %edx, 156(%esp) movq 152(%esp), %xmm7 movq %xmm7, (%ebx) movq 8(%ecx), %xmm7 pxor %xmm7, %xmm6 movd %xmm6, %edx psrlq $32, %xmm6 movd %xmm6, %eax mull %edx movd %edx, %xmm6 movd %eax, %xmm1 punpckldq %xmm6, %xmm1 paddq %xmm7, %xmm1 movdqa %xmm4, %xmm6 
movd %xmm1, 96(%esp) psrlq $32, %xmm1 movl 96(%esp), %eax addl %eax, 160(%esp) movd %xmm1, 100(%esp) movl 100(%esp), %edx adcl %edx, 164(%esp) movq 160(%esp), %xmm7 movq %xmm7, 8(%ebx) movq 16(%ecx), %xmm7 pxor %xmm7, %xmm6 movd %xmm6, %edx psrlq $32, %xmm6 movd %xmm6, %eax mull %edx movd %edx, %xmm6 movd %eax, %xmm1 punpckldq %xmm6, %xmm1 paddq %xmm7, %xmm1 movd %xmm1, 112(%esp) psrlq $32, %xmm1 movl 112(%esp), %eax movd %xmm1, 116(%esp) movl 116(%esp), %edx addl %eax, 168(%esp) adcl %edx, 172(%esp) movq 168(%esp), %xmm7 movq %xmm7, 16(%ebx) movl 24(%ebp), %eax movl 28(%ebp), %edx movq 24(%ecx), %xmm6 movl %eax, (%esp) movl %edx, 4(%esp) movdqa (%esp), %xmm7 pxor %xmm6, %xmm7 movdqa %xmm7, %xmm1 movd %xmm7, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm1 punpckldq %xmm7, %xmm1 paddq %xmm6, %xmm1 movd %xmm1, 128(%esp) psrlq $32, %xmm1 movl 128(%esp), %eax movd %xmm1, 132(%esp) movl 132(%esp), %edx addl %eax, 176(%esp) adcl %edx, 180(%esp) movq 176(%esp), %xmm7 movq %xmm7, 24(%ebx) movq 32(%ecx), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm6 movd %eax, %xmm0 punpckldq %xmm6, %xmm0 paddq %xmm1, %xmm0 movd %xmm0, 16(%esp) psrlq $32, %xmm0 movl 16(%esp), %eax movd %xmm0, 20(%esp) movl 20(%esp), %edx addl %eax, 184(%esp) adcl %edx, 188(%esp) movq 184(%esp), %xmm7 movq %xmm7, 32(%ebx) movq 40(%ecx), %xmm6 movdqa %xmm2, %xmm7 pxor %xmm6, %xmm7 movdqa %xmm7, %xmm1 movd %xmm7, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm1 movd %eax, %xmm0 punpckldq %xmm1, %xmm0 paddq %xmm6, %xmm0 movd %xmm0, 32(%esp) psrlq $32, %xmm0 movl 32(%esp), %eax addl %eax, 192(%esp) movd %xmm0, 36(%esp) movl 36(%esp), %edx adcl %edx, 196(%esp) movq 192(%esp), %xmm7 movq %xmm7, 40(%ebx) movq 48(%ecx), %xmm6 movdqa %xmm5, %xmm7 pxor %xmm6, %xmm7 movdqa %xmm7, %xmm1 movd %xmm7, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm1 movd %eax, %xmm0 punpckldq %xmm1, %xmm0 paddq %xmm6, %xmm0 movd %xmm0, 48(%esp) psrlq $32, %xmm0 movl 48(%esp), %eax addl %eax, 200(%esp) movd %xmm0, 52(%esp) movl 52(%esp), %edx adcl %edx, 204(%esp) movq 200(%esp), %xmm7 movq %xmm7, 48(%ebx) movq 56(%ecx), %xmm6 movq 56(%ebp), %xmm1 movdqa %xmm6, %xmm7 pxor %xmm1, %xmm7 movdqa %xmm7, %xmm0 movd %xmm7, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %eax, %xmm0 movd %edx, %xmm7 punpckldq %xmm7, %xmm0 paddq %xmm6, %xmm0 movd %xmm0, 64(%esp) psrlq $32, %xmm0 addl 64(%esp), %esi movd %xmm0, 68(%esp) adcl 68(%esp), %edi addl $8, %ebp addl $64, %ecx movdqa %xmm2, %xmm0 movdqa %xmm5, %xmm2 movdqa %xmm1, %xmm5 movdqa %xmm3, %xmm1 movdqa %xmm4, %xmm3 movdqa (%esp), %xmm4 movl %esi, 56(%ebx) movl %edi, 60(%ebx) cmpl 304(%esp), %ebp jne L5 movl %esi, (%esp) movl 312(%esp), %eax movl $-1640531535, %ecx movl 152(%esp), %esi movl %edi, 4(%esp) movl 156(%esp), %edi movl 316(%esp), %edx movl 308(%esp), %ebp xorl %esi, %eax movl %edi, %esi xorl %edi, %edx shrl $15, %esi xorl %edi, %edi xorl %esi, %eax xorl %edi, %edx movl 160(%esp), %esi imull $-1640531535, %edx, %edi mull %ecx movl %edx, 156(%esp) movl 324(%esp), %edx addl %edi, 156(%esp) movl %eax, 152(%esp) movl 164(%esp), %edi movl 320(%esp), %eax movq 152(%esp), %xmm5 xorl %edi, %edx movq %xmm5, (%ebx) xorl %esi, %eax movl %edi, %esi xorl %edi, %edi shrl $15, %esi xorl %edi, %edx xorl %esi, %eax imull $-1640531535, %edx, %edi movl 168(%esp), %esi mull %ecx movl %edx, 164(%esp) movl 332(%esp), %edx addl %edi, 164(%esp) movl %eax, 160(%esp) movl 172(%esp), %edi movl 328(%esp), %eax movq 
160(%esp), %xmm4 xorl %edi, %edx movq %xmm4, 8(%ebx) xorl %esi, %eax movl %edi, %esi xorl %edi, %edi shrl $15, %esi xorl %edi, %edx xorl %esi, %eax imull $-1640531535, %edx, %edi mull %ecx movl %edx, 172(%esp) addl %edi, 172(%esp) movl %eax, 168(%esp) movq 168(%esp), %xmm2 movq %xmm2, 16(%ebx) movl 180(%esp), %edi movl 176(%esp), %esi movl 336(%esp), %eax movl 340(%esp), %edx xorl %esi, %eax movl %edi, %esi shrl $15, %esi xorl %edi, %edx xorl %edi, %edi xorl %esi, %eax xorl %edi, %edx movl 184(%esp), %esi imull $-1640531535, %edx, %edi mull %ecx movl %edx, 180(%esp) movl 348(%esp), %edx addl %edi, 180(%esp) movl %eax, 176(%esp) movl 188(%esp), %edi movl 344(%esp), %eax movq 176(%esp), %xmm3 xorl %edi, %edx movq %xmm3, 24(%ebx) xorl %esi, %eax movl %edi, %esi xorl %edi, %edi shrl $15, %esi xorl %edi, %edx xorl %esi, %eax imull $-1640531535, %edx, %edi movl 192(%esp), %esi mull %ecx movl %edx, 188(%esp) movl 356(%esp), %edx addl %edi, 188(%esp) movl %eax, 184(%esp) movl 196(%esp), %edi movl 352(%esp), %eax movq 184(%esp), %xmm5 xorl %edi, %edx movq %xmm5, 32(%ebx) xorl %esi, %eax movl %edi, %esi xorl %edi, %edi shrl $15, %esi xorl %edi, %edx xorl %esi, %eax imull $-1640531535, %edx, %edi mull %ecx movl %edx, 196(%esp) movl 364(%esp), %edx addl %edi, 196(%esp) movl %eax, 192(%esp) movq 192(%esp), %xmm4 movl 360(%esp), %eax movq %xmm4, 40(%ebx) movl 200(%esp), %esi movl 204(%esp), %edi xorl %esi, %eax movl %edi, %esi xorl %edi, %edx xorl %edi, %edi shrl $15, %esi xorl %edi, %edx xorl %esi, %eax imull $-1640531535, %edx, %edi movl (%esp), %esi mull %ecx movl %edx, 204(%esp) movl 372(%esp), %edx addl %edi, 204(%esp) movl %eax, 200(%esp) movl 4(%esp), %edi movl 368(%esp), %eax movq 200(%esp), %xmm2 movq %xmm2, 48(%ebx) xorl %esi, %eax movl %edi, %esi xorl %edi, %edx xorl %edi, %edi shrl $15, %esi addl $1024, %ebp xorl %edi, %edx xorl %esi, %eax imull $-1640531535, %edx, %edi mull %ecx movl %edx, 4(%esp) addl %edi, 4(%esp) movl %eax, (%esp) movq (%esp), %xmm3 movq %xmm3, 56(%ebx) cmpl 380(%esp), %ebp jne L6 L7: movl 388(%esp), %eax movl 392(%esp), %ecx movl %eax, %ebp andl $-1024, %eax shrl $6, %ebp addl %eax, %ecx andl $15, %ebp je L4 movl (%ebx), %eax movl 4(%ebx), %edx movl 396(%esp), %edi movq 40(%ebx), %xmm4 movl %eax, 80(%esp) movl 8(%ebx), %eax movl %edx, 84(%esp) movl 12(%ebx), %edx movq 32+_kKey-L1$pb(%edi), %xmm0 leal _kKey-L1$pb(%edi), %esi movq 48(%ebx), %xmm3 movl %eax, (%esp) movl 16(%ebx), %eax movl %edx, 4(%esp) movl 20(%ebx), %edx movq (%esp), %xmm5 movq 56(%ebx), %xmm2 movl %eax, 96(%esp) movl 24(%ebx), %eax movl %edx, 100(%esp) movl 28(%ebx), %edx movl %eax, 112(%esp) movl 32(%ebx), %eax movl %edx, 116(%esp) movl 36(%ebx), %edx movl %eax, 128(%esp) movl 40+_kKey-L1$pb(%edi), %eax movl %edx, 132(%esp) movl 44+_kKey-L1$pb(%edi), %edx movl %eax, (%esp) movl 48+_kKey-L1$pb(%edi), %eax movl %edx, 4(%esp) movl 52+_kKey-L1$pb(%edi), %edx movl %eax, 32(%esp) movl 8+_kKey-L1$pb(%edi), %eax movl %edx, 36(%esp) movl 12+_kKey-L1$pb(%edi), %edx movq _kKey-L1$pb(%edi), %xmm1 movl %eax, 48(%esp) movl 16+_kKey-L1$pb(%edi), %eax movl %edx, 52(%esp) movl 20+_kKey-L1$pb(%edi), %edx leal (%esi,%ebp,8), %edi movl %eax, 16(%esp) movl %edx, 20(%esp) .align 4,0x90 L10: movq (%ecx), %xmm6 addl $8, %esi addl $64, %ecx pxor %xmm6, %xmm1 movd %xmm1, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm1 punpckldq %xmm7, %xmm1 movdqa 80(%esp), %xmm7 paddq %xmm6, %xmm1 paddq %xmm1, %xmm7 movq %xmm7, (%ebx) movq -56(%ecx), %xmm6 movq %xmm7, 80(%esp) movdqa 48(%esp), %xmm7 pxor 
%xmm6, %xmm7 movdqa %xmm7, %xmm1 movd %xmm7, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm1 punpckldq %xmm7, %xmm1 paddq %xmm6, %xmm1 movdqa 16(%esp), %xmm7 paddq %xmm1, %xmm5 movq %xmm5, 8(%ebx) movq -48(%ecx), %xmm6 pxor %xmm6, %xmm7 movdqa %xmm7, %xmm1 movd %xmm7, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm1 punpckldq %xmm7, %xmm1 movdqa 96(%esp), %xmm7 paddq %xmm6, %xmm1 paddq %xmm1, %xmm7 movq %xmm7, 16(%ebx) movl 16(%esi), %eax movl 20(%esi), %edx movq %xmm7, 96(%esp) movq -40(%ecx), %xmm6 movl %eax, 64(%esp) movl %edx, 68(%esp) movdqa 64(%esp), %xmm7 pxor %xmm6, %xmm7 movdqa %xmm7, %xmm1 movd %xmm7, %edx psrlq $32, %xmm1 movd %xmm1, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm1 punpckldq %xmm7, %xmm1 movdqa 112(%esp), %xmm7 paddq %xmm6, %xmm1 paddq %xmm1, %xmm7 movq %xmm7, 24(%ebx) movq -32(%ecx), %xmm1 movq %xmm7, 112(%esp) movdqa 128(%esp), %xmm7 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm6 movd %eax, %xmm0 punpckldq %xmm6, %xmm0 paddq %xmm1, %xmm0 paddq %xmm0, %xmm7 movq %xmm7, 32(%ebx) movq -24(%ecx), %xmm1 movq %xmm7, 128(%esp) movdqa (%esp), %xmm7 pxor %xmm1, %xmm7 movdqa %xmm7, %xmm0 movd %xmm7, %edx movdqa 32(%esp), %xmm7 psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm6 movd %eax, %xmm0 punpckldq %xmm6, %xmm0 paddq %xmm1, %xmm0 paddq %xmm0, %xmm4 movq %xmm4, 40(%ebx) movq -16(%ecx), %xmm1 pxor %xmm1, %xmm7 movdqa %xmm7, %xmm0 movd %xmm7, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm6 movd %eax, %xmm0 punpckldq %xmm6, %xmm0 paddq %xmm1, %xmm0 paddq %xmm0, %xmm3 movq %xmm3, 48(%ebx) movq -8(%ecx), %xmm6 movq 48(%esi), %xmm1 movdqa %xmm6, %xmm7 pxor %xmm1, %xmm7 movdqa %xmm7, %xmm0 movd %xmm7, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm7 movd %eax, %xmm0 punpckldq %xmm7, %xmm0 movdqa 32(%esp), %xmm7 paddq %xmm6, %xmm0 movaps %xmm1, 32(%esp) paddq %xmm0, %xmm2 movdqa (%esp), %xmm0 movaps %xmm7, (%esp) movdqa 16(%esp), %xmm7 movdqa 48(%esp), %xmm1 movq %xmm2, 56(%ebx) movaps %xmm7, 48(%esp) movdqa 64(%esp), %xmm7 movaps %xmm7, 16(%esp) cmpl %edi, %esi jne L10 L4: testb $63, 388(%esp) je L1 movl 392(%esp), %edi sall $3, %ebp movl 388(%esp), %ecx leal -48(%edi,%ecx), %edx leal -64(%edi,%ecx), %eax cmpl %edx, %ebx jnb L14 leal 16(%ebx), %edx cmpl %edx, %eax jb L11 L14: movl 396(%esp), %edi movdqu (%eax), %xmm2 leal _kKey-L1$pb(%edi,%ebp), %edx movdqu (%edx), %xmm1 pxor %xmm2, %xmm1 movdqa lC0-L1$pb(%edi), %xmm2 movdqa %xmm1, %xmm5 psrlq $32, %xmm1 movdqa %xmm1, %xmm4 pand %xmm2, %xmm5 movdqa %xmm5, %xmm0 movdqa %xmm5, %xmm3 psrlq $32, %xmm0 psrlq $32, %xmm4 pmuludq %xmm1, %xmm3 pmuludq %xmm0, %xmm1 movdqa %xmm4, %xmm0 pmuludq %xmm5, %xmm0 paddq %xmm0, %xmm1 movdqu (%eax), %xmm0 paddq (%ebx), %xmm0 psllq $32, %xmm1 paddq %xmm3, %xmm1 paddq %xmm0, %xmm1 movaps %xmm1, (%ebx) movdqu 16(%edx), %xmm1 movdqu 16(%eax), %xmm4 pxor %xmm4, %xmm1 movdqa %xmm1, %xmm5 psrlq $32, %xmm1 pand %xmm2, %xmm5 movdqa %xmm1, %xmm4 movdqa %xmm5, %xmm0 movdqa %xmm5, %xmm3 psrlq $32, %xmm0 psrlq $32, %xmm4 pmuludq %xmm1, %xmm3 pmuludq %xmm0, %xmm1 movdqa %xmm4, %xmm0 pmuludq %xmm5, %xmm0 paddq %xmm0, %xmm1 movdqu 16(%eax), %xmm0 paddq 16(%ebx), %xmm0 psllq $32, %xmm1 paddq %xmm3, %xmm1 paddq %xmm0, %xmm1 movaps %xmm1, 16(%ebx) movdqu 32(%edx), %xmm1 movdqu 32(%eax), %xmm4 pxor %xmm4, %xmm1 movdqa %xmm1, %xmm5 psrlq $32, %xmm1 pand %xmm2, %xmm5 movdqa %xmm1, %xmm4 movdqa %xmm5, %xmm0 movdqa %xmm5, %xmm3 psrlq $32, %xmm0 psrlq $32, 
%xmm4 pmuludq %xmm1, %xmm3 pmuludq %xmm0, %xmm1 movdqa %xmm4, %xmm0 pmuludq %xmm5, %xmm0 paddq %xmm0, %xmm1 movdqu 32(%eax), %xmm0 paddq 32(%ebx), %xmm0 psllq $32, %xmm1 paddq %xmm3, %xmm1 paddq %xmm0, %xmm1 movdqu 48(%edx), %xmm0 movaps %xmm1, 32(%ebx) movdqu 48(%eax), %xmm3 pxor %xmm3, %xmm0 movdqa %xmm0, %xmm4 pand %xmm2, %xmm0 movdqa %xmm0, %xmm5 psrlq $32, %xmm4 paddq 48(%ebx), %xmm3 movdqa %xmm4, %xmm1 psrlq $32, %xmm5 movdqa %xmm4, %xmm2 psrlq $32, %xmm1 pmuludq %xmm0, %xmm2 pmuludq %xmm1, %xmm0 movdqa %xmm5, %xmm1 pmuludq %xmm4, %xmm1 paddq %xmm1, %xmm0 psllq $32, %xmm0 paddq %xmm2, %xmm0 paddq %xmm3, %xmm0 movaps %xmm0, 48(%ebx) L1: addl $412, %esp LCFI5: popl %ebx LCFI6: popl %esi LCFI7: popl %edi LCFI8: popl %ebp LCFI9: ret L11: LCFI10: movq (%eax), %xmm1 movl %ecx, %esi movl 396(%esp), %eax movq _kKey-L1$pb(%eax,%ebp), %xmm0 leal _kKey-L1$pb(%eax), %ecx pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq (%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, (%ebx) movq 8(%ecx,%ebp), %xmm0 movq -56(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 8(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 8(%ebx) movq 16(%ecx,%ebp), %xmm0 movq -48(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 16(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 16(%ebx) movq 24(%ecx,%ebp), %xmm0 movq -40(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 24(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 24(%ebx) movq 32(%ecx,%ebp), %xmm0 movq -32(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 32(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 32(%ebx) movq 40(%ecx,%ebp), %xmm0 movq -24(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 40(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 40(%ebx) movq 48(%ecx,%ebp), %xmm0 movq -16(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %edx, %xmm2 movd %eax, %xmm0 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 48(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 48(%ebx) movq 56(%ecx,%ebp), %xmm0 movq -8(%edi,%esi), %xmm1 pxor %xmm1, %xmm0 movd %xmm0, %edx psrlq $32, %xmm0 movd %xmm0, %eax mull %edx movd %eax, %xmm0 movd %edx, %xmm2 punpckldq %xmm2, %xmm0 paddq %xmm1, %xmm0 movq 56(%ebx), %xmm1 paddq %xmm1, %xmm0 movq %xmm0, 56(%ebx) addl $412, %esp LCFI11: popl %ebx LCFI12: popl %esi LCFI13: popl %edi LCFI14: popl %ebp LCFI15: ret LFE5372: ```

(sorry about the AT&T syntax, GCC for macOS doesn't support Intel syntax)

Not a huge deal, because GCC will be using the SSE2 path in that case anyway, so this isn't something to worry about too much.

Wait, I figured out the difference: I passed kKey as a pointer to XXH3_hashLong (so I didn't have duplicated keys). Passing kKey as a pointer that way gets 3.2 GB/s.

Also, I found my ancient P3 laptop, but I can't seem to find the power cord. It's one of those things where you find it when you are looking for something else but not when you are looking for it. 😒

easyaspi314 commented 5 years ago

I guess this is the case I had before with GCC for Thumb. GCC figures out it can optimize a constant and does it in the absolute worst way possible.

I think I'm gonna call it "constant propagurgitation".

easyaspi314 commented 5 years ago

https://github.com/easyaspi314/xxhash/tree/multitarget

I uploaded my changes. I have some of the updated algorithm, the dispatcher, updated scalar code, and a few other goodies. I'm not opening a PR until you tell me how that new code works out.

easyaspi314 commented 5 years ago

Also, I need to fix these god damn ugly merge conflicts. 😒

#if !defined(XXH3_TARGET_C) && defined(__GNUC__)
__attribute__((__constructor__))
#endif
static void
XXH3_featureTest(void)
{
    int max, data[4];
    /* First, get how many CPUID function parameters there are by calling CPUID with eax = 0. */
    XXH_CPUID(data, /* eax */ 0);
    max = data[0];
    /* AVX2 is on the Extended Features page (eax = 7, ecx = 0), on bit 5 of ebx. */
    if (max >= 7) {
        XXH_CPUIDEX(data, /* eax */ 7, /* ecx */ 0);
        if (data[1] & (1 << 5)) {
            cpu_mode = XXH_CPU_MODE_AVX2;
            return;
        }
    }
<<<<<<< HEAD
    /* SSE2 is on the Processor Info and Feature Bits page (eax = 1), on bit 26 of edx. */
    if (max >= 1) {
        XXH_CPUID(data, /* eax */ 1);
        if (data[3] & (1 << 26)) {
            cpu_mode = XXH_CPU_MODE_SSE2;
            return;
        }
=======

#else   /* scalar variant of Accumulator - universal */

          U64* const xacc  =       (U64*) acc;   /* presumed aligned */
    const U32* const xdata = (const U32*) data;
    const U32* const xkey  = (const U32*) key;

    int i;
    for (i=0; i < (int)ACC_NB; i++) {
        int const left = 2*i;
        int const right= 2*i + 1;
        U32 const dataLeft  = XXH_readLE32(xdata + left);
        U32 const dataRight = XXH_readLE32(xdata + right);
        xacc[i] += XXH_mult32to64(dataLeft ^ xkey[left], dataRight ^ xkey[right]);
        xacc[i] += dataLeft + ((U64)dataRight << 32);
>>>>>>> 7efa77614832b96eb78a286b3607a97bfa188276
    }
    /* Must be scalar. */
    cpu_mode = XXH_CPU_MODE_SCALAR;
}
easyaspi314 commented 5 years ago
~/xxh3-fix $ ./xxhsum -b -m2
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    49052 it/s ( 4790.3 MB/s)
XXH32 unaligned     :     102400 ->    52240 it/s ( 5101.5 MB/s)
XXH64               :     102400 ->    72225 it/s ( 7053.2 MB/s)
XXH64 unaligned     :     102400 ->    72505 it/s ( 7080.6 MB/s)
Illegal instruction: 4    102400 ->
~/xxh3-fix $ ./xxhsum -b -m1
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    50042 it/s ( 4886.9 MB/s)
XXH32 unaligned     :     102400 ->    49664 it/s ( 4850.0 MB/s)
XXH64               :     102400 ->    75342 it/s ( 7357.6 MB/s)
XXH64 unaligned     :     102400 ->    72363 it/s ( 7066.7 MB/s)
XXH3_64bits         :     102400 ->   168694 it/s (16474.0 MB/s)
XXH3_64b unaligned  :     102400 ->   172788 it/s (16873.8 MB/s)
~/xxh3-fix $ ./xxhsum -b -m0
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    50954 it/s ( 4976.0 MB/s)
XXH32 unaligned     :     102400 ->    49851 it/s ( 4868.3 MB/s)
XXH64               :     102400 ->    75145 it/s ( 7338.4 MB/s)
XXH64 unaligned     :     102400 ->    71043 it/s ( 6937.8 MB/s)
XXH3_64bits         :     102400 ->    87310 it/s ( 8526.4 MB/s)
XXH3_64b unaligned  :     102400 ->    85418 it/s ( 8341.6 MB/s)

I added a -m switch to xxhsum which selects the scalar, SSE2, or AVX2 code path depending on whether its argument is 0, 1, or 2, respectively.

As you can see, -m2 crashes because it tries to use AVX2; -m1 gives me the best performance, which in my case is the SSE2 version; and -m0, the scalar code, only gives decent performance (no, it isn't actually faster than XXH64, that is a Clang bug).

EDIT: In order to use the dispatching, you need to do make MULTI_TARGET=1.
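
For anyone wondering what the dispatch itself amounts to: it's just a function pointer picked from cpu_mode. A rough sketch, with hypothetical symbol names rather than the exact ones in my branch:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*XXH3_hashLong_f)(const void* data, size_t len);

/* One implementation per translation unit, each compiled with its own arch flags.
 * These names are illustrative; the internal names in the branch may differ. */
extern uint64_t XXH3_hashLong_scalar(const void* data, size_t len);
extern uint64_t XXH3_hashLong_sse2  (const void* data, size_t len);
extern uint64_t XXH3_hashLong_avx2  (const void* data, size_t len);

extern int cpu_mode;  /* set by XXH3_featureTest(): 0 = scalar, 1 = SSE2, 2 = AVX2 */

static XXH3_hashLong_f XXH3_selectHashLong(void)
{
    switch (cpu_mode) {
    case 2:  return XXH3_hashLong_avx2;   /* object file built with -mavx2 / -arch:AVX2 */
    case 1:  return XXH3_hashLong_sse2;   /* object file built with -msse2 / -arch:SSE2 */
    default: return XXH3_hashLong_scalar; /* portable fallback */
    }
}
```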

easyaspi314 commented 5 years ago

For MSVC, if someone could try the multitargeting code, this should compile it:

> cl.exe -O2 -c xxhash.c -DXXH_MULTI_TARGET
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET -DXXH_VECTOR=0 -o xxh3-scalar.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET -DXXH_VECTOR=1 -arch:SSE2 -o xxh3-sse2.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET -DXXH_VECTOR=2 -arch:AVX2 -o xxh3-avx2.obj
> cl.exe -O2 -c xxhsum.c -DXXH_MULTI_TARGET
> cl.exe xxhash.obj xxhsum.obj xxh3-scalar.obj xxh3-sse2.obj xxh3-avx2.obj -o xxhsum.exe

Untested for now; my sister had the laptop all day.

easyaspi314 commented 5 years ago

I fixed the code for MSVC, and here are the correct commands:

> cl.exe -O2 -c xxhash.c -DXXH_MULTI_TARGET=1
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET=1 -DXXH_VECTOR=0 -Foxxh3-scalar.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET=1 -DXXH_VECTOR=1 -arch:SSE2 -Foxxh3-sse2.obj
> cl.exe -O2 -c xxh3-target.c -DXXH_MULTI_TARGET=1 -DXXH_VECTOR=2 -arch:AVX2 -Foxxh3-avx2.obj
> cl.exe -O2 -c xxhsum.c -DXXH_MULTI_TARGET=1
> cl.exe -O2 xxhsum.obj xxhash.obj xxh3-scalar.obj xxh3-sse2.obj xxh3-avx2.obj

MSVC stands true to its name: MSVC Sucks at Vectorizing Code. The AVX2 code is not much faster than the SSE2 code, while even GCC's version had a notable speedup with AVX2. I think it is not unrolling the loop.

Edit: MSVC will warn about -arch:SSE2 when targeting x64, since that flag only applies to 32-bit x86. You can ignore the warning, though.

Cyan4973 commented 5 years ago

FYI, regarding the scrambler: we'll keep the new scrambler formula, as it has higher quality.

The new scrambler is fully bijective, while the previous one incurred a small space reduction with both variants (^ and +; + was noticeably better, but still not fully bijective).

Note that XXH3 nonetheless met its 64-bit collision-rate objectives even with the old scrambler, likely because the scrambler stage is "rare" (once per KB) and the final stage mixes 8 accumulators down (512 bits -> 64 bits), so even if some space reduction had started earlier, it no longer matters much after that final mixing. But I nonetheless value the bijective property for the scrambler stage; from a collision-rate perspective, it carries a higher quality tag.
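
To illustrate why bijectivity matters (this is a generic sketch, not XXH3's actual scrambler): a step built only from invertible operations cannot shrink the accumulator's state space, while a non-invertible step can.

```c
#include <stdint.h>

/* Illustration only. Both operations below are bijections on 64-bit values:
 * xorshift is invertible, and multiplying by an odd constant is invertible
 * modulo 2^64, so no two accumulator states can collapse into one here. */
static uint64_t bijective_scramble(uint64_t acc)
{
    acc ^= acc >> 29;
    acc *= 0x9E3779B97F4A7C15ULL;  /* odd constant => invertible mod 2^64 */
    return acc;
}

/* By contrast, something like `acc *= 2;` or `acc &= mask;` maps several
 * inputs to the same output, which is the "space reduction" discussed above. */
```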

easyaspi314 commented 5 years ago

When you're done tuning, could you update the test values for me so I know we are on the same page?

Cyan4973 commented 5 years ago

Sure, I will. The xxh3 branch will be merged when XXH3_64bits() is ready; it will contain all the associated test code.

There will still be some work to do on XXH128(), but that part can be done separately.

easyaspi314 commented 5 years ago

Update on NEON32: Yeah, pragma clang loop unroll is not good on ARM.

I think it has to do with the tiny 64-byte L1 cache line (and only 32 KB of instruction cache in total).

unroll_count(2) gives me the best performance.

unroll_count(1) (no unroll): 7.4
unroll_count(2): 8.1
unroll_count(4): 6.2
unroll(enable): 5.2

How about we hand unroll the main loop twice and tell Clang to completely unroll on x86?

#if defined(__clang__) && !defined(__OPTIMIZE_SIZE__) && (defined(__i386__) || defined(__x86_64__))
#   pragma clang loop unroll(enable)
#endif
for (i = 0; i < NB_KEYS / 2; i++) {
     XXH3_accumulate512(...);
     XXH3_accumulate512(...);
}
easyaspi314 commented 5 years ago

Wait, I think it is because XXH3 might entirely fit in cache that way. My MacBook also has a 64-byte cache line. :thinking: (screenshots of the cache details omitted)

easyaspi314 commented 5 years ago

As for scalar ARM (thumbv6t2), I am getting 2.1 GB/s right now.

I had to use inline assembly because Clang assumed that the pointer was aligned and caused a bus error.

[termux ~/xxh3] $ ./xxhsum -b
./xxhsum 0.7.0 (32-bits arm little endian), Clang 7.0.1 (tags/RELEASE_701/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    20235 it/s ( 1976.1 MB/s)
XXH32 unaligned     :     102400 ->    18670 it/s ( 1823.2 MB/s)
XXH64               :     102400 ->     8646 it/s (  844.3 MB/s)
XXH64 unaligned     :     102400 ->     8363 it/s (  816.7 MB/s)
XXH3_64bits         :     102400 ->    21500 it/s ( 2099.6 MB/s)
XXH3_64b unaligned  :     102400 ->    21515 it/s ( 2101.0 MB/s)

(Edit: This is on my G3)

Cyan4973 commented 5 years ago

Clang assumed that the pointer was aligned and caused a bus error

Does that mean the scalar code path needs to be fixed? It shouldn't give the compiler a chance to make that mistake.

Cyan4973 commented 5 years ago

How about we hand unroll the main loop twice

A side effect is that it would add a restriction on custom vector key size. That's likely an acceptable side effect.

easyaspi314 commented 5 years ago

Does that mean the scalar code path needs to be fixed? It shouldn't give the compiler a chance to make that mistake.

XXH_read32 and XXH_read64 need to be fixed when targeting ARMv6. The scalar code itself appears to be fine.

Clang (edit: and GCC) are adding ldmib instructions to XXH32 when targeting ARMv6t2, and ldmib requires a 32-bit-aligned address.

.LBB1_3:
        ldr     r1, [r6]
        ldmib   r6, {r0, r7} // <- aaaaaa
        ldr     r4, [r6, #12]
        mla     r0, r0, r9, r3
        mla     r2, r7, r9, r2
        mla     r4, r4, r9, lr
        ror     r12, r0, #19
        mla     r0, r1, r9, r5
        ror     r7, r2, #19
        ror     r4, r4, #19
        ror     r1, r0, #19
        mul     lr, r4, r10
        mul     r2, r7, r10
        mul     r3, r12, r10
        mul     r5, r1, r10
        add     r6, r6, #16
        cmp     r6, r8
        blo     .LBB1_3

I'm trying to figure out how to fix this without inline assembly right now, because when I use inline assembly, it generates extra add instructions:

.LBB1_3:
        ldr     r1, [r0]
        mla     r1, r1, r10, r7
        ror     r12, r1, #19
        add     r1, r0, #12 @ <-
        ldr     r1, [r1]
        mul     r7, r12, r3
        mla     r1, r1, r10, r4
        ror     r6, r1, #19
        add     r1, r0, #8 @ <-
        ldr     r1, [r1]
        mul     r4, r6, r3
        mla     r1, r1, r10, r2
        ror     lr, r1, #19
        add     r1, r0, #4 @ <-
        ldr     r1, [r1]
        mul     r2, lr, r3
        mla     r1, r1, r10, r5
        add     r0, r0, #16
        cmp     r0, r9
        ror     r8, r1, #19
        mul     r5, r8, r3
        blo     .LBB1_3

What does seem to work is to use memcpy with the compiler flag -munaligned-access. That reliably disables the evil ldmib instructions. However, without -munaligned-access, it will always do byte-by-byte access.
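
For reference, the memcpy approach is just the usual trick; a minimal sketch (my own helper name, not necessarily the exact definition that will land in xxhash):

```c
#include <string.h>
#include <stdint.h>

/* Sketch of a memcpy-based unaligned read. With -munaligned-access, compilers
 * turn the memcpy into a single ldr; without it, they fall back to safe
 * byte-by-byte loads instead of ldm/ldmib. */
static uint32_t XXH_read32_sketch(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}
```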

I'll figure it out, don't hurt your head with it.

The alignment memes strike again. 🤦‍♂️

By the way, does this look correct so far?

Cyan4973 commented 5 years ago

XXH_read32 and XXH_read64 need to be fixed when targeting ARMv6.

xxhash is supposed to make the right choice automatically, depending on target. It could be that the set of macros is not correct for ARMv6.

One way to "help" it is to define XXH_FORCE_MEMORY_ACCESS at compile time. If one of its settings works, then it becomes a matter of finding the right set of macros so the automatic detection picks it on this target.
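
For example, something like this when compiling the library (the value meanings below should be double-checked against the current source):

```c
/* Force a specific access method when compiling xxhash.c, e.g. on the command line:
 *   cc -O3 -DXXH_FORCE_MEMORY_ACCESS=0 -c xxhash.c
 * Here 0 should select the memcpy-based reads (safest), 1 the packed-attribute
 * access, and 2 the direct cast. */
#define XXH_FORCE_MEMORY_ACCESS 0
```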

does this look correct so far?

There are minor changes that happened in the last 2 weeks. I would prefer to leave comments directly in the code; that would make it simpler for you to follow. But that doesn't work on the link provided. I guess I'll have to chase down the corresponding commit.

easyaspi314 commented 5 years ago

Side note: This:

 ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) \
                        || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) \
                        || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
 ( defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) \
                    || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) \
                    || defined(__ARM_ARCH_7S__) )

could easily be simplified to this:

(defined(__ARM_ARCH) && __ARM_ARCH == 6)
(defined(__ARM_ARCH) && __ARM_ARCH >= 7)

That would pick up on ARMv7VE (Cortex-A15) and aarch64, and it would look a lot nicer.

Anyway, I'll see what I can do about the memory access. I'm thinking just memcpy: once you add -munaligned-access to the flags, the code speeds up a lot, and even without it, it is completely safe from alignment memes. It isn't perfect, but it should be good enough, at least for now.

The only common ARMv6 device which is actually being used (and sold!) is the Nintendo 3DS. Nintendo just has a hardcore fetish for obsolete hardware and not giving up on dead consoles, and they have always been a pain.

easyaspi314 commented 5 years ago

By the way, I cleaned up my previous attempt at the folded 128-bit multiply for my clean version; this is it.

It is slightly faster than yours on Clang and significantly faster on GCC.

uint64_t mult_hd(uint64_t const lhs, uint64_t const rhs)
{
    uint64_t const lo_lo      = (lhs & 0xFFFFFFFF)  * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo      = ((lhs >> 32)        * (rhs & 0xFFFFFFFF)) + (lo_lo >> 32);
    uint64_t const lo_hi      = ((lhs & 0xFFFFFFFF) * (rhs >> 32))        + (hi_lo & 0xFFFFFFFF);
    uint64_t const product_hi = ((lhs >> 32)        * (rhs >> 32))        + (lo_hi >> 32) + (hi_lo >> 32);
    uint64_t const product_lo = (lo_lo & 0xFFFFFFFF) | (lo_hi << 32);

    return product_hi ^ product_lo;
}
gcc 8.3.0 -m32 -O3 -march=i686

mult_yann: 3.083638 5278c6468d75cb00
mult_hd: 2.321105 5278c6468d75cb00
mult_njuffa: 2.559749 5278c6468d75cb00
mult_accu: 2.511064 5278c6468d75cb00
mult_code_project: 2.418673 5278c6468d75cb00
mul_botan: 2.533887 5278c6468d75cb00

clang 8.0.0 -O3 -m32 -march=i686 

mult_yann: 2.124179 7be8a9a01751f680
mult_hd: 2.022154 7be8a9a01751f680
mult_njuffa: 2.300277 7be8a9a01751f680
mult_accu: 1.980426 7be8a9a01751f680
mult_code_project: 1.974299 7be8a9a01751f680
mul_botan: 2.270755 7be8a9a01751f680

Note: mult_code_project, mult_accu, and mult_hd all generate the same assembly from Clang, mainly because they are all based on the same code:

_mult_hd:                               ## @mult_hd
## %bb.0:
        push    ebp
        push    ebx
        push    edi
        push    esi
        push    eax
        mov     ecx, dword ptr [esp + 32]
        mov     ebx, dword ptr [esp + 36]
        mov     esi, dword ptr [esp + 24]
        mov     eax, ecx
        mul     esi
        mov     edi, edx
        mov     dword ptr [esp], eax    ## 4-byte Spill
        mov     eax, ecx
        mul     dword ptr [esp + 28]
        mov     ecx, edx
        mov     ebp, eax
        mov     eax, ebx
        mul     esi
        mov     esi, edx
        mov     ebx, eax
        mov     eax, dword ptr [esp + 36]
        mul     dword ptr [esp + 28]
        add     ebp, edi
        adc     eax, ecx
        adc     edx, 0
        add     ebp, ebx
        adc     eax, esi
        adc     edx, 0
        xor     eax, dword ptr [esp]    ## 4-byte Folded Reload
        xor     edx, ebp
        add     esp, 4
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret

Unfortunately, GCC generates considerably worse code on all of them, but this remains the best version:

_mult_hd:
        sub     esp, 44
        mov     dword ptr [esp + 28], ebx
        mov     ecx, dword ptr [esp + 48]
        mov     dword ptr [esp + 40], ebp
        mov     ebx, dword ptr [esp + 56]
        mov     dword ptr [esp + 36], edi
        mov     ebp, dword ptr [esp + 52]
        mov     dword ptr [esp + 32], esi
        mov     eax, ecx
        mul     ebx
        mov     dword ptr [esp + 4], edx
        mov     dword ptr [esp], eax
        mov     edi, dword ptr [esp + 4]
        mov     eax, ebx
        mul     ebp
        mov     esi, edi
        xor     edi, edi
        add     esi, eax
        mov     eax, ecx
        adc     edi, edx
        mul     dword ptr [esp + 60]
        mov     ecx, eax
        mov     ebx, edx
        mov     eax, ebp
        xor     edx, edx
        add     ecx, esi
        mov     dword ptr [esp + 8], ecx
        adc     ebx, edx
        xor     esi, esi
        mul     dword ptr [esp + 60]
        mov     dword ptr [esp + 12], ebx
        mov     ebx, dword ptr [esp]
        mov     ebp, dword ptr [esp + 12]
        mov     ecx, dword ptr [esp + 8]
        mov     dword ptr [esp + 16], eax
        mov     eax, edi
        add     eax, dword ptr [esp + 16]
        mov     dword ptr [esp + 20], edx
        mov     edx, esi
        mov     edi, ebp
        adc     edx, dword ptr [esp + 20]
        mov     esi, edi
        xor     ebp, ebp
        add     eax, esi
        mov     esi, dword ptr [esp + 32]
        mov     edi, eax
        mov     eax, ebx
        mov     ebx, dword ptr [esp + 28]
        adc     edx, ebp
        xor     eax, edi
        mov     ebp, dword ptr [esp + 40]
        xor     edx, ecx
        mov     edi, dword ptr [esp + 36]
        add     esp, 44
        ret

A lot of people from the GNU community ask why I dislike GCC so much.

I personally believe I have some valid reasons.
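
Side note: all of the 32-bit juggling above is only the fallback path. On 64-bit GCC/Clang targets, the folded multiply is trivial with the native 128-bit type; a sketch (my naming):

```c
#include <stdint.h>

/* Equivalent to mult_hd above, but on compilers that expose __uint128_t the
 * compiler emits a single widening multiply followed by an XOR. */
#ifdef __SIZEOF_INT128__
static uint64_t mult_fold128(uint64_t lhs, uint64_t rhs)
{
    __uint128_t const product = (__uint128_t)lhs * rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}
#endif
```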

Cyan4973 commented 5 years ago

We can certainly make those changes; they are nice improvements.

easyaspi314 commented 5 years ago

As for documentation, here is my idea for XXH3_accumulate_512.

/* This is the main mixer for large data blocks. It is written in SIMD code when possible.
 *
 *    acc = \left((data \oplus key) \bmod 2^{32}\right) \cdot \left\lfloor \frac{data \oplus key}{2^{32}} \right\rfloor + data + acc
 *
 * This is a modified version of the long multiply method used in UMAC and FARSH.
 * The changes are that data and key are XORed instead of added, and the original data
 * is directly added to the mix after the multiply to prevent multiply-by-zero issues. */

What are your thoughts on putting LaTeX in the comments? Wikipedia explains some of these hashes that way.

It expands to this:
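
The rendered image isn't reproduced here; reconstructed from the comment above, the formula reads:

```latex
\mathrm{acc} = \left((\mathit{data} \oplus \mathit{key}) \bmod 2^{32}\right)
               \cdot \left\lfloor \frac{\mathit{data} \oplus \mathit{key}}{2^{32}} \right\rfloor
               + \mathit{data} + \mathrm{acc}
```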