hzhuang1 commented 1 year ago

The whole patch set is in https://github.com/Cyan4973/xxHash/pull/748.

This patch set includes the following features:

- Change the dispatch breakpoint to XXH3_accumulate() (https://github.com/Cyan4973/xxHash/pull/744). This pull request prepares for ARM SVE dispatch.
- Add SVE intrinsic code for XXH3.
- Use dispatch as a common framework for both x86 and aarch64. Import the assembly implementation of aarch64 SVE.
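
To make the idea of dispatching at the accumulate-loop level concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `accumulate_fn` signature, `select_accumulate()`, the `impl_id` enum and the toy mixing step are hypothetical and are not the xxHash API or the real XXH3 round. It only shows the shape: one function pointer covers the whole stripe loop, so each architecture can supply its own loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIPE_LEN 64   /* bytes per stripe (toy value for this sketch) */
#define ACC_NB      8   /* number of 64-bit accumulator lanes */

/* Hypothetical loop-level signature: one call consumes nb_stripes stripes. */
typedef void (*accumulate_fn)(uint64_t *acc, const uint8_t *input, size_t nb_stripes);

/* Endian-independent 64-bit little-endian load. */
static uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (size_t b = 0; b < 8; b++)
        v |= (uint64_t)p[b] << (8 * b);
    return v;
}

/* Reference loop: stripe-outer, lane-inner. Toy mixing, NOT the XXH3 round. */
static void accumulate_scalar(uint64_t *acc, const uint8_t *input, size_t nb_stripes)
{
    assert(acc != NULL && input != NULL);
    for (size_t s = 0; s < nb_stripes; s++)
        for (size_t lane = 0; lane < ACC_NB; lane++)
            acc[lane] += read_le64(input + s * STRIPE_LEN + lane * 8)
                         * 0x9E3779B185EBCA87ULL;
}

/* Stand-in for an architecture-specific loop (for example an SVE one).
 * Because it owns the whole loop, it is free to reorder the work:
 * here it goes lane-outer and keeps the running lane total in a local. */
static void accumulate_alt(uint64_t *acc, const uint8_t *input, size_t nb_stripes)
{
    assert(acc != NULL && input != NULL);
    for (size_t lane = 0; lane < ACC_NB; lane++) {
        uint64_t a = acc[lane];
        for (size_t s = 0; s < nb_stripes; s++)
            a += read_le64(input + s * STRIPE_LEN + lane * 8)
                 * 0x9E3779B185EBCA87ULL;
        acc[lane] = a;
    }
}

typedef enum { IMPL_SCALAR, IMPL_ALT } impl_id;

/* A real dispatcher would pick by CPU feature detection; here the
 * caller passes the choice explicitly to keep the sketch portable. */
static accumulate_fn select_accumulate(impl_id id)
{
    switch (id) {
    case IMPL_ALT: return accumulate_alt;
    default:       return accumulate_scalar;
    }
}
```

A caller would fetch the pointer once with `select_accumulate(...)` and then invoke it per block of stripes; both variants must leave identical values in `acc`.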

hzhuang1 commented 1 year ago

Let's start from #744.

hzhuang1 commented 1 year ago

Let's start from #744.

I thought about this for a while. The effect of #744 isn't intuitive, so I created #752, which only adds ARM SVE intrinsic support.

With #752, performance on the test platform is actually worse than scalar. But it is only the intrinsic implementation, meant to be easy to review and to serve as a starting point for optimization.

After #752, we can pick #744 back up. It exposes the XXH3_accumulate() interface to all architectures. With this self-maintained interface, an implementation can avoid frequent memory accesses without hacking xxHash itself, which improves performance considerably.
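
As a rough illustration of why a loop-level interface can cut memory traffic, the sketch below compares a per-stripe kernel, which writes the accumulators back to memory on every call, with a whole-loop kernel that loads them once, works in locals, and stores once. All names (`hash_per_stripe`, `hash_whole_loop`, `mix`, `g_acc_stores`) and the mixing step are assumptions made up for this example, not xxHash code. The store counter models the case where the per-stripe call is opaque to the compiler; with inlining, a compiler may remove some of those stores on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4               /* toy lane count, not the XXH3 value */

static size_t g_acc_stores;   /* counts writes to the accumulator array */

/* Toy mixing step (an assumption for this sketch, not the XXH3 round). */
static uint64_t mix(uint64_t acc, uint64_t in)
{
    return (acc ^ in) * 0x9E3779B185EBCA87ULL + 1;
}

/* Per-stripe interface: accumulator state lives in memory between calls. */
static void accumulate_one_stripe(uint64_t *acc, const uint64_t *stripe)
{
    for (size_t i = 0; i < LANES; i++) {
        acc[i] = mix(acc[i], stripe[i]);
        g_acc_stores++;
    }
}

static void hash_per_stripe(uint64_t *acc, const uint64_t *input, size_t nb_stripes)
{
    assert(acc != NULL && input != NULL);
    for (size_t s = 0; s < nb_stripes; s++)
        accumulate_one_stripe(acc, input + s * LANES);
}

/* Loop-level interface: load once, iterate in locals, store once. */
static void hash_whole_loop(uint64_t *acc, const uint64_t *input, size_t nb_stripes)
{
    assert(acc != NULL && input != NULL);
    uint64_t a0 = acc[0], a1 = acc[1], a2 = acc[2], a3 = acc[3];
    for (size_t s = 0; s < nb_stripes; s++) {
        const uint64_t *stripe = input + s * LANES;
        a0 = mix(a0, stripe[0]);
        a1 = mix(a1, stripe[1]);
        a2 = mix(a2, stripe[2]);
        a3 = mix(a3, stripe[3]);
    }
    acc[0] = a0; acc[1] = a1; acc[2] = a2; acc[3] = a3;
    g_acc_stores += LANES;
}
```

For `nb_stripes` stripes, the per-stripe variant records `nb_stripes * LANES` accumulator stores while the whole-loop variant records `LANES`, and both produce the same accumulator values.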

Once both of them are handled, we can continue with the assembly implementation.

Logically, this new sequence should be much more intuitive.

hzhuang1 commented 1 year ago

#752 is merged. Thanks a lot.

Now we're moving on to #756, which simplifies #744. With this patch, the full accumulation loop can be customized per architecture. On SVE, we can avoid accessing the stack and apply SVE-specific prefetch instructions. Performance improves a lot.
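
For readers unfamiliar with in-loop prefetching, here is a small portable sketch of the pattern. It does not use SVE instructions: it relies on the GCC/Clang `__builtin_prefetch` builtin as a stand-in and falls back to a no-op elsewhere. The function name `sum_stripes_prefetch`, the stripe layout, the look-ahead distance and the arithmetic are all assumptions for illustration and are unrelated to the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)  /* read, keep in cache */
#else
#  define PREFETCH_READ(p) ((void)(p))                    /* no-op fallback */
#endif

#define STRIPE_WORDS   8   /* 64-byte stripes of 64-bit words (toy layout) */
#define PREFETCH_AHEAD 4   /* stripes to look ahead; a tuning guess */

/* Owning the whole loop lets the kernel request a future stripe
 * while it is still consuming the current one. */
static uint64_t sum_stripes_prefetch(const uint64_t *input, size_t nb_stripes)
{
    assert(input != NULL || nb_stripes == 0);
    uint64_t acc = 0;
    for (size_t s = 0; s < nb_stripes; s++) {
        /* Only prefetch addresses that stay inside the buffer. */
        if (s + PREFETCH_AHEAD < nb_stripes)
            PREFETCH_READ(input + (s + PREFETCH_AHEAD) * STRIPE_WORDS);

        const uint64_t *stripe = input + s * STRIPE_WORDS;
        for (size_t i = 0; i < STRIPE_WORDS; i++)
            acc += stripe[i] * (uint64_t)(i + 1);   /* toy weighted sum */
    }
    return acc;
}
```

The prefetch is only a hint, so the result is the same with or without it. Whether it helps, and what look-ahead distance works best, depends on the hardware and has to be measured.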

Cyan4973 / xxHash

Implement ARM SVE optimization with assembly code #751
