Open hzhuang1 opened 1 year ago
Let's start from #744.
I thought about this for a while. The effect of #744 isn't intuitive, so I created #752, which only adds ARM SVE intrinsic support.
In #752, performance on the test platform is actually worse than the scalar path. However, it is only the intrinsic implementation, meant to be easy to review and to serve as a starting point for optimization.
After #752, we can return to #744, which exposes the XXH3_accumulate() interface to all architectures. With this self-maintained interface, we can avoid frequent memory accesses without hacking xxHash, which improves performance significantly.
Once both of them are handled, we can continue with the assembly implementation.
Logically, this new sequence should be much more intuitive.
Now we're moving to #756, which simplifies #744. With this patch, the full accumulation loop can be customized for each architecture. On SVE, we can avoid accessing the stack and use SVE-specific prefetch instructions. Performance improves considerably.
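
To illustrate the idea of customizing the whole accumulation loop, here is a minimal scalar sketch in plain C. It is not the code from #756 or the actual xxHash source: the names `accumulate_stripe`, `accumulate_generic`, `accumulate_custom`, `accumulate`, and the `ACCUMULATE` macro are hypothetical, and the byte reads use native endianness for brevity. The generic version updates the accumulator array in memory once per stripe, while the custom version copies the accumulators into locals once, runs the whole loop on them, and writes them back at the end. An SVE implementation would presumably hold those values in vector registers instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIPE_LEN 64           /* bytes consumed per stripe */
#define ACC_NB 8                /* eight 64-bit accumulator lanes */
#define SECRET_CONSUME_RATE 8   /* secret offset advance per stripe */

/* Unaligned 64-bit read (native endianness, which is enough for this sketch). */
static uint64_t read64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* One stripe: mix 64 input bytes with 64 secret bytes into the accumulators. */
static void accumulate_stripe(uint64_t *acc, const uint8_t *in, const uint8_t *secret)
{
    for (size_t i = 0; i < ACC_NB; i++) {
        uint64_t v = read64(in + 8 * i);
        uint64_t k = v ^ read64(secret + 8 * i);
        acc[i ^ 1] += v;                          /* swap adjacent lanes */
        acc[i] += (k & 0xFFFFFFFFu) * (k >> 32);  /* 32x32 -> 64 multiply */
    }
}

/* Generic loop: the accumulators are read and written in memory per stripe. */
static void accumulate_generic(uint64_t *acc, const uint8_t *in,
                               const uint8_t *secret, size_t nb_stripes)
{
    for (size_t n = 0; n < nb_stripes; n++)
        accumulate_stripe(acc, in + n * STRIPE_LEN, secret + n * SECRET_CONSUME_RATE);
}

/* Hypothetical per-architecture override: load the accumulators once,
 * run the full loop on local copies, and store them once at the end. */
static void accumulate_custom(uint64_t *acc, const uint8_t *in,
                              const uint8_t *secret, size_t nb_stripes)
{
    uint64_t local[ACC_NB];
    memcpy(local, acc, sizeof local);
    for (size_t n = 0; n < nb_stripes; n++)
        accumulate_stripe(local, in + n * STRIPE_LEN, secret + n * SECRET_CONSUME_RATE);
    memcpy(acc, local, sizeof local);
}

/* An architecture can select its own full loop by defining ACCUMULATE. */
#ifndef ACCUMULATE
#define ACCUMULATE accumulate_custom
#endif

static void accumulate(uint64_t *acc, const uint8_t *in,
                       const uint8_t *secret, size_t nb_stripes)
{
    ACCUMULATE(acc, in, secret, nb_stripes);
}
```

Both loops must produce identical accumulators; only the memory traffic differs. Whether the local copies really stay in registers is up to the compiler in this scalar sketch.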
The whole patch set is in https://github.com/Cyan4973/xxHash/pull/748.
This patch set includes several features.