erthink opened this issue 4 years ago
Congratulations, my Russian friend!
More details:

> no `mul_64x64_128()` at all.

> t1ha_v3: E2K and x86 with AVX512

By targeting that, you support `vpmullq`, which has 15 cycles of latency on Skylake-X. It is faster to do 4 individual `imulq` instructions and use ILP, unless you are in a scenario where the cost is made up by the efficiency of the other operations.

> AVX2, Neon, SSE2

By targeting this, you support a much wider range of consumer hardware. As for SSE, if you don't want to go as low as SSE2, I suggest SSE4.1, as that targets the important Core 2 Duo which is still found in the consumer market. You still need a runtime compatibility check, as you can still run into the occasional Conroe or Pentium 4.
Target the present, stay compatible with the past, and leave room for the future.
You don't have to go as far as @Cyan4973 and I went with XXH3 (we wanted to keep the minimum close to XXH32), but keep in mind that the consumer will not be using the latest and greatest (or know which build is most optimized for their CPU, instead downloading the 32-bit version because "it works").
Programmers like things they can just drop in without worrying about compatibility/slowing down.
Moreover, I have information (but don't ask me about the source, please) that allows me to conclude that in all future generations of processors the wide multiplication will be performed slightly slower (i.e. on x86), and on architectures with a separate instruction (ARM, MIPS) the full multiplication will always start from scratch (without trying to reuse the result of the previous narrow multiplication). This is the complete opposite of what was known a few years ago. So, no `mul_64x64_128()` at all.
That is true, which is why XXH3 only uses it in medium lengths, as these multiplies cost 11 cycles on aarch64, 12 cycles on ARMv6, and are pretty expensive on every other 32-bit architecture.
As for aarch64, the cycle counts suggest that it only has a hardware 32->64 multiply and it emulates the full multiplies the same way it would be done on 32-bit in micro-ops.
That is why XXH64 is rather mediocre on it, only getting 3080 MB/s on the xxHash bench, compared to 3100 MB/s on XXH32 and 5800 MB/s on XXH3.
@easyaspi314, thanks for the feedback!
I'm doing a very different job now, but I think it is reasonable to clarify some aspects:
Wow, (magically) my t1ha_v3 draft has a lot in common with SHISHUA by @espadrine. If things go on like this, there will be nothing left for me to do.
Newton and Leibniz were both great mathematicians who independently invented calculus. Looking forward to your great work~
`t1ha3_dirty` = favour speed over quality: non-uniform, all hardware tricks (i.e. AES-NI), one injection point and blinding multiplication are allowed;
`t1ha3_fair` = favour quality over speed: two injection points, no known drawbacks.

`t1ha3_dirty`: should be faster than competitors (e.g. `xxHash_v3`, `MeowHash` and `wyhash_v777`), while not inferior in quality; or at least the same speed with better quality (i.e. without their several drawbacks).
`t1ha3_fair`: proven superiority in quality over competitors, with minimal performance degradation.

`t1ha3_dirty`: the native byte order is always used;
`t1ha3_fair`: Little Endian byte order is primary, with bswap on BE platforms.

Schedule: