The generated asm doesn't suggest that the code is pipelining-friendly either:
8BC1 mov eax, ecx
C1E810 shr eax, 16
33C1 xor eax, ecx
69C86BCAEB85 imul ecx, eax, 0xFFFFFFFF85EBCA6B
8BC1 mov eax, ecx ; the second source line cannot start until the imul above has written ecx
C1E80D shr eax, 13
33C1 xor eax, ecx
69C835AEB2C2 imul ecx, eax, 0xFFFFFFFFC2B2AE35
8BC1 mov eax, ecx ; same again: the return value needs ecx from the previous imul
C1E810 shr eax, 16
33C1 xor eax, ecx
Are you sure this code is pipelining-friendly? It looks like quite the opposite to me: every line depends on the result of the previous line. https://github.com/Wsm2110/Faster.Map/blob/10ed34a1f5c3428cdb1dd5910e0597645343d895/src/DenseMapSIMD.cs#L812-L815
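To make the dependency chain concrete, here is a minimal C sketch of the mix that the asm listing appears to implement. The shifts and multipliers are taken from the disassembly; the names `fmix32` and `fmix32_many` are mine, and I am assuming (not claiming) that this matches the linked C# lines. Within one call, each statement reads the `h` written by the statement before it, so nothing inside a single hash can overlap. The only overlap I can see is across independent keys, as in the loop at the bottom.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; constants and shifts copied from the asm listing. */
static uint32_t fmix32(uint32_t h)
{
    h ^= h >> 16;       /* mov / shr 16 / xor                      */
    h *= 0x85EBCA6Bu;   /* imul: needs the xor result above        */
    h ^= h >> 13;       /* needs the product above                 */
    h *= 0xC2B2AE35u;   /* imul: needs the xor result above        */
    h ^= h >> 16;       /* final step: needs the product above     */
    return h;
}

/* Hypothetical helper: hashes n independent keys. The iterations do not
   depend on each other, so an out-of-order CPU is free to overlap the
   serial chain of one key with the chain of the next. */
static void fmix32_many(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = fmix32(in[i]);
}
```

So for a single lookup the latency is roughly the sum of the steps; any pipelining benefit would have to come from several lookups being in flight at once.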
Am I missing something?