Now I remember where I went wrong on rabeld. I tried to implement SSE intrinsics throughout, doing stuff like aligning data structures, and I got nowhere for a variety of reasons: notably, I couldn't keep stuff in xmm registers without creating types for it everywhere, and you can't do a real memcmp that way. So I got xor to work (I should rename these to memeqX), and found out that gcc 7 on x86_64 already inlines memcmp(a,b,c) != 0 into the right thing, but the real bottleneck is the memcmp used for the less-than/greater-than comparison in the core find_route_slot routine. What that approach gives you is the first byte that differs, but you have to load that byte again to find out which way the comparison goes. And even that is wrong, because apparently it also checks for a null byte.
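To make the problem concrete, here is roughly the shape of that kind of lookup: a binary search over a sorted table of 16-byte keys, where you need the sign of the comparison, not just equal/not-equal. The struct and names here are just for illustration, not the actual rabeld code:

#include <string.h>

struct route_entry {
    unsigned char prefix[16]; /* illustrative 16-byte key */
};

/* Illustrative binary search: returns the slot index if found, or the
   insertion point if not. Not rabeld's actual find_route_slot. */
static int find_slot(const struct route_entry *table, int n,
                     const unsigned char key[16], int *found)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = memcmp(key, table[mid].prefix, 16); /* needs <, =, > */
        if (c == 0) { *found = 1; return mid; }
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    *found = 0;
    return lo;
}

The xor/memeq code I did get working is below; it covers the equality case, but not this ordering case.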
#ifdef HAVE_SSE
#include <stdbool.h>
#include <emmintrin.h> /* SSE2 intrinsics used below */

/* inline size_t xor16(const unsigned char *p1, const unsigned char *p2) {
       return _mm_cmpistrc((const __m128i *)p1, (const __m128i *)p2, 0);
   } */

inline bool xor16(const unsigned char *a, const unsigned char *b) {
    __m128i xmm0, xmm1;
    unsigned int eax;
    xmm0 = _mm_loadu_si128((const __m128i *)(a));
    xmm1 = _mm_loadu_si128((const __m128i *)(b));
    xmm0 = _mm_cmpeq_epi8(xmm0, xmm1);
    eax = _mm_movemask_epi8(xmm0);
    return !(eax == 0xffff); /* eax == 0xffff means all 16 bytes are equal */
}
#endif
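A three-way compare can be built along the same lines as xor16: use _mm_cmpeq_epi8 and _mm_movemask_epi8 to find the first differing byte, then compare just that one byte to get the sign. This is only a sketch of the idea: the memcmp16 name is mine, it assumes 16-byte operands, and it leans on gcc's __builtin_ctz.

#include <emmintrin.h>

/* Sketch: memcmp-style three-way compare of two 16-byte blocks.
   Assumes both pointers reference 16 readable bytes. */
static inline int memcmp16(const unsigned char *a, const unsigned char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    unsigned int neq = ~_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) & 0xffff;
    if (neq == 0)
        return 0;                 /* all 16 bytes equal */
    int i = __builtin_ctz(neq);   /* index of the first differing byte */
    return (int)a[i] - (int)b[i]; /* same sign convention as memcmp */
}

Whether this actually beats what gcc emits for a 16-byte memcmp is something you'd have to benchmark.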