Closed by andrewthad 4 years ago
Uh-oh. This is correct, but the performance is awful. The problem is that, rather than using a few registers as I would expect, GHC uses all of the registers and then spills onto the stack. Here's the Cmm:
[reserved#_entry() // [R2]
{ info_tbls: [(c1aKA,
label: reserved#_info
rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 4} }
srt: Nothing)]
stack_info: arg_space: 8 updfr_space: Just 8
}
{offset
c1aKA:
(_c1aKF::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3405803776);
(_c1aKS::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3325256704);
(_c1aL5::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3227017984);
(_c1aLi::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3221225984);
(_c1aLv::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3221225472);
(_c1aLI::I64) = call MO_Ctz W64(R2 & 4294901760 ^ 3232235520);
(_c1aLV::I64) = call MO_Ctz W64(R2 & 4294901760 ^ 2851995648);
(_c1aM8::I64) = call MO_Ctz W64(R2 & 4294836224 ^ 3323068416);
(_c1aMl::I64) = call MO_Ctz W64(R2 & 4293918720 ^ 2886729728);
(_c1aMy::I64) = call MO_Ctz W64(R2 & 4290772992 ^ 1681915904);
(_c1aML::I64) = call MO_Ctz W64(R2 & 4278190080 ^ 2130706432);
(_c1aMV::I64) = call MO_Ctz W64(R2 & 4278190080);
(_c1aN8::I64) = call MO_Ctz W64(R2 & 4278190080 ^ 167772160);
(_c1aNl::I64) = call MO_Ctz W64(R2 & 4026531840 ^ 4026531840);
(_c1aNy::I64) = call MO_Ctz W64(R2 & 4026531840 ^ 3758096384);
R1 = _c1aNy::I64 | _c1aNl::I64 | _c1aN8::I64 | _c1aMV::I64 | _c1aML::I64 | _c1aMy::I64 | _c1aMl::I64 | _c1aM8::I64 | _c1aLV::I64 | _c1aLI::I64 | _c1aLv::I64 | _c1aLi::I64 | _c1aL5::I64 | _c1aKS::I64 | _c1aKF::I64 >> 5;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
},
section ""data" . reserved#_closure" {
reserved#_closure:
const reserved#_info;
}]
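For reference, the branchless idea this Cmm encodes is roughly the following Haskell. This is a hedged reconstruction rather than the exact source on this branch; the inBlock helper and the three example blocks shown are illustrative.

import Data.Bits (countTrailingZeros, unsafeShiftR, xor, (.&.), (.|.))
import Data.Word (Word32)

-- ctz of the masked-and-XORed address: 32 only when the address lies in the block.
inBlock :: Word32 -> Word32 -> Word32 -> Int
inBlock w mask network = countTrailingZeros ((w .&. mask) `xor` network)

-- Branchless membership: a miss yields a ctz of at most 31 (bit 5 clear), a hit
-- yields 32 (bit 5 set), so the shifted OR is nonzero iff some block matched.
reservedBranchless :: Word32 -> Bool
reservedBranchless w =
  ( inBlock w 0xFF000000 0x0A000000      -- 10.0.0.0/8
    .|. inBlock w 0xFFF00000 0xAC100000  -- 172.16.0.0/12
    .|. inBlock w 0xFFFF0000 0xC0A80000  -- 192.168.0.0/16
  ) `unsafeShiftR` 5 /= 0

Every comparison against a block is independent, which is why GHC computes all fifteen ctz results up front in the dump above and runs out of registers.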
And here is the assembly, truncated after the stack spilling becomes clear:
_c1aKA:
movl $3405803776,%eax
movl $4294967040,%ebx
movq %r14,%rcx
andq %rbx,%rcx
xorq %rax,%rcx
bsf %rcx,%rax
movl $64,%ebx
cmovne %rax,%rbx
movl $3325256704,%eax
movl $4294967040,%ecx
movq %r14,%rdx
andq %rcx,%rdx
xorq %rax,%rdx
bsf %rdx,%rax
movl $64,%ecx
cmovne %rax,%rcx
movl $3227017984,%eax
movl $4294967040,%edx
movq %r14,%rsi
andq %rdx,%rsi
xorq %rax,%rsi
bsf %rsi,%rax
movl $64,%edx
cmovne %rax,%rdx
movl $3221225984,%eax
movl $4294967040,%esi
movq %r14,%rdi
andq %rsi,%rdi
xorq %rax,%rdi
bsf %rdi,%rax
movl $64,%esi
cmovne %rax,%rsi
movl $3221225472,%eax
movl $4294967040,%edi
movq %r14,%r8
andq %rdi,%r8
xorq %rax,%r8
bsf %r8,%rax
movl $64,%edi
cmovne %rax,%rdi
movl $3232235520,%eax
movl $4294901760,%r8d
movq %r14,%r9
andq %r8,%r9
xorq %rax,%r9
bsf %r9,%rax
movl $64,%r8d
cmovne %rax,%r8
movl $2851995648,%eax
movl $4294901760,%r9d
movq %r14,%r10
andq %r9,%r10
xorq %rax,%r10
bsf %r10,%rax
movl $64,%r9d
cmovne %rax,%r9
movl $3323068416,%eax
movl $4294836224,%r10d
movq %r14,%r11
andq %r10,%r11
xorq %rax,%r11
bsf %r11,%rax
movl $64,%r10d
cmovne %rax,%r10
movl $2886729728,%eax
movl $4293918720,%r11d
movq %rbx,64(%rsp)
movq %r14,%rbx
andq %r11,%rbx
xorq %rax,%rbx
bsf %rbx,%rax
movl $64,%ebx
cmovne %rax,%rbx
movl $4290772992,%eax
movq %r14,%r11
andq %rax,%r11
xorq $1681915904,%r11
bsf %r11,%rax
movl $64,%r11d
cmovne %rax,%r11
movl $4278190080,%eax
movq %rcx,72(%rsp)
movq %r14,%rcx
andq %rax,%rcx
xorq $2130706432,%rcx
bsf %rcx,%rax
movl $64,%ecx
cmovne %rax,%rcx
movl $4278190080,%eax
movq %rdx,80(%rsp)
movq %r14,%rdx
andq %rax,%rdx
bsf %rdx,%rax
movl $64,%edx
cmovne %rax,%rdx
movl $4278190080,%eax
movq %rsi,88(%rsp)
movq %r14,%rsi
andq %rax,%rsi
xorq $167772160,%rsi
bsf %rsi,%rax
movl $64,%esi
cmovne %rax,%rsi
movl $4026531840,%eax
movq %rdi,96(%rsp)
movl $4026531840,%edi
movq %r8,104(%rsp)
movq %r14,%r8
andq %rdi,%r8
...
Darn.
Before changing the implementation of reserved:
benchmarking CIDR Inclusion/reserved
time 551.4 ns (549.1 ns .. 554.5 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 553.1 ns (549.1 ns .. 560.5 ns)
std dev 17.57 ns (10.45 ns .. 28.96 ns)
variance introduced by outliers: 45% (moderately inflated)
After changing it by writing custom Cmm (and turning on -msse4.2 and -mbmi2):
benchmarking CIDR Inclusion/reserved
time 887.0 ns (880.9 ns .. 893.6 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 883.1 ns (878.8 ns .. 892.1 ns)
std dev 19.85 ns (12.08 ns .. 32.64 ns)
variance introduced by outliers: 28% (moderately inflated)
It appears that this attempted optimization does not pay its way.
Dang, that sucks. Does that most recent benchmark include the handwritten Cmm?
It does include the handwritten Cmm (which requires a patch to GHC to even build). It is unfortunate, since I expected the non-branching approach to do well, but at least in microbenchmarks it just doesn't pay off. However, in https://github.com/andrewthad/haskell-ip/pull/63, I've come up with something that doubles the performance of this function and compiles without fancy hacks.
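For illustration only (this is not the code from that PR; the leadingOctet helper and the block list are made up), a plain branching formulation of a reserved-range check looks something like this, and it avoids computing a pile of independent ctz results at once:

import Data.Bits (unsafeShiftR)
import Data.Word (Word32)

-- Leading octet of an IPv4 address stored big-endian in a Word32.
leadingOctet :: Word32 -> Word32
leadingOctet w = w `unsafeShiftR` 24

reservedSimple :: Word32 -> Bool
reservedSimple w = case leadingOctet w of
  0   -> True                                -- 0.0.0.0/8
  10  -> True                                -- 10.0.0.0/8
  127 -> True                                -- 127.0.0.0/8
  172 -> w >= 0xAC100000 && w < 0xAC200000   -- 172.16.0.0/12
  192 -> w >= 0xC0A80000 && w < 0xC0A90000   -- 192.168.0.0/16
  _   -> False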
Closing this PR since this appears to be a dead end.
Add tests and benchmarks for reserved range testing, and remove the check for 255.255.255.255 since it is redundant.
This will resolve https://github.com/andrewthad/haskell-ip/issues/60 when completed.
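As a rough sketch of the kind of spot-checks such tests could include (the Net.IPv4.fromOctets and Net.IPv4.reserved names are assumed from the haskell-ip API, and the specific cases are illustrative):

import qualified Net.IPv4 as IPv4

main :: IO ()
main = do
  assert "10.0.0.1 is reserved"    (IPv4.reserved (IPv4.fromOctets 10 0 0 1))
  assert "192.168.1.1 is reserved" (IPv4.reserved (IPv4.fromOctets 192 168 1 1))
  assert "8.8.8.8 is not reserved" (not (IPv4.reserved (IPv4.fromOctets 8 8 8 8)))
  where
    -- Minimal check helper: print on success, die with the label on failure.
    assert label ok =
      if ok then putStrLn ("ok: " ++ label) else error ("failed: " ++ label)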