Closed by andrewthad 4 years ago
Uh-oh. This is correct, but the performance is awful. The problem is that, rather than using a few registers as I would expect, GHC uses all of the registers and then spills onto the stack. Here's the Cmm:
[reserved#_entry() // [R2]
{ info_tbls: [(c1aKA,
label: reserved#_info
rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 4} }
srt: Nothing)]
stack_info: arg_space: 8 updfr_space: Just 8
}
{offset
c1aKA:
(_c1aKF::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3405803776);
(_c1aKS::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3325256704);
(_c1aL5::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3227017984);
(_c1aLi::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3221225984);
(_c1aLv::I64) = call MO_Ctz W64(R2 & 4294967040 ^ 3221225472);
(_c1aLI::I64) = call MO_Ctz W64(R2 & 4294901760 ^ 3232235520);
(_c1aLV::I64) = call MO_Ctz W64(R2 & 4294901760 ^ 2851995648);
(_c1aM8::I64) = call MO_Ctz W64(R2 & 4294836224 ^ 3323068416);
(_c1aMl::I64) = call MO_Ctz W64(R2 & 4293918720 ^ 2886729728);
(_c1aMy::I64) = call MO_Ctz W64(R2 & 4290772992 ^ 1681915904);
(_c1aML::I64) = call MO_Ctz W64(R2 & 4278190080 ^ 2130706432);
(_c1aMV::I64) = call MO_Ctz W64(R2 & 4278190080);
(_c1aN8::I64) = call MO_Ctz W64(R2 & 4278190080 ^ 167772160);
(_c1aNl::I64) = call MO_Ctz W64(R2 & 4026531840 ^ 4026531840);
(_c1aNy::I64) = call MO_Ctz W64(R2 & 4026531840 ^ 3758096384);
R1 = _c1aNy::I64 | _c1aNl::I64 | _c1aN8::I64 | _c1aMV::I64 | _c1aML::I64 | _c1aMy::I64 | _c1aMl::I64 | _c1aM8::I64 | _c1aLV::I64 | _c1aLI::I64 | _c1aLv::I64 | _c1aLi::I64 | _c1aL5::I64 | _c1aKS::I64 | _c1aKF::I64 >> 5;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
},
section ""data" . reserved#_closure" {
reserved#_closure:
const reserved#_info;
}]
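For reference, the branchless idea this Cmm encodes is roughly the following Haskell. This is a hedged reconstruction rather than the exact source on this branch; the inBlock helper and the three example blocks shown are illustrative.

import Data.Bits (countTrailingZeros, unsafeShiftR, xor, (.&.), (.|.))
import Data.Word (Word32)

-- ctz of the masked-and-XORed address: 32 only when the address lies in the block.
inBlock :: Word32 -> Word32 -> Word32 -> Int
inBlock w mask network = countTrailingZeros ((w .&. mask) `xor` network)

-- Branchless membership: a miss yields a ctz of at most 31 (bit 5 clear), a hit
-- yields 32 (bit 5 set), so the shifted OR is nonzero iff some block matched.
reservedBranchless :: Word32 -> Bool
reservedBranchless w =
  ( inBlock w 0xFF000000 0x0A000000      -- 10.0.0.0/8
    .|. inBlock w 0xFFF00000 0xAC100000  -- 172.16.0.0/12
    .|. inBlock w 0xFFFF0000 0xC0A80000  -- 192.168.0.0/16
  ) `unsafeShiftR` 5 /= 0

Every comparison against a block is independent, which is why GHC computes all fifteen ctz results up front in the dump above and runs out of registers.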
And here is the assembly, truncated after the stack spilling becomes clear:
_c1aKA:
movl $3405803776,%eax
movl $4294967040,%ebx
movq %r14,%rcx
andq %rbx,%rcx
xorq %rax,%rcx
bsf %rcx,%rax
movl $64,%ebx
cmovne %rax,%rbx
movl $3325256704,%eax
movl $4294967040,%ecx
movq %r14,%rdx
andq %rcx,%rdx
xorq %rax,%rdx
bsf %rdx,%rax
movl $64,%ecx
cmovne %rax,%rcx
movl $3227017984,%eax
movl $4294967040,%edx
movq %r14,%rsi
andq %rdx,%rsi
xorq %rax,%rsi
bsf %rsi,%rax
movl $64,%edx
cmovne %rax,%rdx
movl $3221225984,%eax
movl $4294967040,%esi
movq %r14,%rdi
andq %rsi,%rdi
xorq %rax,%rdi
bsf %rdi,%rax
movl $64,%esi
cmovne %rax,%rsi
movl $3221225472,%eax
movl $4294967040,%edi
movq %r14,%r8
andq %rdi,%r8
xorq %rax,%r8
bsf %r8,%rax
movl $64,%edi
cmovne %rax,%rdi
movl $3232235520,%eax
movl $4294901760,%r8d
movq %r14,%r9
andq %r8,%r9
xorq %rax,%r9
bsf %r9,%rax
movl $64,%r8d
cmovne %rax,%r8
movl $2851995648,%eax
movl $4294901760,%r9d
movq %r14,%r10
andq %r9,%r10
xorq %rax,%r10
bsf %r10,%rax
movl $64,%r9d
cmovne %rax,%r9
movl $3323068416,%eax
movl $4294836224,%r10d
movq %r14,%r11
andq %r10,%r11
xorq %rax,%r11
bsf %r11,%rax
movl $64,%r10d
cmovne %rax,%r10
movl $2886729728,%eax
movl $4293918720,%r11d
movq %rbx,64(%rsp)
movq %r14,%rbx
andq %r11,%rbx
xorq %rax,%rbx
bsf %rbx,%rax
movl $64,%ebx
cmovne %rax,%rbx
movl $4290772992,%eax
movq %r14,%r11
andq %rax,%r11
xorq $1681915904,%r11
bsf %r11,%rax
movl $64,%r11d
cmovne %rax,%r11
movl $4278190080,%eax
movq %rcx,72(%rsp)
movq %r14,%rcx
andq %rax,%rcx
xorq $2130706432,%rcx
bsf %rcx,%rax
movl $64,%ecx
cmovne %rax,%rcx
movl $4278190080,%eax
movq %rdx,80(%rsp)
movq %r14,%rdx
andq %rax,%rdx
bsf %rdx,%rax
movl $64,%edx
cmovne %rax,%rdx
movl $4278190080,%eax
movq %rsi,88(%rsp)
movq %r14,%rsi
andq %rax,%rsi
xorq $167772160,%rsi
bsf %rsi,%rax
movl $64,%esi
cmovne %rax,%rsi
movl $4026531840,%eax
movq %rdi,96(%rsp)
movl $4026531840,%edi
movq %r8,104(%rsp)
movq %r14,%r8
andq %rdi,%r8
...
Darn.
Before changing the implementation of reserved:
benchmarking CIDR Inclusion/reserved
time 551.4 ns (549.1 ns .. 554.5 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 553.1 ns (549.1 ns .. 560.5 ns)
std dev 17.57 ns (10.45 ns .. 28.96 ns)
variance introduced by outliers: 45% (moderately inflated)
After changing it by writing custom Cmm (and turning on -msse4.2 and -mbmi2):
benchmarking CIDR Inclusion/reserved
time 887.0 ns (880.9 ns .. 893.6 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 883.1 ns (878.8 ns .. 892.1 ns)
std dev 19.85 ns (12.08 ns .. 32.64 ns)
variance introduced by outliers: 28% (moderately inflated)
It appears that this attempted optimization does not pay its way.
Dang, that sucks. Does that most recent benchmark include the handwritten Cmm?
It does include the handwritten Cmm (which requires a patch to GHC to even build). It is unfortunate, since I expected the non-branching approach to do well, but at least in microbenchmarks it just doesn't pay off. However, in https://github.com/andrewthad/haskell-ip/pull/63, I've come up with something that doubles the performance of this function and compiles without fancy hacks.
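For illustration only (this is not the code from that PR; the leadingOctet helper and the block list are made up), a plain branching formulation of a reserved-range check looks something like this, and it avoids computing a pile of independent ctz results at once:

import Data.Bits (unsafeShiftR)
import Data.Word (Word32)

-- Leading octet of an IPv4 address stored big-endian in a Word32.
leadingOctet :: Word32 -> Word32
leadingOctet w = w `unsafeShiftR` 24

reservedSimple :: Word32 -> Bool
reservedSimple w = case leadingOctet w of
  0   -> True                                -- 0.0.0.0/8
  10  -> True                                -- 10.0.0.0/8
  127 -> True                                -- 127.0.0.0/8
  172 -> w >= 0xAC100000 && w < 0xAC200000   -- 172.16.0.0/12
  192 -> w >= 0xC0A80000 && w < 0xC0A90000   -- 192.168.0.0/16
  _   -> False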
Closing this PR since this appears to be a dead end.
Add tests and benchmarks for reserved range testing, and remove the check for 255.255.255.255 since it is redundant.
This will resolve https://github.com/andrewthad/haskell-ip/issues/60 when completed.
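As a rough sketch of the kind of spot-checks such tests could include (the Net.IPv4.fromOctets and Net.IPv4.reserved names are assumed from the haskell-ip API, and the specific cases are illustrative):

import qualified Net.IPv4 as IPv4

main :: IO ()
main = do
  assert "10.0.0.1 is reserved"    (IPv4.reserved (IPv4.fromOctets 10 0 0 1))
  assert "192.168.1.1 is reserved" (IPv4.reserved (IPv4.fromOctets 192 168 1 1))
  assert "8.8.8.8 is not reserved" (not (IPv4.reserved (IPv4.fromOctets 8 8 8 8)))
  where
    -- Minimal check helper: print on success, die with the label on failure.
    assert label ok =
      if ok then putStrLn ("ok: " ++ label) else error ("failed: " ++ label)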