Tarsnap / libcperciva

BSD-licensed C99/POSIX library code shared between tarsnap, scrypt, kivaloo, spiped, and bsdiff.
Other
112 stars 11 forks source link

Elasticarray perftest #502

Closed gperciva closed 1 year ago

gperciva commented 1 year ago

As an example of the visualization, here's what the proposed changes to the "check overflow" for elasticarray_append() and elasticarray_shrink() from https://github.com/Tarsnap/tarsnap/pull/592:

elasticarray-perftest

Boxplots of 10 repetitions; the timing values are extremely close, so we don't see much of a "box" here!

(This is with clang -O2 -g; the proposed changes are clearly slower. I'm surprised they makes that much of a difference, though!)

cperciva commented 1 year ago

My impulse would be to report in units of nanoseconds per iteration (clock cycles per iteration would be even better, but I don't think we can portably get the clock speed?) since "20 ns per append" is more immediately meaningful than "0.02 s per million appends".

But the test looks right.

gperciva commented 1 year ago

Updated with a rebase; here's a new plot.

elasticarray-perftest

dorjoy03 commented 1 year ago

@gperciva This is really nice stuff! Thanks! (I closed the 592 PR but let me know if there is anything I can try out to help!)

I did some measurements on my machine too and in every case the changes from 592 are slower. I only see that the 'nrec * reclen' expression is reused down the line when calculating nsize in the old version which is lost in the new version but other than that I don't know what is the reason for the 'old' version to be faster. What do you think makes the old version faster and the new one slower?

gperciva commented 1 year ago

LOLMAO! I was vaguely curious, so I dug into it. I'm so glad that I did!

1) I generated assembly with llvm-objdump --no-leading-addr -S elasticarray.o > old.txt

(I did the same for the post-modified code, stored in `new.txt`)

2) The assembly in new.txt looks like you'd expect:

;       if (nrec > ((SIZE_MAX - EA->size) / reclen)) {
 4c 89 e8                               movq    %r13, %rax
 48 f7 d0                               notq    %rax
 31 d2                                  xorl    %edx, %edx
 48 f7 f1                               divq    %rcx
 4c 39 c0                               cmpq    %r8, %rax
 73 16                                  jae     0x1dd <elasticarray_append+0x3d>
;               errno = ENOMEM;

Namely, it sets up some stuff, then does a division divq, compare cmpq, and jump jae.

3) But old.txt doesn't contain any division in the <elasticarray_append> portion:

;       if ((nrec > SIZE_MAX / reclen) ||
 48 89 c8                               movq    %rcx, %rax
 48 f7 e2                               mulq    %rdx
 70 15                                  jo      0x1d1 <elasticarray_append+0x31>
 49 89 cf                               movq    %rcx, %r15
 48 89 fb                               movq    %rdi, %rbx
 48 8b 3f                               movq    (%rdi), %rdi
;           (nrec * reclen > SIZE_MAX - EA->size)) {
 4d 0f af fe                            imulq   %r14, %r15
 49 89 fd                               movq    %rdi, %r13
 4d 01 fd                               addq    %r15, %r13
;       if ((nrec > SIZE_MAX / reclen) ||
 73 23                                  jae     0x1f4 <elasticarray_append+0x54>
;               errno = ENOMEM;

The important (and hilarious) stuff is the 2nd and 3rd lines of assembly:

 48 f7 e2                               mulq    %rdx
 70 15                                  jo      0x1d1 <elasticarray_append+0x31>

Namely, even though the C code is nrec > SIZE_MAX / reclen (so written to avoid overflowing), the assembly code is deliberately doing nrec * reclen. But if that overflows, the overflow flag gets set. The jump jo is "Jump if overflow".

That trick allows the CPU to do these checks with only multiplication and subtraction; avoiding division completely.

(NB: I was testing with clang 16 on amd64. Results may differ with other compilers, and on other architectures. That said, this type of optimization with unsigned integers is pretty simple as far as compile tricks go, so I'd expect it to be used on any cpu architecture which has an overflow flag.)

dorjoy03 commented 1 year ago

Namely, even though the C code is nrec > SIZE_MAX / reclen (so written to avoid overflowing), the assembly code is deliberately doing nrec * reclen. But if that overflows, the overflow flag gets set. The jump jo is "Jump if overflow".

hmm nice! That means the compiler is able to recognize these expressions because of the simplicity and SIZE_MAX being the value equal to whatever the width of size_t is, which makes sense too. I guess this is a general usecase for optimization for the compiler anyway i.e., expressions that really are equivalent to 'overflow' checks can be done by multiplication and then checking overflow flag.

Thanks for digging into this. I learned some good new things!