My impulse would be to report in units of nanoseconds per iteration (clock cycles per iteration would be even better, but I don't think we can portably get the clock speed?) since "20 ns per append" is more immediately meaningful than "0.02 s per million appends".
But the test looks right.
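For reference, here's a minimal sketch of that kind of reporting, assuming POSIX `clock_gettime()`; the `do_append()` placeholder and the iteration count are made up for illustration, not the actual test code:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the operation being benchmarked
 * (e.g. an elasticarray append). */
static volatile size_t sink = 0;

static void
do_append(void)
{

	sink++;
}

int
main(void)
{
	const size_t iters = 1000000;
	struct timespec t0, t1;
	double elapsed_ns;
	size_t i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		do_append();
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Report nanoseconds per iteration rather than total seconds. */
	elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	    (double)(t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per append\n", elapsed_ns / (double)iters);

	return (0);
}
```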
Updated with a rebase; here's a new plot.
@gperciva This is really nice stuff! Thanks! (I closed the 592 PR but let me know if there is anything I can try out to help!)
I did some measurements on my machine too, and in every case the changes from 592 are slower. The only difference I see is that the `nrec * reclen` expression is reused further down when calculating `nsize` in the old version, which is lost in the new version; other than that, I don't know why the 'old' version would be faster. What do you think makes the old version faster and the new one slower?
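For anyone following along, here's a rough sketch of the two checks being compared, reconstructed from the `if` conditions visible in the assembly comments below; `EA_size` stands in for `EA->size`, and the rest of `elasticarray_append()` is omitted:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Existing ("old") check: two comparisons. */
static int
check_old(size_t nrec, size_t reclen, size_t EA_size)
{

	if ((nrec > SIZE_MAX / reclen) ||
	    (nrec * reclen > SIZE_MAX - EA_size)) {
		errno = ENOMEM;
		return (-1);
	}

	/* (In the real code, nrec * reclen is also reused later
	 * when computing nsize.) */
	return (0);
}

/* Proposed ("new") check from PR 592: one comparison, but it needs
 * a division. */
static int
check_new(size_t nrec, size_t reclen, size_t EA_size)
{

	if (nrec > ((SIZE_MAX - EA_size) / reclen)) {
		errno = ENOMEM;
		return (-1);
	}

	return (0);
}
```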
LOLMAO! I was vaguely curious, so I dug into it. I'm so glad that I did!
1) I generated assembly with `llvm-objdump --no-leading-addr -S elasticarray.o > old.txt` (I did the same for the post-modified code, stored in `new.txt`).
2) The assembly in `new.txt` looks like you'd expect:
; if (nrec > ((SIZE_MAX - EA->size) / reclen)) {
4c 89 e8 movq %r13, %rax
48 f7 d0 notq %rax
31 d2 xorl %edx, %edx
48 f7 f1 divq %rcx
4c 39 c0 cmpq %r8, %rax
73 16 jae 0x1dd <elasticarray_append+0x3d>
; errno = ENOMEM;
Namely, it sets up some stuff, then does a division (`divq`), a compare (`cmpq`), and a jump (`jae`).
3) But `old.txt` doesn't contain any division in the `<elasticarray_append>` portion:
; if ((nrec > SIZE_MAX / reclen) ||
48 89 c8 movq %rcx, %rax
48 f7 e2 mulq %rdx
70 15 jo 0x1d1 <elasticarray_append+0x31>
49 89 cf movq %rcx, %r15
48 89 fb movq %rdi, %rbx
48 8b 3f movq (%rdi), %rdi
; (nrec * reclen > SIZE_MAX - EA->size)) {
4d 0f af fe imulq %r14, %r15
49 89 fd movq %rdi, %r13
4d 01 fd addq %r15, %r13
; if ((nrec > SIZE_MAX / reclen) ||
73 23 jae 0x1f4 <elasticarray_append+0x54>
; errno = ENOMEM;
The important (and hilarious) stuff is the 2nd and 3rd lines of assembly:
48 f7 e2 mulq %rdx
70 15 jo 0x1d1 <elasticarray_append+0x31>
Namely, even though the C code is `nrec > SIZE_MAX / reclen` (written that way to avoid overflowing), the assembly is deliberately doing `nrec * reclen`. But if that overflows, the overflow flag gets set, and the `jo` jump is "jump if overflow".
That trick allows the CPU to do these checks with only multiplication and addition, avoiding division completely.
(NB: I was testing with clang 16 on amd64. Results may differ with other compilers and on other architectures. That said, this type of optimization with unsigned integers is pretty simple as far as compiler tricks go, so I'd expect it to be used on any CPU architecture that has an overflow flag.)
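For illustration, here's roughly what that transformation is equivalent to if written out explicitly in C, using the GCC/Clang `__builtin_mul_overflow()` builtin (the `would_overflow()` helper and the `EA_size` parameter are made up for this sketch; it's not a suggested patch):

```c
#include <stddef.h>
#include <stdint.h>

/* Roughly what the compiler turns the old check into: do the
 * multiplication, and treat a wrapped result (overflow flag set,
 * i.e. the "jo" path) the same way the "nrec > SIZE_MAX / reclen"
 * comparison would have been treated. */
static int
would_overflow(size_t nrec, size_t reclen, size_t EA_size)
{
	size_t nbytes;

	if (__builtin_mul_overflow(nrec, reclen, &nbytes))
		return (1);	/* mulq overflowed -> jo taken. */
	if (nbytes > SIZE_MAX - EA_size)
		return (1);	/* total size would exceed SIZE_MAX. */
	return (0);
}
```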
Hmm, nice! That means the compiler is able to recognize these expressions because of their simplicity, and because `SIZE_MAX` is exactly the maximum value of `size_t` (whatever its width is), which makes sense too. I guess this is a general optimization opportunity for the compiler anyway, i.e., expressions that are really equivalent to overflow checks can be compiled to a multiplication followed by a check of the overflow flag.
Thanks for digging into this. I learned some good new things!
As an example of the visualization, here's what the proposed changes to the "check overflow" for `elasticarray_append()` and `elasticarray_shrink()` from https://github.com/Tarsnap/tarsnap/pull/592 look like:
Boxplots of 10 repetitions; the timing values are extremely close, so we don't see much of a "box" here!
(This is with `clang -O2 -g`; the proposed changes are clearly slower. I'm surprised they make that much of a difference, though!)