The fastest languages are cheating

983 commented 4 years ago

When compiling Nim to C++, this is the simplified result:

#include <stdint.h>
#include <stdio.h>

uint64_t fib(uint64_t n){
    if (n > 1){
        uint64_t a = fib(n - 1);
        uint64_t b = fib(n - 2);

        return a + (b);
    }
    return 1;
}

extern "C" void foo(){
    uint64_t result = fib(46);

    printf("%lu\n", result);
}

int main(){
    foo();
}

Nothing wrong here yet. But when compiled with g++, the disassembled program will look similar to this:

; Run like so:
; nasm -felf64 fib.asm && gcc -no-pie fib.o -o fib && time ./fib
global main
extern printf

section .text

fib:
        cmp     rdi, 1
        mov     eax, 1
        ja      L15
        ret
L15:
        push    r15
        push    r14
        mov     r14, -1
        push    r13
        push    r12
        lea     r12, [rdi-2]
        push    rbp
        push    rbx
        mov     rbp, rdi
        xor     ebx, ebx
        sub     rsp, 8
L3:
        cmp     rbp, 2
        jne     L9
        cmp     r12, 1
        ja      L7
        mov     r13d, 1
L5:
        lea     rax, [r13+1+rbx]
L1:
        add     rsp, 8
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret
L7:
        mov     rdi, r14
        sub     r12, 4
        mov     rbp, -2
        call    fib
        lea     rbx, [rax+1+rbx]
L9:
        lea     r15, [rbp-3]
        mov     rdi, r12
        call    fib
        mov     rdi, r15
        mov     r13, rax
        call    fib
        add     r13, rax
        cmp     r12, 1
        jbe     L5
        mov     rdi, r15
        sub     rbp, 4
        sub     r12, 4
        call    fib
        add     r13, rax
        add     rbx, r13
        cmp     rbp, 1
        ja      L3
        lea     rax, [rbx+1]
        jmp     L1

foo:
        push    r12
        push    rbp
        mov     edi, 40
        push    rbx
        call    fib
        mov     edi, 39
        mov     rdx, rax
        call    fib
        mov     edi, 38
        mov     rbp, rax
        call    fib
        mov     edi, 37
        lea     r11, [rax+rbp]
        mov     rcx, rax
        call    fib
        mov     edi, 36
        add     rcx, rax
        mov     rsi, rax
        lea     r10, [r11+rcx]
        call    fib
        add     rsi, rax
        mov     rbx, rax
        mov     edi, 35
        add     rcx, rsi
        lea     r9, [r10+rcx]
        call    fib
        add     rbx, rax
        mov     edi, 34
        mov     r12, rax
        add     rsi, rbx
        add     rcx, rsi
        lea     r8, [r9+rcx]
        call    fib
        add     rdx, rbp
        mov     edi, format
        add     rdx, r12
        add     rax, rdx
        add     rax, r11
        add     rax, rbx
        add     rax, r10
        add     rax, rsi
        pop     rbx
        add     rax, r9
        pop     rbp
        pop     r12
        add     rax, rcx
        lea     rsi, [rax+r8*2]
        xor     eax, eax
        jmp     printf
main:
        sub     rsp, 8
        call    foo
        xor     eax, eax
        add     rsp, 8
        ret

format: db  "%lu", 10, 0

Apparently, fib is only called for the parameters 34, 35, 36, 37, 38, 39, 40 and then the results are added up. This is not a fair comparison, because Nim is not doing the same computations as the other languages. I believe that Nim should be moved to the section Optimized. Otherwise, one would have to argue how much unrolling is ok and how much is too much, but there is no good answer for that.

Likewise, C and C++ partially unroll the fib function calls, although not as far. I have not checked the Fortran and Cython binaries, but they are probably doing the same thing.

It was suggested in another issue that the parameter n should be configurable to prevent optimization. However, that is insufficient since it still allows to unroll the fib function for small values of n. The only solution seems to be to inspect the generated assembly code

Here is a fair assembly implementation to compare against. If a program is faster, it is likely cheating in some way or another.

; Run like so:
; nasm -felf64 fib.asm && gcc -no-pie fib.o -o fib && time ./fib
global main
extern printf
section .text

fib:
    mov rax, 1
    dec rdi
    jle done
    push rdi
    call fib
    pop rdi
    push rax
    dec rdi
    call fib
    pop rdx
    add rax, rdx
done:
    ret

main:
    mov rdi, 46
    call fib

    mov rdi, format
    mov rsi, rax
    xor rax, rax
    call printf
    xor rax, rax
    ret

format: db "%lu", 10, 0

drujensen commented 4 years ago

@983 amazing investigative work on this. It's interesting to see how the compiler is optimizing the recursive calls. I am open to suggestions on how to improve the fairness of the benchmark.

983 commented 4 years ago

I can see the following options:

Look at the generated assembly code and if there is something fishy in there, move the language from the Natively compiled, statically typed to the Optimized table.
Specify the -fno-inline-small-functions flag when compiling with gcc.
Prefix functions with __attribute__((noinline)).

The two last options do not completely fix the problem, because gcc will still inline the last recursive call. To get rid of that, also specify -fno-optimize-sibling-calls.

The Nim, fortran and cython benchmark can be fixed the same way:

nim cpp -d:release --passC:-fno-inline-small-functions --passC:-fno-optimize-sibling-calls fib.nim
gfortran -fno-inline-small-functions -fno-optimize-sibling-calls -O3 -o fib fib.f03
cython --embed -o fib.pyx.c fib.pyx && gcc -fno-inline-functions -fno-optimize-sibling-calls -O3 -o fib fib.pyx.c

In case that this behavior should change in the future, here is my process to determine which flags were responsible:

Compile with decreasing levels of optimization until inlining does not happen anymore (https://gcc.godbolt.org/z/9AWZMN is helpful for that)
Find the responsible -f<flag> by trial and error until gcc optimizes too much. See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for the list of compiler flags per optimization level.
Finally, compile with -O3 -fno-<flag>.

Nevertheless, I think a compiler which is very good at inlining should still be honored in some way. Maybe the running time with all optimizations enabled could be included in another column or table?

xyproto commented 4 years ago

The average time 10 developers use to find the compiler flags to go from "regularly compiled code" to "optimized code" could be included in the table. This would make the results for the optimized Nim code shine.

drujensen commented 4 years ago

I think the best solution is to implement #85 and change the exmamples to take a parameter . This should prevent some of the optimizations made here but still allow for languages to shine that have better inlining. For now, I have added the flag no-inline-small-functions.

francogrex commented 3 years ago

In a way I both agree and disagree with the comment: compiler optimization should not be restricted, if a language allows it, so much the better. I think however the optimized folder contains code that is "human" optimized (see my code with the lisp memoized function I opened in the issues); and that may be cheating. Maybe fibonacci func is not the best to benchmark for that specific reason?

drujensen commented 2 years ago

I have moved back to allowing compiler optimizations. However, I will only allow compiler flags that are deemed safe and "release" quality i.e. -O3.

pcordes commented 1 year ago

Maybe fibonacci func is not the best to benchmark for that specific reason?

Naive recursive Fibonacci, O(Fib(N)) is definitely not representative of most code. The fact that there's an O(N) algorithm (iterative x+=y; y+=x; or x+=y; swap(x,y)) means that there's room for sufficiently clever optimizers to find big improvements.

If you choose naive recursive Fibonacci as your test function in the first place, surely the point is to test how well compilers convert inefficient double-recursion to something like a loop containing single recursion, or even something better. And function-call overhead since a single addition is a tiny amount of work for a single function call.

Doubly-recursive functions for traversing binary trees are common in real programs, but they differ in not having redundancy / common-subexpressions with each other. So that recursion pattern is one that's important for compilers to know how to optimize some.

If you want to know how a language performs in general for a variety of computational problems, naive Fibonacci isn't a good choice, but it is an interesting and well-known problem to benchmark. (https://stackoverflow.com/questions/76611371/execution-time-of-recursive-fibonacci-function-is-slower-in-c-than-equivalent-ni is an example of someone being puzzled by Nim being faster than C on this benchmark.)

Perhaps some text somewhere pointing out that apparently trivial differences can lead to major compiler optimizations could avoid confusing people?

drujensen / fib

The fastest languages are cheating #119