major performance degration compared to previous code

GunnarVB commented 3 years ago

Hello,

Please allow me to report a performance slow down by several times for some code, when comparing to GCC 2.9

Example Code:

int ArrayFill(int *array, int len) {
    int i = 0;
    for(; i < len; ++i) {
        array[i] = 50;
    }
    return 1;
}

** ASM created by GCC 6.5 -mregparm -m68080 -O3 -fomit-frame-pointer

_ArrayFill:
        tst.l d0
        jle .L6
        lea (a0,d0.l*4),a1
.L3:
        moveq bebbo/amiga-gcc#50,d0
        move.l d0,(a0)+
        cmp.l a0,a1
        jne .L3
.L6:
        moveq bebbo/amiga-gcc#1,d0
        rts

** The performance problem is clearly visible. We have 4 instruction in the workloop.

The 1st instruction the MOVEQ is totally wrong in the loop. This instrucion should be outside the loop!
the Loop body uses CMP An,An. This is very bad idea. The loop creates 4 instrucion of which all 4 are fully depending. the MOVEQ updates D0, on which the 2nd instruction depends. the 2nd instruction updates A0, on which the 3rd instruction depends the CMP the 3rd instruction updates the flags on which the 4th instruction depends. Because of this dependency chain this code has to be very slow.

LOOP BODY

.L3:
        moveq #50,d0
        move.l d0,(a0)+
        cmp.l a0,a1
        jne .L3

**

Proposed Solution: 1) The MOVEQ needs be moved outside the Loop. 2) The LOOP should us a construct which counts down like SUBQ.L #1,D1 ; BNE Using a Loopcounter which counts down will remove the dependancy issue with the work pointer and will allow parallel execution.

bebbo commented 3 years ago

The LOOP should us a construct which counts down like SUBQ.L #1,D1 ; BNE Using a Loopcounter which counts down will remove the dependancy issue with the work pointer and will allow parallel execution.

Changing the count direction would change the order how the memory is set. This is not a valid approach for a C-compiler.

GunnarVB commented 3 years ago

I think you misunderstood what I said:

GOOD CODE will look like this

move.l #length,D0
.L3:
move.l #50,(a0)+
subq.l   #1,d0
jne .L3

** Also GCC 2.9 will create code like this for 68k. And GCC 6.5 will create code like this also for other targets like PPC. It will for GCC correctly avoid thew dependency of Pointer and counter!

The best code is when you Make the PTR access ++ in the PTR register and independant from this and in parallel count the workloop down in another Data register

bebbo commented 3 years ago

guess you have to rewrite your code:

int ArrayFill(int *array, int len) {
    int i = 0, j = len - 1;
    for(; j >= 0; ++i, --j) {
        array[i] = 50;
    }
    return 1;
}

this yields:

_ArrayFill:
        moveq #-1,d0
        add.l (8,sp),d0
        jmi .L6
        move.l (4,sp),a0
.L3:
        moveq bebbo/amiga-gcc#50,d1
        move.l d1,(a0)+
        dbra d0,.L3
        clr.w d0
        subq.l bebbo/amiga-gcc#1,d0
        jcc .L3
.L6:
        moveq bebbo/amiga-gcc#1,d0
        rts

GunnarVB commented 3 years ago

.L3:
        moveq bebbo/amiga-gcc#50,d1
        move.l d1,(a0)+
        dbra d0,.L3

Hallo Stefan,

Der MOVEQ kosten jeden Loopdurhcgang einen unnötigen Cycle Der Befehl sollte aus dem LOOP raus. Also gut wäre den MOVE vor den LOOP, oder gar keinen MOVEQ und stattdessen MOVE.L #im,(A0)+ OK?

Wegen den Befehlen zum Loop machen. GCC 2.9 hat den LOOP automatisch sehr gut umgesetzt:

MOVE.L #Länge,D0
LOOP:
SUBQ.L bebbo/amiga-gcc#1,D0
bne LOOP

Der code ist sehr gut. Das ist schnell und auch ohne Abhängigkeit zum Adr-Ptr. Es wäre gut wenn GCC 6.5 das genauso machen würde.

SUBQ.L #1,D0
BNE.s LOOP

bebbo commented 3 years ago

the loop code of 2.95 is

L12:
        moveq bebbo/amiga-gcc#50,d1
        movel d1,a0@+
        subql bebbo/amiga-gcc#1,d0
        jne L12

pulling the constant out is not that easy, since there are passes which join the insn - and on the Amiga it gets split after all loop stuff is done...

... it's possible to use the constant directly:

.L11:
        move.l bebbo/amiga-gcc#50,(a0)+
        cmp.l a0,a1
        jne .L11

or if you change the loop to count down:

.L3:
        move.l bebbo/amiga-gcc#50,(a0)+
        dbra d0,.L3
        clr.w d0
        subq.l bebbo/amiga-gcc#1,d0
        jcc .L3

GunnarVB commented 3 years ago

... it's possible to use the constant directly:
.L11:
move.l #50,(a0)+

Yes this will be much better.

or if you change the loop to count down:

You can call my crazy but I think it would be nice if GCC improves all programs. And if people not need to change each of their loop per hand in all programs and all operating systems they might want to compile. ;-)

What do you think can we fix this? GCC 2.9 did do this code for the Loop

SUBQ.L #1,D0
BNE

Can we get the same back? This code is better as its not depending on the PTR and can be executed without dependancy

bebbo commented 3 years ago

SUBQ.L #1,D0
BNE
Can we get the same back? This code is better as its not depending on the PTR and can be executed without dependancy

negative. Guess you have to open a real gcc issue for that - and no other platform seems to need that...

=> rewrite the loop

GunnarVB commented 3 years ago

Guess you have to open a real gcc issue for that - and no other platform seems to need that... What makes you think this?

Everybody will agree that creating an unneeded register dependency is hurting every platform. This is just logical.

If you look at the code generated by GCC for PowerPC then you see that is will not use a CMP on the PTR but will use PTR and Countdown register fully independent.

By using the "(A0)++" operation and the "CMP A0", the compiler creates a register dependency which the C high level code did not have! Or to say it clearly: The created code got much worse to the C original.

Would it be possible for GCC to model the 68K pipeline and for GCC to just "see" the dependency? Then GCC would see this is bad code.

Or would it be simpler for you to cheat and make the combo of "SUBq.l #0,D0; BNE" a little cheaper than the "CMPA BNE "

bebbo commented 3 years ago

this is the loop body for x86:

.L3:
        movl    $50, (%rdi)
        addq    $4, %rdi
        cmpq    %rax, %rdi
        jne     .L3
.L2:

And the reason is, that gcc only introduces a count down tmp variable for const int limits.

bebbo commented 3 years ago

the code for arm:

.L3:
        str     w3, [x0, x2, lsl 2]
        add     x2, x2, 1
        cmp     w1, w2
        bgt     .L3

GunnarVB commented 3 years ago

Or would it be simpler for you to cheat and make the combo of "SUBq.l #0,D0; BNE" a little cheaper than the "CMPA BNE "

bebbo commented 3 years ago

Or would it be simpler for you to cheat and make the combo of "SUBq.l #0,D0; BNE" a little cheaper than the "CMPA BNE "

won't work... sorry

bebbo commented 3 years ago

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22041

bebbo commented 3 years ago

I found an old gcc version (4.8.5) for powerpc which did it: https://godbolt.org/z/o9Knav

=> https://franke.ms/cex/z/93a7cb

the loop header can be improved... but the loop body?

GunnarVB commented 3 years ago

well done!

This is a big improvement!

bebbo commented 3 years ago

fixed

bebbo / gcc

major performance degration compared to previous code #147