bebbo / gcc

Bebbo's gcc-6-branch for m68k-amigaos
GNU General Public License v2.0

LOOP construct good with -Os but not optimal with -O2 -O3 -Ofast #217

Closed GunnarVB closed 8 months ago

GunnarVB commented 8 months ago

Hello Bebbo, I hope you are OK.

Maybe this was reported before.

I found that the basic loop construct looks very good when compiled with -Os but not optimal when compiled with any other -O mode.

C Example:

void memclr (short length, char * ptr)
{
 for(;length--;){
   *ptr++= 0;
 }
}

Compiled with -mregparm=2 -Os:

_memclr:
        jra .L2
.L3:
        clr.b (a0)+
.L2:
        dbra d0,.L3
        rts

Good result! 4 instructions total. Bra to DBRA - this is both short and fast.
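To make the control flow of the -Os listing explicit, here is a minimal C sketch of it. The function names `memclr_dbra_model` and `check_memclr_dbra_model`, the `uint16_t` counter, and the returned store count are assumptions added for illustration; they are not part of the compiler output. The sketch relies on DBRA decrementing the low word of the counter and branching unless the result is -1, which is why entering the loop at the DBRA handles length 0 without any extra test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the -Os listing (not code from the compiler).
   d0 stands for the counter register, ptr for a0. */
size_t memclr_dbra_model(uint16_t length, char *ptr)
{
    uint16_t d0 = length;
    size_t stores = 0;          /* counts clr.b executions, for checking only */

    goto L2;                    /* jra .L2: enter the loop at the DBRA */
L3:
    *ptr++ = 0;                 /* clr.b (a0)+ */
    stores++;
L2:
    d0 = (uint16_t)(d0 - 1);    /* dbra d0,.L3: decrement the word ... */
    if (d0 != 0xFFFF)           /* ... and branch unless it became -1 */
        goto L3;
    return stores;              /* rts */
}

/* Small self-check: length 0 stores nothing, length 5 clears 5 bytes. */
void check_memclr_dbra_model(void)
{
    char buf[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    assert(memclr_dbra_model(0, buf) == 0);
    assert(buf[0] == 1);
    assert(memclr_dbra_model(5, buf) == 5);
    assert(buf[4] == 0 && buf[5] == 1);
}
```

Calling `check_memclr_dbra_model()` from a test program exercises both the zero-length and the normal case.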

Not optimal result when compiled with -O2, -O3, or -Ofast:

_memclr:
        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1
.L5:
        clr.b (a0)+
        dbra d1,.L5
.L1:
        rts

7 instructions total, with a 4-instruction header instead of 1 BRA. This result is not good. The BRA to the DBRA was much better.

Hello Bebbo, do you know a way to enable the BRA to the DBRA in all -O options? This would be very good for code size and for performance on all 68K members!

Many thanks in advance

regards Gunnar

bebbo commented 8 months ago

You might consider using -fno-tree-ch to avoid the duplication of loop conditions.

See http://franke.ms/cex/z/W9s3rb
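For readers unfamiliar with the flag: -ftree-ch is GCC's loop header copying pass, which is on by default at -O1 and higher but not at -Os. The following is a hand-written C illustration of the transformation, not actual compiler output; the function names `memclr_plain`, `memclr_rotated`, and `check_memclr_forms` are made up for this sketch. It shows how the exit test ends up duplicated in front of the loop, which corresponds to the extra header instructions in the -O2 listing.

```c
#include <assert.h>

/* Loop as written in the source: the exit test sits at the top. */
void memclr_plain(short length, char *ptr)
{
    while (length--)
        *ptr++ = 0;
}

/* Illustrative shape after loop header copying: the exit test is
   duplicated before the loop, and the loop tests at the bottom. */
void memclr_rotated(short length, char *ptr)
{
    if (length-- == 0)          /* copied header: guards the zero-length case */
        return;
    do {
        *ptr++ = 0;
    } while (length-- != 0);    /* original test, now at the loop end */
}

/* Self-check: both forms clear the same bytes for lengths 0..6. */
void check_memclr_forms(void)
{
    for (short n = 0; n <= 6; n++) {
        char a[8], b[8];
        for (int i = 0; i < 8; i++)
            a[i] = b[i] = 1;
        memclr_plain(n, a);
        memclr_rotated(n, b);
        for (int i = 0; i < 8; i++) {
            assert(a[i] == b[i]);
            assert(a[i] == (i < n ? 0 : 1));
        }
    }
}
```

With -fno-tree-ch the compiler keeps the single-test form, which is why the output moves closer to the -Os result.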

GunnarVB commented 8 months ago

-Os creates:

        jra .L2
.L3:
        clr.b (a0)+
.L2:
        dbra d0,.L3

This is good. The LOOP is only 2 instructions.

The BRA has nothing to predict. Branch prediction is not needed here, and this is optimally fast and small.

The -O2 version:

_memclr:
        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1
.L5:
        clr.b (a0)+
        dbra d1,.L5

This needs 5 instructions instead of 2 for the loop! The beq is "unsure" and needs to be predicted, which can cause misprediction. This code is really not optimal.

Using your proposed extra "flag":

_memclr:
        dbra d0,.L3
.L6:
        rts
.L3:
        clr.b (a0)+
        dbra d0,.L3
        jra .L6

We have 3 instructions instead of 2. This makes the code 10 bytes instead of 6 bytes. The first DBRA would by default be predicted on 68k as a backward-taken branch, but here the default path is forward ... this is not optimal. Yes, this option is less bad than the current -O2 output, but really not as good as -Os.

Could the approach that -Os takes always be used?

GunnarVB commented 8 months ago

        move.w d0,d1
        subq.w #1,d1
        tst.w d0
        jeq .L1


Next question: Why is there a TST instruction in this code? The TST is not needed. Could it be done like this?

        move.w d0,d1
        jeq .L1
        subq.w #1,d1
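A rough way to sanity-check this idea is a small C model of the two headers. This is only a sketch under stated assumptions: the names `header_with_tst`, `header_without_tst`, and `check_headers_agree` are hypothetical, only the Z flag is modelled, and it relies on MOVE.W to a data register setting Z from the moved value. In the shorter variant d1 is not decremented when the branch to .L1 is taken, which is harmless only because d1 is not used on that path.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the -O2 header. Returns 1 if "jeq .L1" is taken. */
int header_with_tst(uint16_t d0, uint16_t *d1_out)
{
    uint16_t d1 = d0;               /* move.w d0,d1 */
    d1 = (uint16_t)(d1 - 1);        /* subq.w #1,d1 (overwrites the flags) */
    int z = (d0 == 0);              /* tst.w d0: Z set if d0 is zero */
    *d1_out = d1;
    return z;                       /* jeq .L1 */
}

/* Model of the proposed header without TST. Returns 1 if "jeq .L1" is taken. */
int header_without_tst(uint16_t d0, uint16_t *d1_out)
{
    uint16_t d1 = d0;               /* move.w d0,d1 */
    int z = (d1 == 0);              /* Z set by the move itself */
    if (!z)
        d1 = (uint16_t)(d1 - 1);    /* subq.w #1,d1, only on the fall-through path */
    *d1_out = d1;
    return z;                       /* jeq .L1 */
}

/* Exhaustive check over all 16-bit values: same branch decision,
   and the same loop counter whenever the loop is actually entered. */
void check_headers_agree(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        uint16_t a, b;
        int ta = header_with_tst((uint16_t)v, &a);
        int tb = header_without_tst((uint16_t)v, &b);
        assert(ta == tb);
        if (!ta)
            assert(a == b);
    }
}
```

In this model both headers make the same branch decision for every input, which supports the suggestion that the TST is redundant here.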