Sub optimal code (more bitfield fun)

nonarkitten commented 5 years ago

There are TWO issues here.

The first which 2.95 get's wrong and 6.5 get's correct is the weird, completely unnecessary move to aX before the second move since the bfextu guarantees that the high-word is all cleared. Such as this:

L12:
        movew d3,a0
        movew a2@(2,a0:l:4),d3

However, in all OTHER regards, 2.95 is still making substantially better code here.

nonarkitten commented 5 years ago

Also this (lines 16-18):

        movel d1,d0
        moveq #28,d6
        lsrl d6,d0

Is probably a tad slower than this:

        bfextu d1{#0:#4},d0

It's also clearly breaking order, since the code was p0, p1, p2 and p3. The compiler ERRONEOUSLY decided to do p1, p2, p3 and then p0 immediately before the EA calculation which will cause a stall. The bfextu SHOULD go before line 13.

nonarkitten commented 5 years ago

Hmmm, same with this:

L10:
        movew d1,d4
        andw #15,d4

Versus

        bfextu d1{#28,#4},d4

bebbo commented 5 years ago

move.w as andi taking 1 cycle on 68060 bfextu takes 6 cycles

bebbo commented 5 years ago

moveq takes 1 cycle lsr takes 1 cycle

=> all varianst with 2 or 3 insns are faster than bfextu

but with -Os the shorter one should be preferred...

bebbo commented 5 years ago

plus you should not use short ints.

try

    uint32_t pixel1,pixel2,pixel3,pixel4;

and be surprised.

nonarkitten commented 5 years ago

lol, can't eliminate the condition.

nonarkitten commented 5 years ago

compiling for -Os

nonarkitten commented 5 years ago

And sorry, used to 68080 timings. Bit-fields on the Apollo are single cycle.

nonarkitten commented 5 years ago

So changing the -m to 68080 (consider my mind blown), it switches back to bfextu BUT still has the and...?

        bfextu d0{#0:#4},d4
        moveq #15,d6
        and.l d6,d4
        move.l (a0,d4.l*4),d4

nonarkitten commented 5 years ago

I think the code should look more like this -- unless I'm missing something... https://paste.ofcode.org/fGyJGmyNSEEVHfr3B2DuRL

nonarkitten commented 5 years ago

Okay, realized I was making assumptions about the palette structure the compiler couldn't know. This is a lot better but there's still a lot of extraneous bit-masking that's pointless after the bfextu command (all 28 other bits will always be zero).

https://tinyurl.com/y2hur2qp

MBeijer commented 5 years ago

I assume -m68080 is mostly a stub, perhaps the Apollo team can work together with Bebbo with proper optimization for 68080? Assuming the compiler to make the best possible code for 68080 when it tries to make the best possible code for the older CPU's doesn't end well. Might as well do inline ASM functions then?

Maybe in the future a -m68020-080 option could be available for backwards compatibility.

On Wed, Feb 13, 2019 at 1:09 AM nonarkitten notifications@github.com wrote:

Okay, realized I was making assumptions about the palette structure the compiler couldn't know. This is a lot better but there's still a lot of extraneous bit-masking that's pointless after the bfextu command (all 28 other bits will always be zero).

https://tinyurl.com/y2hur2qp

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bebbo/gcc/issues/67#issuecomment-462997162, or mute the thread https://github.com/notifications/unsubscribe-auth/AIXahRjG0_lX1WgBn-qPIxnyslhV0BJMks5vM1fKgaJpZM4a4EvT .

bebbo commented 5 years ago

EDIT: some progress:

_draw:
        movem.l a2/d5/d4/d3/d2,-(sp)
        move.l 24(sp),a1
        move.l 28(sp),a2
        move.l 32(sp),d0
        lsl.l #6,d0
        move.l d0,a0
        addq.l #2,a0
        add.l __palette,a0
        clr.l d1
.L12:
        move.l (a2)+,d0
        jeq .L3
        bfextu d0{#4:#4},d3
        bfextu d0{#8:#4},d2
        bfextu d0{#12:#4},d5
        bfextu d0{#0:#4},d4
        move.w (a0,d4.l*4),d4
        jlt .L4
        move.w d4,(a1,d1.l)
.L4:
        bfextu d0{#16:#4},d4
        move.w (a0,d3.l*4),d3
        jlt .L5
        move.w d3,2(a1,d1.l)
.L5:
        bfextu d0{#20:#4},d3
        move.w (a0,d2.l*4),d2
        jlt .L6
        move.w d2,4(a1,d1.l)
.L6:
        bfextu d0{#24:#4},d2
        move.w (a0,d5.l*4),d5
        jlt .L7
        move.w d5,6(a1,d1.l)
.L7:
        bfextu d0{#28:#4},d0
        move.w (a0,d4.l*4),d4
        jlt .L8
        move.w d4,8(a1,d1.l)
.L8:
        move.w (a0,d3.l*4),d3
        jlt .L9
        move.w d3,10(a1,d1.l)
.L9:
        move.w (a0,d2.l*4),d2
        jlt .L10
        move.w d2,12(a1,d1.l)
.L10:
        move.w (a0,d0.l*4),d0
        jlt .L3
        move.w d0,14(a1,d1.l)
.L3:
        add.l #768,d1
        cmp.l #6144,d1
        jne .L12
        movem.l (sp)+,d2/d3/d4/d5/a2
        rts

bebbo commented 5 years ago

I think the code should look more like this -- unless I'm missing something... https://paste.ofcode.org/fGyJGmyNSEEVHfr3B2DuRL

that code is wrong, since it uses move.l for 16 bit values ...

nonarkitten commented 5 years ago

That looks 90% better -- the only issue would be the out of order first four bfextu instructions.

The code I output was from the Compiler Explorer simply devoid of the the extraneous operations. I wasn't actually checking for accuracies in the moves.

nonarkitten commented 5 years ago

I assume -m68080 is mostly a stub, perhaps the Apollo team can work together with Bebbo with proper optimization for 68080? Assuming the compiler to make the best possible code for 68080 when it tries to make the best possible code for the older CPU's doesn't end well. Might as well do inline ASM functions then? Maybe in the future a -m68020-080 option could be available for backwards compatibility. …

I'm sure the team would love this. But a lot of the fixes that apply to 68080 apply to the 68060 as well. The pipelining requires that results for EA calculation be ready earlier or it can cause a bubble. So GCC's tendency to nest things together is actually detrimental for performance.

bebbo commented 5 years ago

That looks 90% better -- the only issue would be the out of order first four bfextu instructions.

why? since d0 is read only they should not block

nonarkitten commented 5 years ago

In the case where we move, then yes, the pipeline can forward the results from the first to second instruction and not cause a stall. This is not the case with EA computation.

        bfextu d0{#0:#4},d4
        move.w (a0,d4.l*4),d4

The problem is d4 and when EA gets computed in the pipeline, the above sequence will cause a two cycle stall while the move waits for d4. This is pointed out in 10.2 (Point 3) in the 68060 manual.

bebbo commented 5 years ago

Okay, but that would be a new ticket.

bebbo commented 5 years ago

it's not live yet

bebbo commented 5 years ago

please test

nonarkitten commented 5 years ago

Pointless test came back. This may be a new issue.

        move.l (a0,d2.l*4),d2
        tst.w d2
        jlt .L10

bebbo commented 5 years ago

that test is not pointless, since the move might set the upper 2 bytes to nonzero.

nonarkitten commented 5 years ago

Good point. I don't believe the C standard defines how to handle a cast to smaller type, so here the move.l could be changed to a move.w. I missed that.

And this can be fixed in C code by setting the pixel array all to int16_t, which is probably the more correct way to do this anyway. So I would say this is fixed.

bebbo / gcc

Sub optimal code (more bitfield fun) #67