Closed: nonarkitten closed this issue 5 years ago
Also this (lines 16-18):
movel d1,d0
moveq #28,d6
lsrl d6,d0
Is probably a tad slower than this:
bfextu d1{#0:#4},d0
It's also clearly breaking order, since the code was p0, p1, p2 and p3. The compiler ERRONEOUSLY decided to do p1, p2, p3 and then p0 immediately before the EA calculation which will cause a stall. The bfextu SHOULD go before line 13.
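To make the comparison concrete, here is the C expression both instruction sequences implement (a sketch, not from the original source): in Motorola bit-field numbering offset 0 is the most significant bit, so `bfextu d1{#0:#4}` extracts bits 31..28, i.e. an unsigned right shift by 28.

```c
#include <stdint.h>

/* Both sequences compute the top 4 bits of a 32-bit word. */
uint32_t top_nibble(uint32_t d1)
{
    return d1 >> 28; /* moveq #28 + lsr.l, or a single bfextu */
}
```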
Hmmm, same with this:
L10:
movew d1,d4
andw #15,d4
Versus
bfextu d1{#28:#4},d4
move.w as well as andi take 1 cycle on the 68060,
bfextu takes 6 cycles,
moveq takes 1 cycle,
lsr takes 1 cycle
=> all variants with 2 or 3 instructions are faster than bfextu.
but with -Os the shorter one should be preferred...
plus you should not use short ints.
try
uint32_t pixel1,pixel2,pixel3,pixel4;
and be surprised.
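A sketch of what that suggestion looks like in practice (the names are illustrative, not from the original source): with uint32_t temporaries the compiler never has to re-truncate each extracted value to 16 bits after the shift.

```c
#include <stdint.h>

/* Unpack four 4-bit pixels into full-width temporaries. */
void unpack4(uint32_t w, uint32_t px[4])
{
    px[0] = w >> 28;         /* top nibble needs no mask at all */
    px[1] = (w >> 24) & 15u;
    px[2] = (w >> 20) & 15u;
    px[3] = (w >> 16) & 15u;
}
```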
lol, can't eliminate the condition.
compiling for -Os
And sorry, used to 68080 timings. Bit-fields on the Apollo are single cycle.
So changing the -m to 68080 (consider my mind blown), it switches back to bfextu BUT still has the and...?
bfextu d0{#0:#4},d4
moveq #15,d6
and.l d6,d4
move.l (a0,d4.l*4),d4
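The redundancy is easy to state in C (a minimal check, not the project's code): an unsigned 4-bit extract already leaves the other 28 bits zero, so the follow-up `and.l #15` cannot change the value.

```c
#include <stdint.h>

/* Masking a 4-bit extract with 15 is a no-op. */
uint32_t extract_masked(uint32_t x) { return (x >> 28) & 15u; }
uint32_t extract_plain(uint32_t x)  { return x >> 28; }
```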
I think the code should look more like this -- unless I'm missing something... https://paste.ofcode.org/fGyJGmyNSEEVHfr3B2DuRL
Okay, realized I was making assumptions about the palette structure the compiler couldn't know. This is a lot better but there's still a lot of extraneous bit-masking that's pointless after the bfextu command (all 28 other bits will always be zero).
I assume -m68080 is mostly a stub; perhaps the Apollo team can work together with Bebbo on proper optimization for the 68080? Expecting the compiler to make the best possible code for the 68080 while it is tuned to make the best possible code for the older CPUs doesn't end well. Might as well do inline ASM functions then?
Maybe in the future a -m68020-080 option could be available for backwards compatibility.
On Wed, Feb 13, 2019 at 1:09 AM nonarkitten notifications@github.com wrote:
Okay, realized I was making assumptions about the palette structure the compiler couldn't know. This is a lot better but there's still a lot of extraneous bit-masking that's pointless after the bfextu command (all 28 other bits will always be zero).
EDIT: some progress:
_draw:
movem.l a2/d5/d4/d3/d2,-(sp)
move.l 24(sp),a1
move.l 28(sp),a2
move.l 32(sp),d0
lsl.l #6,d0
move.l d0,a0
addq.l #2,a0
add.l __palette,a0
clr.l d1
.L12:
move.l (a2)+,d0
jeq .L3
bfextu d0{#4:#4},d3
bfextu d0{#8:#4},d2
bfextu d0{#12:#4},d5
bfextu d0{#0:#4},d4
move.w (a0,d4.l*4),d4
jlt .L4
move.w d4,(a1,d1.l)
.L4:
bfextu d0{#16:#4},d4
move.w (a0,d3.l*4),d3
jlt .L5
move.w d3,2(a1,d1.l)
.L5:
bfextu d0{#20:#4},d3
move.w (a0,d2.l*4),d2
jlt .L6
move.w d2,4(a1,d1.l)
.L6:
bfextu d0{#24:#4},d2
move.w (a0,d5.l*4),d5
jlt .L7
move.w d5,6(a1,d1.l)
.L7:
bfextu d0{#28:#4},d0
move.w (a0,d4.l*4),d4
jlt .L8
move.w d4,8(a1,d1.l)
.L8:
move.w (a0,d3.l*4),d3
jlt .L9
move.w d3,10(a1,d1.l)
.L9:
move.w (a0,d2.l*4),d2
jlt .L10
move.w d2,12(a1,d1.l)
.L10:
move.w (a0,d0.l*4),d0
jlt .L3
move.w d0,14(a1,d1.l)
.L3:
add.l #768,d1
cmp.l #6144,d1
jne .L12
movem.l (sp)+,d2/d3/d4/d5/a2
rts
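For reference, here is a hedged C reconstruction of the loop above. Everything in it is inferred from the assembly, not taken from the original source: the palette row appears to hold 16 entries of 4 bytes each with the 16-bit colour in the low word, a negative colour means "don't write" (transparency, the jlt branches), the destination row stride is 768 bytes (384 words), and 8 source longwords are processed. The asm derives the row pointer from a global __palette; here it is a parameter so the sketch is self-contained.

```c
#include <stdint.h>

void draw_row(uint16_t *dst, const uint32_t *src, const int32_t *pal)
{
    for (uint32_t y = 0; y < 8; y++) {
        uint32_t w = src[y];
        if (w == 0)          /* jeq .L3: all-zero word skips the row */
            continue;
        for (uint32_t x = 0; x < 8; x++) {
            /* bfextu w{#4x:#4}: pixel x is the x-th nibble from the top */
            int16_t c = (int16_t)pal[(w >> (28 - 4 * x)) & 15u];
            if (c >= 0)      /* jlt skips the store on a negative colour */
                dst[y * 384 + x] = (uint16_t)c;
        }
    }
}
```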
I think the code should look more like this -- unless I'm missing something... https://paste.ofcode.org/fGyJGmyNSEEVHfr3B2DuRL
that code is wrong, since it uses move.l for 16 bit values ...
That looks 90% better -- the only issue would be the out of order first four bfextu instructions.
The code I output was from Compiler Explorer, simply devoid of the extraneous operations. I wasn't actually checking the moves for accuracy.
I assume -m68080 is mostly a stub, perhaps the Apollo team can work together with Bebbo with proper optimization for 68080? Assuming the compiler to make the best possible code for 68080 when it tries to make the best possible code for the older CPU's doesn't end well. Might as well do inline ASM functions then? Maybe in the future a -m68020-080 option could be available for backwards compatibility. …
I'm sure the team would love this. But a lot of the fixes that apply to 68080 apply to the 68060 as well. The pipelining requires that results for EA calculation be ready earlier or it can cause a bubble. So GCC's tendency to nest things together is actually detrimental for performance.
That looks 90% better -- the only issue would be the out of order first four bfextu instructions.
why? since d0 is read only they should not block
In the case where we move, then yes, the pipeline can forward the results from the first to second instruction and not cause a stall. This is not the case with EA computation.
bfextu d0{#0:#4},d4
move.w (a0,d4.l*4),d4
The problem is d4: when the EA gets computed in the pipeline, the above sequence will cause a two-cycle stall while the move waits for d4. This is pointed out in Section 10.2 (point 3) of the 68060 manual.
Okay, but that would be a new ticket.
it's not live yet
please test
Pointless test came back. This may be a new issue.
move.l (a0,d2.l*4),d2
tst.w d2
jlt .L10
that test is not pointless, since the move might set the upper 2 bytes to nonzero.
Good point. The C standard leaves conversion to a smaller signed type implementation-defined when the value doesn't fit (narrowing to an unsigned type is well-defined, modulo 2^N), so here the move.l could be changed to a move.w. I missed that.
And this can be fixed in C code by setting the pixel array all to int16_t, which is probably the more correct way to do this anyway. So I would say this is fixed.
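A small check of why the word-sized test is needed and why int16_t entries fix it (a sketch, not the project's code): a 32-bit load can carry nonzero upper bytes, while narrowing to uint16_t is well-defined in C as reduction modulo 2^16, keeping only the low 16 bits. With an int16_t palette there is no wide load to truncate in the first place.

```c
#include <stdint.h>

/* Well-defined narrowing: keeps the low 16 bits. */
uint16_t low_word(uint32_t v) { return (uint16_t)v; }
```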
https://tinyurl.com/y4b9dk8x
There are TWO issues here.
The first, which 2.95 gets wrong and 6.5 gets correct, is the weird, completely unnecessary move to aX before the second move, since the bfextu guarantees that the high word is already cleared. Such as this:
However, in all OTHER regards, 2.95 is still making substantially better code here.