Closed GunnarVB closed 3 years ago
The LOOP should us a construct which counts down like
SUBQ.L #1,D1 ; BNE
Using a Loopcounter which counts down will remove the dependancy issue with the work pointer and will allow parallel execution.
Changing the count direction would change the order how the memory is set. This is not a valid approach for a C-compiler.
I think you misunderstood what I said:
GOOD CODE will look like this
move.l #length,D0
.L3:
move.l #50,(a0)+
subq.l #1,d0
jne .L3
** Also GCC 2.9 will create code like this for 68k. And GCC 6.5 will create code like this also for other targets like PPC. It will for GCC correctly avoid thew dependency of Pointer and counter!
The best code is when you Make the PTR access ++ in the PTR register and independant from this and in parallel count the workloop down in another Data register
guess you have to rewrite your code:
int ArrayFill(int *array, int len) {
int i = 0, j = len - 1;
for(; j >= 0; ++i, --j) {
array[i] = 50;
}
return 1;
}
this yields:
_ArrayFill:
moveq #-1,d0
add.l (8,sp),d0
jmi .L6
move.l (4,sp),a0
.L3:
moveq bebbo/amiga-gcc#50,d1
move.l d1,(a0)+
dbra d0,.L3
clr.w d0
subq.l bebbo/amiga-gcc#1,d0
jcc .L3
.L6:
moveq bebbo/amiga-gcc#1,d0
rts
.L3:
moveq bebbo/amiga-gcc#50,d1
move.l d1,(a0)+
dbra d0,.L3
Hallo Stefan,
Der MOVEQ kosten jeden Loopdurhcgang einen unnötigen Cycle Der Befehl sollte aus dem LOOP raus. Also gut wäre den MOVE vor den LOOP, oder gar keinen MOVEQ und stattdessen MOVE.L #im,(A0)+ OK?
Wegen den Befehlen zum Loop machen. GCC 2.9 hat den LOOP automatisch sehr gut umgesetzt:
MOVE.L #Länge,D0
LOOP:
SUBQ.L bebbo/amiga-gcc#1,D0
bne LOOP
Der code ist sehr gut. Das ist schnell und auch ohne Abhängigkeit zum Adr-Ptr. Es wäre gut wenn GCC 6.5 das genauso machen würde.
SUBQ.L #1,D0
BNE.s LOOP
the loop code of 2.95 is
L12:
moveq bebbo/amiga-gcc#50,d1
movel d1,a0@+
subql bebbo/amiga-gcc#1,d0
jne L12
pulling the constant out is not that easy, since there are passes which join the insn - and on the Amiga it gets split after all loop stuff is done...
... it's possible to use the constant directly:
.L11:
move.l bebbo/amiga-gcc#50,(a0)+
cmp.l a0,a1
jne .L11
or if you change the loop to count down:
.L3:
move.l bebbo/amiga-gcc#50,(a0)+
dbra d0,.L3
clr.w d0
subq.l bebbo/amiga-gcc#1,d0
jcc .L3
... it's possible to use the constant directly:
.L11: move.l #50,(a0)+
Yes this will be much better.
or if you change the loop to count down:
You can call my crazy but I think it would be nice if GCC improves all programs. And if people not need to change each of their loop per hand in all programs and all operating systems they might want to compile. ;-)
What do you think can we fix this? GCC 2.9 did do this code for the Loop
SUBQ.L #1,D0
BNE
Can we get the same back? This code is better as its not depending on the PTR and can be executed without dependancy
SUBQ.L #1,D0 BNE
Can we get the same back? This code is better as its not depending on the PTR and can be executed without dependancy
negative. Guess you have to open a real gcc issue for that - and no other platform seems to need that...
=> rewrite the loop
Guess you have to open a real gcc issue for that - and no other platform seems to need that... What makes you think this?
Everybody will agree that creating an unneeded register dependency is hurting every platform. This is just logical.
If you look at the code generated by GCC for PowerPC then you see that is will not use a CMP on the PTR but will use PTR and Countdown register fully independent.
By using the "(A0)++" operation and the "CMP A0", the compiler creates a register dependency which the C high level code did not have! Or to say it clearly: The created code got much worse to the C original.
Would it be possible for GCC to model the 68K pipeline and for GCC to just "see" the dependency? Then GCC would see this is bad code.
Or would it be simpler for you to cheat and make the combo of "SUBq.l #0,D0; BNE" a little cheaper than the "CMPA BNE "
this is the loop body for x86:
.L3:
movl $50, (%rdi)
addq $4, %rdi
cmpq %rax, %rdi
jne .L3
.L2:
And the reason is, that gcc only introduces a count down tmp variable for const int limits.
the code for arm:
.L3:
str w3, [x0, x2, lsl 2]
add x2, x2, 1
cmp w1, w2
bgt .L3
Or would it be simpler for you to cheat and make the combo of "SUBq.l #0,D0; BNE" a little cheaper than the "CMPA BNE "
Or would it be simpler for you to cheat and make the combo of "SUBq.l #0,D0; BNE" a little cheaper than the "CMPA BNE "
won't work... sorry
I found an old gcc version (4.8.5) for powerpc which did it: https://godbolt.org/z/o9Knav
=> https://franke.ms/cex/z/93a7cb
the loop header can be improved... but the loop body?
well done!
This is a big improvement!
fixed
Hello,
Please allow me to report a performance slow down by several times for some code, when comparing to GCC 2.9
Example Code:
** ASM created by GCC 6.5 -mregparm -m68080 -O3 -fomit-frame-pointer
** The performance problem is clearly visible. We have 4 instruction in the workloop.
LOOP BODY
**
Proposed Solution: 1) The MOVEQ needs be moved outside the Loop. 2) The LOOP should us a construct which counts down like
SUBQ.L #1,D1 ; BNE
Using a Loopcounter which counts down will remove the dependancy issue with the work pointer and will allow parallel execution.