for() loop setup generates a call to ___mulsi3 when a mulu.w would work.

darrenfreeman commented 3 years ago

Using commit 03d95f15d2d7c92ed253152c0e50bb080ee323d4 .

The for() loop has been refactored into a loop using pointer increments, and comparison against a pointer to the end of the array. This is great to see! However, calculation of the end address is using the dreaded ___mulsi3 when the loop count is a byte. See also issue #165 for a discussion on why even mulu.w #6,d0 is far from optimal.

Note that this seems to be a wider problem, I see unwanted calls to ___mulsi3 all the time. It's really expensive!

I would like to firstly see the compiler using a multiply instruction whenever the operands are 16-bits or less.

But what I had in mind, for this loop, is the compiler using dbra rather than comparing the pointer with the end of the array. That sidesteps the need for a multiply. I have made my code horribly unreadable by explicitly writing loops to target dbra, but in the simple cases where the loop is already being refactored, I was hoping it would tend to pick dbra.

bebbo commented 3 years ago

-mcpu=cpu

    Generate code for a specific M680x0 or ColdFire processor.  ...

-mtune=tune

    Tune the code for a particular microarchitecture within the constraints set by -march and -mcpu. ...

have a look here: https://gcc.gnu.org/onlinedocs/gcc/M680x0-Options.html#M680x0-Options

Your options do not specify -mcpu which selects the default on gcc6.5.0b which is -mcpu=68000.

Since pointer arithmetic is involved the math is 32bit.
optimzing for 68060 means: don't use muls.l dx,dy,dz or similar 32x32->64
since no mul is used, there is no mul to optimize via add/lsl/sub/whatever
68000 must use ___mulsi3 to multiply

=> muls.l dx,dy should be allowed for 68060

bebbo commented 3 years ago

it should look like: https://franke.ms/cex/z/667z8q

btw: why a 32 bit counter? The counter is also used in pointer arithmetics. Thus before optimization it has to be extended from byte to int. The optimization reduces this to use a 32 counter instead. Later it's use in pointer arithmetic is replaced with auto increment. There is no way to know that now a 8 bit counter would be sufficient.

darrenfreeman commented 3 years ago

You have convinced me not to tune for the higher processors when compiling for 68000, and to be especially careful with the 060. But some of what I'm seeing doesn't appear to be sensible behaviour.

I note that simply dropping the -60 from the end of the compiler options, in the toy code that I gave, swaps the function-call-multiply with a few additions instead. So tuning for an 060 actually causes it to use a function call, when tuning for an 020 uses a bunch of adds. This can't be correct behaviour.

As a second issue, indeed the compiler is promoting pointer arithmetic multiplies to 32-bits, and to support the 68000, every 32-bit multiply becomes a function call. If you follow the generated assembly, the multiply is being done between types that fit in <= 16 bits. I understand there is a reason for the compiler forgetting the size of the type while generating the assembly, but the output is still some of the worst code ever seen.

I have been really mangling my source to try to avoid this sort of thing in the output, array indexing really shouldn't be so expensive as to involve a function call every time. There appears to be no benefit to using an index, or array size, smaller than 32-bits.

bebbo commented 3 years ago

some hints:

do not use byte as counters, use short or int

try simple code, e.g this one yields a better result

void foo2(short num) {
struct Thing *thing = things;
while (num-- > 0) {      
thing->a = 0;
thing->b = 0;
thing->c = 0;
++thing;
}
}

bebbo commented 2 years ago

can't do more - code seems ok now

bebbo / gcc

for() loop setup generates a call to ___mulsi3 when a mulu.w would work. #166