bebbo / gcc

Bebbo's gcc-6-branch for m68k-amigaos
GNU General Public License v2.0
33 stars 11 forks source link

for() loop setup generates a call to ___mulsi3 when a mulu.w would work. #166

Closed darrenfreeman closed 2 years ago

darrenfreeman commented 3 years ago

Using commit 03d95f15d2d7c92ed253152c0e50bb080ee323d4 .

Toy code: http://franke.ms/cex/z/K7z9eK

The for() loop has been refactored into a loop using pointer increments, and comparison against a pointer to the end of the array. This is great to see! However, calculation of the end address is using the dreaded ___mulsi3 when the loop count is a byte. See also issue #165 for a discussion on why even mulu.w #6,d0 is far from optimal.

Note that this seems to be a wider problem, I see unwanted calls to ___mulsi3 all the time. It's really expensive!

I would like to firstly see the compiler using a multiply instruction whenever the operands are 16-bits or less.

But what I had in mind, for this loop, is the compiler using dbra rather than comparing the pointer with the end of the array. That sidesteps the need for a multiply. I have made my code horribly unreadable by explicitly writing loops to target dbra, but in the simple cases where the loop is already being refactored, I was hoping it would tend to pick dbra.

bebbo commented 3 years ago
-mcpu=cpu

    Generate code for a specific M680x0 or ColdFire processor.  ...

-mtune=tune

    Tune the code for a particular microarchitecture within the constraints set by -march and -mcpu. ...

have a look here: https://gcc.gnu.org/onlinedocs/gcc/M680x0-Options.html#M680x0-Options

Your options do not specify -mcpu which selects the default on gcc6.5.0b which is -mcpu=68000.

=> muls.l dx,dy should be allowed for 68060

bebbo commented 3 years ago

it should look like: https://franke.ms/cex/z/667z8q

btw: why a 32 bit counter? The counter is also used in pointer arithmetics. Thus before optimization it has to be extended from byte to int. The optimization reduces this to use a 32 counter instead. Later it's use in pointer arithmetic is replaced with auto increment. There is no way to know that now a 8 bit counter would be sufficient.

darrenfreeman commented 3 years ago

You have convinced me not to tune for the higher processors when compiling for 68000, and to be especially careful with the 060. But some of what I'm seeing doesn't appear to be sensible behaviour.

I note that simply dropping the -60 from the end of the compiler options, in the toy code that I gave, swaps the function-call-multiply with a few additions instead. So tuning for an 060 actually causes it to use a function call, when tuning for an 020 uses a bunch of adds. This can't be correct behaviour.

As a second issue, indeed the compiler is promoting pointer arithmetic multiplies to 32-bits, and to support the 68000, every 32-bit multiply becomes a function call. If you follow the generated assembly, the multiply is being done between types that fit in <= 16 bits. I understand there is a reason for the compiler forgetting the size of the type while generating the assembly, but the output is still some of the worst code ever seen.

I have been really mangling my source to try to avoid this sort of thing in the output, array indexing really shouldn't be so expensive as to involve a function call every time. There appears to be no benefit to using an index, or array size, smaller than 32-bits.

bebbo commented 3 years ago

some hints:

bebbo commented 2 years ago

can't do more - code seems ok now