ekeeke / Genesis-Plus-GX

An enhanced port of Genesis Plus - accurate & portable Sega 8/16 bit emulator

Remove ALIGN_LONG and simplify code #360

Closed davidgfnet closed 3 years ago

davidgfnet commented 3 years ago

So it turns out that ARM unaligned loads are quite fun. Running Genesis (built for armv7) on an armv8 kernel results in crashes due to stm/ldm instructions. The reason is that armv7 kernels fix up the unaligned instructions, but armv8 kernels don't do that anymore.

This patch falls back to code that always works on all platforms. In fact, modern gcc/clang produce the same optimized code as before (a single load/store) on platforms that support it, like x86 or powerpc.
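For reference, here is a minimal sketch of the kind of alignment-safe helpers the patch relies on (names and exact shape are illustrative, little-endian variant only; this is not the literal diff):

#include <stdint.h>

/* Assemble/split a 32-bit value from individual bytes: safe for any
   alignment, and easy for compilers to fold back into a single
   load/store on targets where unaligned access is legal. */
static inline uint32_t read_long_safe(const void *address)
{
  const uint8_t *p = (const uint8_t *)address;
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static inline void write_long_safe(void *address, uint32_t data)
{
  uint8_t *p = (uint8_t *)address;
  p[0] = (uint8_t)data;
  p[1] = (uint8_t)(data >> 8);
  p[2] = (uint8_t)(data >> 16);
  p[3] = (uint8_t)(data >> 24);
}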

davidgfnet commented 3 years ago

Let me know if you wanna discuss this PR or need some clarification/further fixes. Thanks a lot!

ekeeke commented 3 years ago

The problem I see with this PR is that it falls back to always using 4-byte accesses instead of a direct long word access, even when the address is aligned to a 4-byte boundary or when the platform does not need -DALIGN_LONG. That surely makes the code less efficient on platforms whose compilers don't know how to optimize it, so I certainly cannot accept it in that state, sorry.

I am not sure I understand your issue: shouldn't it be fixed by defining -DALIGN_LONG only for ARMv8 platforms and not for ARMv7 platforms, and generating different executables depending on the platform? What 'unaligned instruction' are you talking about? There shouldn't be any if you define ALIGN_LONG, since the macros fall back to byte access if the address is not aligned to a long word boundary. Or do you mean that some ARM compilers optimize these 4 byte accesses into a single long word access if they feel the processor can deal with misaligned access? If that's really the case, how would this fix your use case, since it means optimized code would still be produced when compiled for armv7 and would still crash on armv8, right? It seems to me it works only because the produced code now always uses unoptimized access.

Anyway, the fact that compilers are able to optimize consecutive byte accesses would need to be confirmed for every compiler and every platform, so I would prefer to keep the ALIGN_LONG define and use the READ_LONG and WRITE_LONG macros only when needed.
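For readers without the source open, the existing macros behave roughly like this (paraphrased little-endian read path only, not a verbatim copy of the emulator code): when ALIGN_LONG is defined, the address is tested and the byte-wise path is taken only for misaligned addresses.

#include <stdint.h>

/* Rough paraphrase of the ALIGN_LONG read path: direct 32-bit access when
   the address is 4-byte aligned, byte access plus shifts otherwise. */
static inline uint32_t read_long(const void *address)
{
  if ((uintptr_t)address & 3)
  {
    const uint8_t *p = (const uint8_t *)address;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
  }
  return *(const uint32_t *)address; /* aligned fast path: one 32-bit load */
}

Without ALIGN_LONG, the macros reduce to plain 32-bit pointer accesses.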

EDIT: changed a few wordings as I initially missed the end of your post about compiler optimization

ekeeke commented 3 years ago

Looking a bit more into it, it looks like your crash occurs because of STM/LDM instructions being used by the armv7 compiler to deal with 32-bit accesses when it knows they are aligned (which is the case in the 'else' part of the existing macros that you removed), and the fact that these two instructions are not supported on ARMv8.

And your fix basically forces the compiler to always use normal byte load/store instructions compatible with both ARMv7 and ARMv8 platforms.

Rather than hacking the code to force the compiler to generate the most compatible (but less optimized) code, a cleaner fix in my opinion would be for the libretro Makefiles to produce different executables for ARMv7 and ARMv8 and NOT use the ARMv7 executable on ARMv8 (or the opposite), since their instruction sets are not 100% compatible.

davidgfnet commented 3 years ago

Correct me if I'm wrong, but all builds should be using ALIGN_LONG except for x86/64 and perhaps powerpc, right? Both x86 and powerpc gcc/clang already emit a single mov/load to replace the inline function (I tested new and old gcc versions). Or am I missing something? BTW I have a framework to run the emu with hundreds of ROMs and I could measure the performance impact of this, if you are interested. Thanks for replying and looking into this PR! Appreciated!

ekeeke commented 3 years ago

This is the code (roughly corresponding to the DRAW_COLUMN macro extracted from the render_bg_m5 function) generated by gcc 10.x for the Gamecube/Wii standalone version (PowerPC CPU) with the current version (not using READ_LONG/WRITE_LONG):

.text1:80032364                 rlwinm  %r18, %r5, 22,13,25
.text1:80032368                 clrlslwi %r4, %r5, 19,6
.text1:8003236C                 lis     %r7, ((unk_803DEA04+0x10000)@h)
.text1:80032370                 or      %r18, %r18, %r11
.text1:80032374                 subi    %r7, %r7, 0x15FC # unk_803DEA04
.text1:80032378                 or      %r4, %r4, %r11
.text1:8003237C                 lis     %r24, ((unk_802ACFA4+0x10000)@h)
.text1:80032380                 add     %r17, %r7, %r18
.text1:80032384                 rlwinm  %r16, %r5, 5,27,29
.text1:80032388                 subi    %r24, %r24, 0x305C # unk_802ACFA4
.text1:8003238C                 rlwinm  %r5, %r5, 21,27,29
.text1:80032390                 add     %r12, %r7, %r4
.text1:80032394                 lwzx    %r15, %r24, %r16
.text1:80032398                 lwzx    %r18, %r7, %r18
.text1:8003239C                 lwzx    %r16, %r24, %r5
.text1:800323A0                 lwz     %r17, 4(%r17)
.text1:800323A4                 or      %r18, %r18, %r15
.text1:800323A8                 lwz     %r5, 4(%r12)
.text1:800323AC                 lwzx    %r7, %r7, %r4
.text1:800323B0                 or      %r17, %r17, %r15
.text1:800323B4                 or      %r5, %r5, %r16
.text1:800323B8                 stw     %r17, -0xC(%r8)
.text1:800323BC                 or      %r7, %r7, %r16
.text1:800323C0                 stw     %r18, -0x10(%r8)
.text1:800323C4                 stw     %r5, -4(%r8)
.text1:800323C8                 stw     %r7, -8(%r8)

And this is the code using your changes (forcing use of READ_LONG/WRITE_LONG):

.text1:800323C4                 rlwinm  %r5, %r7, 22,13,25
.text1:800323C8                 clrlslwi %r17, %r7, 19,6
.text1:800323CC                 lis     %r23, ((unk_803DEEE4+0x10000)@h)
.text1:800323D0                 or      %r5, %r5, %r25
.text1:800323D4                 subi    %r23, %r23, 0x111C # unk_803DEEE4
.text1:800323D8                 or      %r17, %r17, %r25
.text1:800323DC                 lis     %r4, ((unk_802AD484+0x10000)@h)
.text1:800323E0                 add     %r15, %r23, %r17
.text1:800323E4                 add     %r11, %r23, %r5
.text1:800323E8                 subi    %r4, %r4, 0x2B7C # unk_802AD484
.text1:800323EC                 rlwinm  %r12, %r7, 5,27,29
.text1:800323F0                 rlwinm  %r7, %r7, 21,27,29
.text1:800323F4                 lwzx    %r16, %r23, %r5
.text1:800323F8                 lwzx    %r7, %r4, %r7
.text1:800323FC                 lwzx    %r5, %r4, %r12
.text1:80032400                 lwz     %r11, 4(%r11)
.text1:80032404                 lwz     %r4, 4(%r15)
.text1:80032408                 lwzx    %r23, %r23, %r17
.text1:8003240C                 or      %r11, %r5, %r11
.text1:80032410                 or      %r4, %r7, %r4
.text1:80032414                 or      %r5, %r5, %r16
.text1:80032418                 or      %r7, %r7, %r23
.text1:8003241C                 srwi    %r16, %r11, 8
.text1:80032420                 srwi    %r17, %r11, 16
.text1:80032424                 srwi    %r23, %r11, 24
.text1:80032428                 stb     %r11, -9(%r8)
.text1:8003242C                 srwi    %r11, %r4, 8
.text1:80032430                 srwi    %r15, %r4, 16
.text1:80032434                 stb     %r5, -0xD(%r8)
.text1:80032438                 stb     %r16, -0xA(%r8)
.text1:8003243C                 srwi    %r16, %r4, 24
.text1:80032440                 stb     %r17, -0xB(%r8)
.text1:80032444                 srwi    %r17, %r5, 8
.text1:80032448                 stb     %r23, -0xC(%r8)
.text1:8003244C                 srwi    %r23, %r5, 16
.text1:80032450                 stb     %r4, -1(%r8)
.text1:80032454                 srwi    %r5, %r5, 24
.text1:80032458                 srwi    %r4, %r7, 16
.text1:8003245C                 stb     %r7, -5(%r8)
.text1:80032460                 stb     %r11, -2(%r8)
.text1:80032464                 srwi    %r11, %r7, 8
.text1:80032468                 srwi    %r7, %r7, 24
.text1:8003246C                 stb     %r15, -3(%r8)
.text1:80032470                 stb     %r16, -4(%r8)
.text1:80032474                 stb     %r17, -0xE(%r8)
.text1:80032478                 stb     %r23, -0xF(%r8)
.text1:8003247C                 stb     %r5, -0x10(%r8)
.text1:80032480                 stb     %r11, -6(%r8)
.text1:80032484                 stb     %r4, -7(%r8)
.text1:80032488                 stb     %r7, -8(%r8)

The last one is clearly not optimized: we can see it uses (as expected) byte accesses and bit-shifting instructions, as in the originating C code.

I have a hard time believing all compilers are able to optimize the READ_LONG and WRITE_LONG macros into single long word access instructions when the C code deliberately uses byte accesses and bit shifting. So the ALIGN_LONG define and the C code using 32-bit pointer access should definitely stay for CPUs able to deal with unaligned long word access (which is pretty much everything except ARM, I think).
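If it helps, that can be checked per compiler and target without building the whole emulator; a stand-alone probe (my own test case, not taken from the repository) can be compiled with e.g. gcc -O2 -S, or inspected on Compiler Explorer:

#include <stdint.h>

/* Does this compiler fold four byte loads plus shifts into a single
   32-bit load on the chosen target? Inspect the generated assembly. */
uint32_t probe_read(const uint8_t *p)
{
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

On x86-64, recent gcc/clang at -O2 typically emit a single mov for this; the PowerPC listing above is a data point in the other direction, at least on the store side.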

I also believe that the LDM/STM instructions used by the ARMv7 compiler (or other long word instructions used by the ARMv8 compiler) when the address is aligned are more efficient than always using alignment-safe byte access instructions, so your change most likely introduces some performance penalty on ARM platforms.

Hence I still think that the correct (and simplest) solution to your problem is to: 1) build different executables for ARMv7 and ARMv8 platforms, and 2) not use the executable built for ARMv7 on ARMv8 platforms.

ARMv7 and ARMv8 instruction sets are definitely not 100% compatible, so I do not understand why you would expect an executable built with an ARMv7 compiler to be compatible with an ARMv8 platform, or why we should hack the code to try to force compilers to generate instructions compatible with both platforms. Surely there might be other places in the code where the ARMv7 compiler could choose to use LDM/STM instructions, which would still generate an exception on ARMv8, regardless of aligned or unaligned memory access.

From the libretro side, I believe it's just a matter of adding separate support for both platforms in the Makefile, with different "-march" settings, right?
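Purely as an illustration (this is not an existing header in the repo or in the libretro build), the per-platform choice could also be centralized with standard compiler predefines instead of passing -DALIGN_LONG per Makefile target:

/* Illustrative sketch only: enable ALIGN_LONG automatically on ARM targets
   using real GCC/Clang predefined macros. The policy itself (turn it on for
   all ARM builds) is an assumption, not the project's current setup. */
#if !defined(ALIGN_LONG) && (defined(__arm__) || defined(__aarch64__))
#define ALIGN_LONG 1
#endif

The same effect can of course be achieved with separate -march and -D settings per target in the Makefile, as suggested above.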

davidgfnet commented 3 years ago

I mean... the C standard mandates alignment. Quoting a good Stack Overflow post:

C11 (draft n1570) Appendix J.2:

1 The behavior is undefined in the following circumstances:

....

Conversion between two pointer types produces a result that is incorrectly aligned (6.3.2.3).

With 6.3.2.3p7 saying

[...] If the resulting pointer is not correctly aligned [68] for the referenced type, the behavior is undefined. [...]

But since it's undefined behaviour it might work... Or format your hard drive :P
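For what it's worth, the usual way to keep both the single-instruction codegen and strict standard conformance is a fixed-size memcpy (my own aside; neither the PR nor the current code uses this):

#include <stdint.h>
#include <string.h>

/* memcpy-based access is well defined for any alignment; gcc/clang at -O2
   typically lower a fixed 4-byte memcpy to a single load on targets where
   unaligned access is legal. The result is in host byte order, matching
   the direct *(uint32 *) fast path. */
static inline uint32_t read_long_memcpy(const void *address)
{
  uint32_t value;
  memcpy(&value, address, sizeof value);
  return value;
}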

ekeeke commented 3 years ago

That's a different issue: if you want the code to respect the C standard and always use aligned access under every circumstance, you need to use the ALIGN_LONG define.

As long as that C undefined behavior works fine (and gives well-defined results) on platforms that do not need ALIGN_LONG, there is no reason to remove this optional behavior, which produces more optimized code. (Also, if this is undefined behavior per the C standard, I have a hard time believing any C compiler would turn four consecutive byte pointer accesses into a single long word access instruction, with the risk of forcing an unaligned access; and if they do, it means they consider that it does not produce undefined behavior.)

Similarly, using -DALIGN_LONG produces an executable that runs fine on either ARMv7 or ARMv8 platforms, provided your compiler is correctly configured to emit instructions compatible with the specific architecture, so there should not be any reason to force the code to use four consecutive byte accesses when you know the address is correctly aligned for long word access.