devkitPro / nds-examples

Examples for Nintendo DS using devkitARM, calico, libnds, libdvm, maxmod, dswifi
108 stars 18 forks source link

Wrong alignment directive for arm functions in mixed thumb/arm code #15

Closed Kuratius closed 8 months ago

Kuratius commented 8 months ago

Bug Report

What's the issue you encountered?

DevKitPro's gcc may generate wrong alignment directives. This may be a bug in gcc rather than devkitpro.

How can the issue be reproduced?

Environment?

ubuntu

Additional context?

https://discourse.llvm.org/t/arm-doesnt-align-code-sections-by-4-bytes/5830#!

It seems that gcc ignores alignment directives.


ARM_CODE __attribute__((noinline)) float convert_fixed_to_float(s32 x) {
  // clzl is undefined if input is 0
  if (x == 0) {
    return 0;
  }

  u32 sign_bit = x & 0x80000000;
  // Absolute value of x
  u32 input = (x > 0) * x + (x < 0) * -x;

  u32 leading_zeros = __builtin_clzl(input);
  u32 exponent = (127 - (leading_zeros - 19)) << 23;

  u32 mantissa; 
  s32 shift = leading_zeros - 8;
  if (shift >= 0) {
    mantissa = (input << shift);
  } else {
    mantissa = (input >> (-shift));
  }
  mantissa &= 0x7FFFFF;

  u32 value = sign_bit | exponent | mantissa;

  return *(float*)&value;
}

this, together with the makefile from the helloworld example, modified to include -march=armv5te in the arch section, and -S in the CFlags section, generates the following assembly:

.LFE203:
    .size   _Z6Vblankv, .-_Z6Vblankv
    .align  2
    .global _Z22convert_fixed_to_floati
    .arch armv5te
    .fpu softvfp
    .syntax unified
    .arm
    .type   _Z22convert_fixed_to_floati, %function
_Z22convert_fixed_to_floati:
.LVL0:
.LFB204:
    .loc 1 44 72 is_stmt 1 view -0
    .cfi_startproc
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    .loc 1 46 3 view .LVU7
    cmp r0, #0
    .loc 1 46 3 is_stmt 0 view .LVU8
    beq .L9
    .loc 1 50 3 is_stmt 1 view .LVU9
    .loc 1 52 39 is_stmt 0 view .LVU10
    rsb r3, r0, #0
    .loc 1 52 37 view .LVU11
    movge   r3, #0
    .loc 1 52 27 view .LVU12
    cmp r0, #0
    addge   r3, r3, r0
    .loc 1 54 37 view .LVU13
    clz r1, r3
    .loc 1 50 7 view .LVU14
    and r2, r0, #-2147483648
.LVL1:
    .loc 1 52 3 is_stmt 1 view .LVU15
    .loc 1 54 3 view .LVU16
    .loc 1 55 3 view .LVU17
    .loc 1 55 23 is_stmt 0 view .LVU18
    rsb r0, r1, #146
.LVL2:
    .loc 1 59 3 view .LVU19
    subs    r1, r1, #8
.LVL3:
    .loc 1 62 27 view .LVU20
    rsbmi   r1, r1, #0
.LVL4:
    .loc 1 60 14 view .LVU21
    lslpl   r3, r3, r1
.LVL5:
    .loc 1 62 14 view .LVU22
    lsrmi   r3, r3, r1
    .loc 1 55 7 view .LVU23
    lsl r0, r0, #23
.LVL6:
    .loc 1 57 3 is_stmt 1 view .LVU24
    .loc 1 58 3 view .LVU25
    .loc 1 59 3 view .LVU26
    .loc 1 60 5 view .LVU27
    .loc 1 62 5 view .LVU28
    .loc 1 64 3 view .LVU29
    .loc 1 66 3 view .LVU30
    .loc 1 68 3 view .LVU31
    .loc 1 64 12 is_stmt 0 view .LVU32
    bic r3, r3, #-16777216
.LVL7:
    .loc 1 66 24 view .LVU33
    orr r0, r2, r0
.LVL8:
    .loc 1 64 12 view .LVU34
    bic r3, r3, #8388608
    .loc 1 68 20 view .LVU35
    orr r0, r0, r3
    bx  lr
.LVL9:
.L9:
    .loc 1 47 12 view .LVU36
    mov r0, #0
.LVL10:
    .loc 1 69 1 view .LVU37
    bx  lr
    .cfi_endproc

Arm code must have .align 4, not 2.

Imo this is probably still a bug, but probably a bug in gcc. The correct thing to do would be to generate .align 4.

Kuratius commented 8 months ago

https://stackoverflow.com/questions/12098044/why-is-gcc-emmiting-code-aligned-to-a-2-byte-boundary-for-the-arm-instruction-se

GCC apparently doesnt use .align the way the arm documentation does.

WinterMute commented 8 months ago

It's mentioned in the docs at https://sourceware.org/binutils/docs/as/Align.html but easy to miss if you don't know it's there.

For other systems, including ppc, i386 using a.out format, arm and strongarm, it is the number of low-order zero bits the location counter must have after advancement. For example ‘.align 3’ advances the location counter until it is a multiple of 8. If the location counter is already a multiple of 8, no change is needed.

WinterMute commented 8 months ago

fwiw. The nds has no FPU. We strongly recommend not using floating point math on this system.

Kuratius commented 8 months ago

It's in the context of the sm64 DSi port, which would require a lot of rewriting to remove literally every floating point use.

The example I posted is from AngelTomkins (they also wrote a version for converting floats back to fixed point) and was written when we were checking to see if it was possible to replace some of the sine/cosine functions that rely on LUTs and float polynomials (sm64 has two different sine implementations for some reason) with fixed point versions based on some extremely fast assembly and convert that fixed point number to a float faster than a LUT table lookup from main ram. The function is apparently 2x as fast as libnds' default fixed to float conversion, but only works with fixed point numbers that use 2^-12 as the quantization units and probably only works in a somewhat narrow range, but for sine you only need -1 to 1 .

In general the equation that you need to solve to convert a fixed point number into a float is

fixedpointNumberquantizationunit (e.g. 2^-12) = (1+ mantissa/2^23)2^(exponent-127)

with the restriction that the exponent needs to be chosen in a way such that the mantissa is a positive number. This is mostly doable with shifts, adds, logical operations and a custom arm instruction that counts leading zeroes in a number, instead of a floating point multiply.