dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Signed byte is incorrectly transferred over interop on iOS/Catalyst ARM64 #100891

Open dotMorten opened 7 months ago

dotMorten commented 7 months ago

Description

When sending an sbyte to a C++ library (int8_t), on iOS and MacCatalyst ARM64, the sbyte is not sent across correctly.

Reproduction Steps

  1. Unzip the attached project: SbyteRepro.zip
  2. Open the project MauiTestApp\MauiTestApp.csproj and run it on an iOS Device or iOS ARM64 simulator
  3. Click the button
  4. Notice that the button shows MinusFive=251. This was expected to show -5.

That value comes from sending an sbyte with value -5 to native code, having the native code convert it to an int32, and returning it. Because something goes wrong at the interop level, the value arrives as 251. This seems to affect only sbyte values going into native code - the second part of the repro tests signed bytes coming out of native code, and those handle negative values just fine.

A Windows version of the repro is also provided (I tested Windows ARM64 and Android ARM64 and these don't exhibit this behavior).

C++ code:

DLLEXPORT int32_t WINAPI SByteToInt(int8_t value)
{
  return static_cast<int32_t>(value);
}

C# code:

[LibraryImport("__Internal")]
private static partial Int32 SByteToInt(sbyte value);

Expected behavior

Signed bytes get sent across the interop boundary correctly on iOS ARM64, as they do on other platforms.

Actual behavior

-5 gets incorrectly transferred, arriving in native code as 251.

Regression?

No response

Known Workarounds

Don't use sbyte; transfer the value as int32_t instead and cast it once in native code.

Configuration

.NET 8 ARM64 iOS and Catalyst

Other information

Interesting details for Apple ARM64 ABI: https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms#Pass-arguments-to-functions-correctly

The caller of a function is responsible for signing or zero-extending any argument with fewer than 32 bits. The standard ABI expects the callee to sign or zero-extend those arguments. The C language requires the promotion of arguments smaller than int before a call.

I wonder if .NET zero-extends our sbyte instead of sign-extending it. -5 as a two's-complement int8_t is 1111 1011; zero-extended to int32_t that becomes 0000 0000 0000 0000 0000 0000 1111 1011, which is 251.

huoyaoyuan commented 7 months ago

int8_t and int32_t are both signed. I'd expect static_cast<int32_t> to sign-extend.

dotMorten commented 7 months ago

@huoyaoyuan the static_cast code was just to show what happens to a round-tripped value, but we have code deeper down that checks ranges of the sbyte (in this specific case -12..12) and fails for all negative numbers. The issue only occurred on iOS and Catalyst when running on ARM64; Windows ARM64 and Android ARM64 are just fine.

rolfbjarne commented 7 months ago

I wonder if the same thing happens with System.Int16 (short) <-> int16_t?

huoyaoyuan commented 7 months ago

static_cast<int32_t>(value) isn't a NOP on other platforms:

[image: Compiler Explorer disassembly of static_cast<int32_t>(value)]

In other words, these platforms treat the upper bits of an int8_t argument as undefined.

Compiler explorer doesn't provide apple compilers, but I can imagine that they are defining the upper bits to be already sign extended.

dotMorten commented 7 months ago

@rolfbjarne we have not seen this happen with shorts.

jkotas commented 7 months ago

@jakobbotsch Do we have the sign and zero extensions implemented correctly in RyuJIT for Apple Arm64 ABI?

ivanpovazan commented 6 months ago

@jakobbotsch Do we have the sign and zero extensions implemented correctly in RyuJIT for Apple Arm64 ABI?

Seems like we do. I tested the SByteToInt conversion with NativeAOT on iOS and it returns the expected result. @dotMorten did you have a chance to try out NativeAOT on iOS with your projects?


Regarding Mono, the problem seems reproducible only if the native library is built with -Os:

clang++ -c -o lib.o lib.cpp -isysroot /Applications/Xcode_15.2.0.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS17.2.sdk -Os

Removing the size optimization flag when building the native library makes Mono produce the expected result as well.


We should look into why Mono fails when the library is built with -Os

jakobbotsch commented 6 months ago

I don't see where RyuJIT takes explicit care of this normalization. RyuJIT in general keeps all values in registers normalized up to 32 bits, which makes it very hard/impossible to hit issues around this with IL created from C#. But you can hit this with RyuJIT as well for IL based programs on both arm32 (which has the same ABI requirement) and apple arm64. I opened #101046 about it.

lambdageek commented 2 months ago

Looking at what Apple clang produces with -S -emit-llvm, it looks like the initial LLVM IR doesn't need to include the sign extension explicitly.

This:

#include <stdint.h>

int32_t foo (int8_t i)
{
    return i;
}

int32_t bar (int8_t i);

int32_t call_bar (int8_t i, int8_t j)
{
    return bar(i + j);
}

Produces the following LLVM IR with clang -Os -S -emit-llvm /tmp/foo.c -o foo.S -isysroot /Applications/Xcode-15.4.0.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS17.5.sdk:

; ModuleID = '/tmp/foo.c'
source_filename = "/tmp/foo.c"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-ios17.5.0"

; Function Attrs: mustprogress nofree norecurse nosync nounwind optsize readnone ssp willreturn uwtable(sync)
define i32 @foo(i8 noundef signext %0) local_unnamed_addr #0 {
  %2 = sext i8 %0 to i32
  ret i32 %2
}

; Function Attrs: nounwind optsize ssp uwtable(sync)
define i32 @call_bar(i8 noundef signext %0, i8 noundef signext %1) local_unnamed_addr #1 {
  %3 = add i8 %1, %0
  %4 = tail call i32 @bar(i8 noundef signext %3) #3
  ret i32 %4
}

; Function Attrs: optsize
declare i32 @bar(i8 noundef signext) local_unnamed_addr #2

attributes #0 = { mustprogress nofree norecurse nosync nounwind optsize readnone ssp willreturn uwtable(sync) "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-a7" "target-features"="+aes,+crypto,+fp-armv8,+neon,+sha2,+v8a,+zcm,+zcz" }
attributes #1 = { nounwind optsize ssp uwtable(sync) "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-a7" "target-features"="+aes,+crypto,+fp-armv8,+neon,+sha2,+v8a,+zcm,+zcz" }
attributes #2 = { optsize "frame-pointer"="non-leaf" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-a7" "target-features"="+aes,+crypto,+fp-armv8,+neon,+sha2,+v8a,+zcm,+zcz" }
attributes #3 = { nounwind optsize }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 17, i32 5]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 1}
!4 = !{i32 7, !"frame-pointer", i32 1}
!5 = !{!"Apple clang version 15.0.0 (clang-1500.3.9.4)"}

You get nearly the same IR with Apple clang targeting the standard ARM64 ABI (note that the signext attributes are absent): clang -Os -S -emit-llvm /tmp/foo.c -o foo.S --target=arm64-linux-gnu

; ModuleID = '/tmp/foo.c'
source_filename = "/tmp/foo.c"
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "arm64-unknown-linux-gnu"

; Function Attrs: mustprogress nofree norecurse nosync nounwind optsize readnone willreturn uwtable
define dso_local i32 @foo(i8 noundef %0) local_unnamed_addr #0 {
  %2 = sext i8 %0 to i32
  ret i32 %2
}

; Function Attrs: nounwind optsize uwtable
define dso_local i32 @call_bar(i8 noundef %0, i8 noundef %1) local_unnamed_addr #1 {
  %3 = add i8 %1, %0
  %4 = tail call i32 @bar(i8 noundef %3) #3
  ret i32 %4
}

; Function Attrs: optsize
declare i32 @bar(i8 noundef) local_unnamed_addr #2

attributes #0 = { mustprogress nofree norecurse nosync nounwind optsize readnone willreturn uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon,+v8a" }
attributes #1 = { nounwind optsize uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon,+v8a" }
attributes #2 = { optsize "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon,+v8a" }
attributes #3 = { nounwind optsize }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 1}
!5 = !{!"Apple clang version 15.0.0 (clang-1500.3.9.4)"}

However the final assembly differs.

Apple ABI: clang -Os -S /tmp/foo.c -o foo.S -isysroot /Applications/Xcode-15.4.0.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS17.5.sdk --target=arm64-apple-ios17.5.0

    .section    __TEXT,__text,regular,pure_instructions
    .build_version ios, 17, 5   sdk_version 17, 5
    .globl  _foo                            ; -- Begin function foo
    .p2align    2
_foo:                                   ; @foo
    .cfi_startproc
; %bb.0:
    ret
    .cfi_endproc
                                        ; -- End function
    .globl  _call_bar                       ; -- Begin function call_bar
    .p2align    2
_call_bar:                              ; @call_bar
    .cfi_startproc
; %bb.0:
    add w8, w1, w0
    sxtb    w0, w8
    b   _bar
    .cfi_endproc
                                        ; -- End function
.subsections_via_symbols

Linux ABI: clang -Os -S /tmp/foo.c -o foo.S -isysroot /Applications/Xcode-15.4.0.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS17.5.sdk --target=arm64-linux-gnu

    .text
    .file   "foo.c"
    .globl  foo                             // -- Begin function foo
    .p2align    2
    .type   foo,@function
foo:                                    // @foo
    .cfi_startproc
// %bb.0:
    sxtb    w0, w0
    ret
.Lfunc_end0:
    .size   foo, .Lfunc_end0-foo
    .cfi_endproc
                                        // -- End function
    .globl  call_bar                        // -- Begin function call_bar
    .p2align    2
    .type   call_bar,@function
call_bar:                               // @call_bar
    .cfi_startproc
// %bb.0:
    add w0, w1, w0
    b   bar
.Lfunc_end1:
    .size   call_bar, .Lfunc_end1-call_bar
    .cfi_endproc
                                        // -- End function
    .ident  "Apple clang version 15.0.0 (clang-1500.3.9.4)"
    .section    ".note.GNU-stack","",@progbits
    .addrsig

Conclusion: we're probably not passing the correct target triple to something in our backend, either opt, llc, or clang.

ivanpovazan commented 2 months ago

Investigation

I think the problem is that we are not decorating parameters/arguments narrower than 32 bits with the correct LLVM IR attributes.

If we consider a simple caller.cpp:

#include <cstdint>
extern "C" {int32_t SByteToInt(int8_t value);}
int main() {
    int result = SByteToInt(-5);
    return result;
}

Compiled with clang++ -S -emit-llvm -o caller.S caller.cpp -isysroot /Applications/Xcode_15.2.0.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS17.2.sdk we get the following IR, where the parameter and argument have the signext attribute associated with them:

; Function Attrs: noinline norecurse optnone ssp uwtable(sync)
define i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  store i32 0, ptr %1, align 4
  %3 = call i32 @SByteToInt(i8 noundef signext -5)
  store i32 %3, ptr %2, align 4
  %4 = load i32, ptr %2, align 4
  ret i32 %4
}

declare i32 @SByteToInt(i8 noundef signext) #1

This gives the following assembly:

_main:
0000000000000000    sub sp, sp, #0x20
0000000000000004    stp x29, x30, [sp, #0x10]
0000000000000008    add x29, sp, #0x10
000000000000000c    stur    wzr, [x29, #-0x4]
0000000000000010    mov w0, #-0x5 // <-- correct
0000000000000014    bl  0x14

However, when Mono with LLVM P/Invokes the same function, we generate:

; Function Attrs: noinline
define hidden monocc i32 @Program_Program_EONEConvert() #2 gc "coreclr" !dbg !89 {
...
gc.safepoint_poll.exit:                           ; preds = %BB0, %gc.safepoint_poll.poll.i
  %2 = notail call monocc i32 @p_8_plt_Program_Program_SByteToInt_sbyte_llvm(i8 -5), !dbg !90, !managed_name !91
  ret i32 %2, !dbg !92
}

declare hidden i32 @p_8_plt_Program_Program_SByteToInt_sbyte_llvm(i8) local_unnamed_addr #0

which gives following assembly:

_Program_Program_MyConvert:
0000000000000290    stp    x29, x30, [sp, #-0x10]!
0000000000000294    adrp    x8, 0 ; 0x0
0000000000000298    ldr    x8, [x8]
000000000000029c    ldr    x8, [x8]
00000000000002a0    cbnz    x8, 0x2b4
00000000000002a4    mov    w0, #0xfb // <-- wrong
00000000000002a8    bl    0x2a8

Solution

Adapt https://github.com/dotnet/runtime/blob/main/src/mono/mono/mini/mini-llvm.c#L2333 (and other places) to properly decorate parameters/arguments with the correct set of LLVM parameter attributes: https://llvm.org/docs/LangRef.html#parameter-attributes


FWIW, using the mini JIT backend alone produces the correct result:

Program_EONEConvert:
00000001003e8280    stp x29, x30, [sp, #-0x10]!
00000001003e8284    mov x29, sp
00000001003e8288    adrp    x16, 2243 ; 0x100cab000
00000001003e828c    add x16, x16, #0xa50
00000001003e8290    ldr x0, [x16, #0x38]
00000001003e8294    ldr x17, [x0]
00000001003e8298    cbz x17, 0x1003e82a0
00000001003e829c    bl  plt
00000001003e82a0    mov x0, #-0x5
00000001003e82a4    bl  plt_Program_SByteToInt_sbyte