Vector35 / binaryninja-api

Public API, examples, documentation and issues for Binary Ninja
https://binary.ninja/
MIT License

Create workflow for removing Swift pointer encoding #3902

Open comex opened 1 year ago

comex commented 1 year ago

Version and Platform (required):

Bug Description: Swift (at least on arm64 macOS) has an odd way of referring to string literals. Here is the original assembly produced by swiftc:

    adrp    x8, l_.str@PAGE
    add x8, x8, l_.str@PAGEOFF
    sub x8, x8, #32
    orr x1, x8, #0x8000000000000000

Or as disassembled by Binary Ninja (the string ended up at 0x100103f70):

100003f54  08080090   adrp    x8, 0x100103000
100003f58  08c13d91   add     x8, x8, #0xf70
100003f5c  088100d1   sub     x8, x8, #0x20
100003f60  010141b2   orr     x1, x8, #0x8000000000000000  {0x8000000100103f50}

The problem is that Binary Ninja doesn't create an xref to 0x100103f70, presumably because it emulates the whole sequence of operations and ends up with 0x8000000100103f50.

Using the decompiler for xrefs is often helpful, but here it's counterproductive compared to a more naive approach of looking for adrp/add pairs.

Ideally, Binary Ninja would be able to identify these references.
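To make the arithmetic concrete, here is a minimal sketch that steps through the four instructions by hand, using the page and page-offset values from the disassembly above. The function name `emulate_sequence` is made up for illustration and is not part of any Binary Ninja API. It shows that the string address only exists as an intermediate value of x8, and that the final constant no longer points at the string.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1

def emulate_sequence(page: int, page_off: int) -> Tuple[int, int]:
    """Return (string_address, final_x1) for the adrp/add/sub/orr sequence."""
    x8 = page                       # adrp x8, l_.str@PAGE
    x8 = (x8 + page_off) & MASK64   # add  x8, x8, l_.str@PAGEOFF
    string_address = x8             # the value a naive adrp/add matcher would see
    x8 = (x8 - 0x20) & MASK64       # sub  x8, x8, #32
    x1 = x8 | 0x8000000000000000    # orr  x1, x8, #0x8000000000000000
    return string_address, x1

addr, x1 = emulate_sequence(0x100103000, 0xF70)
print(hex(addr), hex(x1))  # 0x100103f70 0x8000000100103f50
```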

Steps To Reproduce:

Disassemble this test binary and go to the __cstring section. Note that there is no reference to the string.

This corresponds to the following source code:

public func get_string() -> String {
    return "this is a long string so it doesn't get small-string optimized"
}

Note that I had to add a bunch of padding between the code and the string. Without this, the linker will replace the adrp/add pair with adr/nop, and Binary Ninja does identify the reference in that case.

Additional Information: There is nothing meaningful located 0x20 bytes before the string (the string is at the very start of the section), so the subtraction of 0x20 is just part of some pointer encoding scheme, along with the OR of 0x8000000000000000. Not sure about the details of this scheme.

plafosse commented 1 year ago

OK, I looked into this. I think the way to handle it is with a workflow that understands the Swift-specific string pointer encoding. It would effectively rewrite the referenced function to be:

100003f54  08080090   adrp    x8, 0x100103000
100003f58  08c13d91   add     x8, x8, #0xf70  {data_100103f70, "this is a long string so it doesn't get small-string optimized"}
100003f5c  1f2003d5   nop     
100003f60  e10308aa   mov     x1, x8  {data_100103f70, "this is a long string so it doesn't get small-string optimized"}
100003f64  c00780d2   mov     x0, #0x3e
100003f68  0000faf2   movk    x0, #0xd000, lsl #0x30  {0xd00000000000003e}
100003f6c  c0035fd6   ret     

The workflow would do this on LLIL or MLIL rather than operating on the assembly directly.
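As a rough illustration of what such an IL-level rewrite could look like, here is a sketch on a toy expression tree. The `Const`, `Sub`, and `Or` classes and the `strip_swift_encoding` function are hypothetical stand-ins, not Binary Ninja's LLIL/MLIL classes or its workflow API. The constants assume a 64-bit, non-Android target. A real workflow activity would have to match the equivalent IL operations, or the already-folded constant.

```python
from dataclasses import dataclass
from typing import Union

NATIVE_BIAS = 0x20                       # assumed: 64-bit nativeBias
LARGE_IMMORTAL = 0x8000_0000_0000_0000   # assumed: non-Android discriminator

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Sub:
    left: "Expr"
    right: "Expr"

@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"

Expr = Union[Const, Sub, Or]

def strip_swift_encoding(expr: Expr) -> Expr:
    """Rewrite (addr - NATIVE_BIAS) | LARGE_IMMORTAL to Const(addr).

    Any expression that does not have exactly this shape is returned unchanged.
    """
    if (isinstance(expr, Or)
            and expr.right == Const(LARGE_IMMORTAL)
            and isinstance(expr.left, Sub)
            and expr.left.right == Const(NATIVE_BIAS)
            and isinstance(expr.left.left, Const)):
        return expr.left.left
    return expr

encoded = Or(Sub(Const(0x100103F70), Const(0x20)), Const(0x8000000000000000))
print(strip_swift_encoding(encoded))
```

After the rewrite, the expression is a plain constant pointing at the string, which is what would let ordinary xref generation pick it up.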

zhangyoufu commented 11 months ago

https://github.com/apple/swift/blob/8b40353e22fdcc75f9bd8c172ee3ce1067f5c810/stdlib/public/core/StringObject.swift#L339-L343

Native strings have tail-allocated storage, which begins at an offset of nativeBias from the storage object's address. String literals, which reside in the constant section, are encoded as their start address minus nativeBias, unifying code paths for both literals ("immortal native") and native strings. Native Strings are always managed by the Swift runtime.

https://github.com/apple/swift/blob/8b40353e22fdcc75f9bd8c172ee3ce1067f5c810/stdlib/public/core/StringObject.swift#L692-L694

b61: isNativelyStored. set for native stored strings

  • largeAddressBits holds an instance of _StringStorage.
  • I.e. the start of the code units is at the stored address + nativeBias

https://github.com/apple/swift/blob/8b40353e22fdcc75f9bd8c172ee3ce1067f5c810/stdlib/public/core/StringObject.swift#L439-L447

  internal static var nativeBias: UInt {
#if _pointerBitWidth(_64)
    return 32
#elseif _pointerBitWidth(_32)
    return 20
#else
#error("Unknown platform")
#endif
  }

https://github.com/apple/swift/blob/8b40353e22fdcc75f9bd8c172ee3ce1067f5c810/include/swift/AST/Builtins.def#L290-L300

/// valueToBridgeObject(x) === (x << _swift_abi_ObjCReservedLowBits) |
///     _swift_BridgeObject_TaggedPointerBits

l0psec commented 1 month ago

To add to the conversation: the "0x8000000000000000" in the most significant bits of the bridge object, which identifies these large immortal strings, is defined here:

// Discriminator for large, immortal, swift-native strings
  @inlinable @inline(__always)
  internal static func largeImmortal() -> UInt64 {
#if os(Android) && arch(arm64)
    return 0x0080_0000_0000_0000
#else
    return 0x8000_0000_0000_0000
#endif
  }

An example from the decompilation of a sample I looked at recently:

100002fec          // /Users/Shared/1.zip
100002fec          URL.init(fileURLWithPath:)(0xd000000000000013, 0x8000000100003cc0)

I wrote a quick, small script, starting with the function below, that parses these values, adds the bias (+0x20), passes the resulting address to bv.get_string_at(), and writes a comment at the caller. Having this built into the workflow would be great, though.

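from binaryninja.highlevelil import HighLevelILCall

def find_calls(i):
    # Return the instruction if it is an HLIL call; otherwise fall through to None.
    match i:
        case HighLevelILCall():
            return i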
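The decoding step itself can be sketched without Binary Ninja. The function below is illustrative, not the script described above. It assumes a 64-bit, non-Android target, that the discriminator occupies the top nibble of the bridge object, and that the low 48 bits of the first argument hold the byte count. The count assumption is not stated in this thread, but it matches both examples here (0x13 for the 19-byte path, 0x3e for the 62-byte literal).

```python
from typing import Optional, Tuple

NATIVE_BIAS = 0x20                         # 64-bit nativeBias
LARGE_IMMORTAL_NIBBLE = 0x8                # top nibble of 0x8000_0000_0000_0000
ADDRESS_MASK = (1 << 60) - 1               # assumption: address is the low 60 bits
COUNT_MASK = (1 << 48) - 1                 # assumption: count is the low 48 bits

def decode_large_immortal(count_and_flags: int,
                          bridge_object: int) -> Optional[Tuple[int, int]]:
    """Return (string_address, byte_count), or None if not a large immortal string."""
    if bridge_object >> 60 != LARGE_IMMORTAL_NIBBLE:
        return None
    address = (bridge_object & ADDRESS_MASK) + NATIVE_BIAS
    count = count_and_flags & COUNT_MASK
    return address, count

# Arguments taken from the URL.init(fileURLWithPath:) call above.
address, count = decode_large_immortal(0xD000000000000013, 0x8000000100003CC0)
print(hex(address), count)  # 0x100003ce0 19
```

Inside Binary Ninja, the resulting address could then be handed to bv.get_string_at(), or the bytes read directly with bv.read(address, count).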

Small string parsing would be nice as well. Small immortal strings are passed like this:

1000032a4      String.append(_:)(0x65676465682e, 0xe600000000000000)
1000032bc      String.append(_:)(0x6376676f68, 0xe500000000000000)

These are ASCII because the bridge object starts with 0xe, which is 0b1110 and matches this chart:

[Image: Screenshot 2024-08-02 at 1 59 02 PM]

If the string is more than 8 bytes, the remaining bytes spill into the low bytes of the bridge object.
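A minimal sketch of small-string decoding, under the assumption that the low nibble of the top byte is the byte count (it is 6 and 5 in the two calls above, matching the string lengths) and that the code units are stored little-endian across the two words. The helper name `decode_small_string` is made up for illustration. Only the 0xe (small ASCII) discriminator seen in this thread is handled.

```python
from typing import Optional

SMALL_ASCII_NIBBLE = 0xE  # 0b1110, as in the examples above

def decode_small_string(word0: int, word1: int) -> Optional[str]:
    """Decode a small ASCII Swift string passed as two 64-bit words."""
    top_byte = word1 >> 56
    if top_byte >> 4 != SMALL_ASCII_NIBBLE:
        return None                      # other discriminators are not handled here
    count = top_byte & 0xF               # assumption: low nibble is the byte count
    # Bytes 0-7 come from the first word, bytes 8-14 from the bridge object.
    raw = word0.to_bytes(8, "little") + word1.to_bytes(8, "little")
    return raw[:count].decode("ascii")

print(decode_small_string(0x65676465682E, 0xE600000000000000))  # .hedge
print(decode_small_string(0x6376676F68, 0xE500000000000000))    # hogvc
```

Since the count is at most 15, the slice never reaches the top byte that holds the discriminator and count.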