beehive-lab / mambo

A low-overhead dynamic binary instrumentation and modification tool for ARM (both AArch32 and AArch64 support) and RISC-V (RV64GC).
Apache License 2.0
318 stars 69 forks source link

A64: Add support ARMv9.0 instructions #100

Open ijan1 opened 10 months ago

ijan1 commented 10 months ago

Support for instructions up to ARMv9.0, most notably SVE and SVE2.

However, there are a few issues yet to be fixed. Firstly, on a system with a Vector Length (VL) of more than 128-bits it doesn't work. This has been tested with QEMU and natively on an AWS Graviton machine(c7g.4xlarge). The dispatcher for some reason takes the wrong path on a 'CBZ' instruction at around basic block 152. This leads to early program termination with a negative return status.

Secondly, it only works with a VL of 128-bits on a dynamically compiled (target) binary. Statically compiled ones just SEGFAULT.

Thirdly, having the instructions for pushing and popping the predicate registers makes it so that nothing works at all. Regardless of VL and how the (target) binary was compiled (dynamically or statically). I was thinking of disabling them temporarily because compiling Mambo with support for SVE and SVE2 instructions doesn't insert any such instructions, but that's not a guarantee with changes in compiler and/or compiler version.

With a VL of 128-bits (using QEMU), Highway[1] and ARMRAL[2] pass all SVE and SVE2 test.

I have attached a spreadsheet[3] with the SVE/SVE2 instructions added for easier visualisation and some notes on which ones I think might need a bit more work.

[1]https://github.com/google/highway [2] https://developer.arm.com/documentation/102249/2310/?lang=en [3] https://docs.google.com/spreadsheets/d/1jABn1_99A2nKaIqYgZTg_JrHdmxqhq-r4T9hTIFiKo0/edit?usp=sharing

IgWod commented 10 months ago

Hi @ijan1,

Thank you for your contribution.

The code changes look good to me, but before we merge, let's discuss points you've raised.

Firstly, on a system with a Vector Length (VL) of more than 128-bits it doesn't work.

At the start-up, MAMBO should check the vector length supported by the system, and quit if it's not 128-bits. Not having a general support it's okay, but should be signaled. But before fixing that, take a look at my comments regarding the predicates.

Secondly, it only works with a VL of 128-bits on a dynamically compiled (target) binary. Statically compiled ones just SEGFAULT.

Again, we should have a check for that. If SVE is detected, the loader should check if loaded binary is static and quit with a message.

Thirdly, having the instructions for pushing and popping the predicate registers makes it so that nothing works at all.

This is the biggest issue at the moment, as we get into it may work ground or may not ground. And it seems more and more to me, the failure is not SVE specific, but rather relates to push_neon/push_simd using more than 512 bytes. We could do a quick check: let's have a neon application (no SVE) and just push some extra stuff on the stack in push_neon (and pop accordingly), and see if it works. If it fails, then that means the problem is with the stack and some assumptions being baked in MAMBO with respect to push_neon/pop_neon. So, let's talk about it offline and have one more go, before going ahead with the merge, and finding some temporary solution to the issue.

IgWod commented 10 months ago

Me again,

I had another look at the aarch64 dispatcher, and found something interesting:

.global syscall_wrapper
.global syscall_wrapper_svc
syscall_wrapper:
  STP X30, X29, [SP, #-16]!
  BL push_x4_x21
  STP X0, X1, [SP, #-32]!
  STP X2, X3, [SP, #16]
  BL push_neon

  MRS X19, NZCV
  MRS X20, FPCR
  MRS X21, FPSR

  MOV X0, X8
  ADD X1, SP, #512
  MOV X2, X29
  LDR X3, disp_thread_data
  LDR X4, syscall_handler_pre_addr

  BLR X4

  CBZ X0, s_w_r

  ADD X9, SP, #512
  LDP X0, X1, [X9, #0]
  LDP X2, X3, [X9, #16]
  LDP X4, X5, [X9, #32]
  LDP X6, X7, [X9, #48]
  LDR X8,     [X9, #64]

  // Balance the stack on rt_sigreturn, which doesn't return here
  CMP X8, #0x8b
  BNE svc
  ADD SP, SP, #(64 + 144 + 512)

svc: SVC 0
syscall_wrapper_svc:
  ADD X1, SP, #512
  STR X0, [X1, #0]
  MOV X0, X8
  MOV X2, X29
  LDR X3, disp_thread_data
  LDR X4, syscall_handler_post_addr
  BLR X4

s_w_r:
  BL pop_neon
  MSR NZCV, X19
  MSR FPCR, X20
  MSR FPSR, X21

  LDP X2, X3, [SP, #16]
  LDP X0, X1, [SP], #32
  BL pop_x4_x21
  LDP X29, X30, [SP, #16]
  STP X0, X1, [SP, #16]
  LDP X0, X1, [SP], #16

  B checked_cc_return

See how 512 repeats multiple times in the context of the stack, and it also calls push_neon/pop_neon... And the problem with SVE keeps appearing when we push more than 512-bytes. Did you already look into it? If not, then this maybe where the problem is.