M1/M2 Compatibility - Githubissues

gilaroni commented 1 year ago

will there be a version for m1?

LouisBrunner commented 1 year ago

Hi @gilaroni,

I have started on M1 support, it might take a bit but it's on the roadmap, yes.

MalwarePup commented 1 year ago

Hello, any news here?

eliesaikali commented 1 year ago

Any news here, its always not possible to install valgrind ....

brew install --HEAD LouisBrunner/valgrind/valgrind ==> Fetching louisbrunner/valgrind/valgrind ==> Cloning https://github.com/LouisBrunner/valgrind-macos.git Updating /Users/xxx/Library/Caches/Homebrew/valgrind--git ==> Checking out branch main Already on 'main' Your branch is up to date with 'origin/main'. HEAD is now at ee485f9ab docs: Update README for Homebrew error (#72) ==> Installing valgrind from louisbrunner/valgrind ==> ./autogen.sh ==> ./configure --prefix=/opt/homebrew/Cellar/valgrind/HEAD-ee485f9 --enable-only64bit --build=amd64-darwin ==> make Last 15 lines from /Users/xxx/Library/Logs/Homebrew/valgrind/03.make: fixup_macho_loadcmds.c:465:22: error: use of undeclared identifier 'x86_thread_state64_t' = (x86_thread_state64_t)(&w32s[2]); ^ fixup_macho_loadcmds.c:467:36: error: no member named 'rsp' in 'struct darwin_arm_thread_state64'; did you mean 'sp'? init_rsp = state64->rsp; ^~~~~ sp /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/mach/arm/_structs.h:141:13: note: 'sp' declared here __uint64_t __sp; / Stack pointer x31 */ ^ 7 errors generated. make[2]: [fixup_macho_loadcmds] Error 1 make[2]: Waiting for unfinished jobs.... make[1]: [all-recursive] Error 1 make: [all] Error 2

If reporting this issue please do so at (not Homebrew/brew or Homebrew/homebrew-core): https://github.com/louisbrunner/homebrew-valgrind/issues

valgrind's formula was built from an unstable upstream --HEAD. This build failure is expected behaviour. Do not create issues about this on Homebrew's GitHub repositories. Any opened issues will be immediately closed without response. Do not ask for help from Homebrew or its maintainers on social media. You may ask for help in Homebrew's discussions but are unlikely to receive a response. Try to figure out the problem yourself and submit a fix as a pull request. We will review it but may or may not accept it.

TommyJD93 commented 1 year ago

any update for the M1/M2 compatibility?

hacknus commented 1 year ago

ran into the same error as @eliesaikali on my M2 Pro MacBook Pro running macOS Ventura 13.2.1 (22D68).

whisper-bye commented 1 year ago

+1

carl-alphonce commented 1 year ago

M1 Pro MacBook Pro running macOS Venture 13.3.1. Same issue as @eliesaikali and @hacknus.

fakecore commented 1 year ago

Any details about the roadmap of Valgrind runs in the M1 arch?

zakariazh commented 1 year ago

any update for the M1/M2 compatibility?

moliqingwa commented 1 year ago

+1

JoonasMykkanen commented 1 year ago

Still need help?

kalip2 commented 1 year ago

Would really like Valgrind w/ MacOS Silicon for school purposes. We're using it at school!

brew install --HEAD LouisBrunner/valgrind/valgrind                  6s
Error: Valgrind is currently incompatible with ARM-based Macs, see https://github.com/LouisBrunner/valgrind-macos/issues/56

kalip2 commented 1 year ago

Pleeeassseee bump this up. We're using valgrind in our cs101 class to help with memory leaks in projects related to singly linked lists and pointers. We're also going to need valgrind in our upcoming cs102 class. I don't mind using our college's ubuntu desktops with valgrind, but the labs close at 10pm, and a lot of us do our best studying after 10pm.

paulfloyd commented 1 year ago

Speaking from experience, a huge amount of work is needed to get things running smoothly.

MartinDelille commented 1 year ago

What are the alternative to valgrind on MacOS what would be usable in a continuous integration environment ?

eliesaikali commented 1 year ago

What are the alternative to valgrind on MacOS what would be usable in a continuous integration environment ?

@MartinDelille You can use leaks if you like on MacOS

MartinDelille commented 1 year ago

I wasn't aware of this xcode tool! Thanks a lot! 👌

MalwarePup commented 1 year ago

Speaking from experience, a huge amount of work is needed to get things running smoothly.

Yes probably, but some people want to contribute like @JoonasMykkanen, but he doesn't get any answer

LouisBrunner commented 1 year ago

Hi,

Sorry for the lack of update, I have been away for the last month.

As @paulfloyd said, this is not an easy issue to resolve, it's something I have been working on for nearly 3 years (initial push before M1 release in 2020, despair through 2021-2022 as it seemed unfixable, started back up this year) and very actively trying to fix since earlier this year.

You can check my progress on my working branch here: https://github.com/LouisBrunner/valgrind-macos/tree/feature/m1

Here is a breakdown on the different issues to be resolved.

TL;DR: I am still working on it, it's hard work and I don't see it completed before end-of-year, join the Discord server (https://discord.gg/mU9FG3T5jF) if you want to contribute

Compilation: static vs dynamic executables

The first massive hurdle which took me so long to get around is just building Valgrind correctly.

Some context: XNU arm64 has constraints that are massively more restrictive than amd64, I imagine that this is because it is used for iOS which is a much more locked down system (but macOS is going that way too).

Valgrind's tools like memcheck or helgrind are built as static binaries. This is to be able to place Valgrind's code and stack away from the default place where it is normally put in memory. The main reason for that is so there is no conflict when we load the guest binary (e.g. ls), at least from what I gathered from the comments in the code. XNU arm64 doesn't allow static binaries. That is currently a hard requirement of Valgrind that will never be allowed.

So I have been trying a lot of different methods to circumvent that, all of them basically turn around the idea of building Valgrind's tools (either statically or dynamically) then editing the binary to make it closer to what we want. However, this comes with massive drawbacks and issues, most of which I am yet to resolve fully. I have been using a tool called LIEF to do that (https://github.com/lief-project/LIEF) and it works quite well but I am still dependent on an unreleased version (v.1.4.0) as I had to PR a bunch of fixes to be able to do what I want to Valgrind.

Build statically and make dynamic

Fairly straight-forward, check coregrind/make_binary_dynamic_darwin.py for details:

make the binary PIE: anything else is forbidden by the kernel
add a DYLINKER entry: required in PIE
add an empty DySymTab: dyld crashes otherwise
add a Main entry: dyld doesn't load your binary otherwise
remove the UnixThread entry: can't have both Main and UnixThread

Pros:

Still build Valgrind tools statically
Not a lot of steps

Cons:

Stack is in the wrong place (defined by ASLR instead of us)
Have to "reverse-engineer" what a valid dynamic executable look like
Random crashes (C-strings not in the right places, I think depends on where the stack gets put)

Build dynamically and adapt to our purposes

Bit more complex, check coregrind/fixup_dynamic_binary_darwin.py for details:

Remove a bunch of extra sections and segments we don't need
Remove libSystem.B.dylib
Move every segment and section we want to keep by a certain offset some Valgrind is not loaded at the default address

Pros:

Don't have to "reverse-engineer" a valid dynamic executable

Cons:

Stack is in the wrong place (defined by ASLR instead of us)
Loads of fixups, dynamic executables have a lot more stuff included
Have to build with libSystem.B.dylib (just to remove it later)
Crashes (when allocating memories, but only sometimes)

Summary

I basically use a combination of both of those to be able to get further within Valgrind's execution and fix later bug. I don't really see how to make Valgrind stable in that aspect apart from building Valgrind dynamically and somehow find some hacks to address all the issues that come from that.

Code: reimplementing massive parts

Back in 2020, I used the iOS SDK to start building Valgrind without a M1 computer. I also used https://github.com/tyrael9/valgrind-ios as an inspiration to do that (I am actually hoping that Valgrind might be able to run on the latest iOS if M1 support goes through but this is in no way a priority or something I spend any energy on). Most of the code writing happened then and there is a lot.

Valgrind has a lot of assembly, most of which is dependent on both architecture (arm64, amd64, etc) and the OS (Linux, FreeBSD, macOS, etc). This is also true of many C code sections which are very keyed to the exact machine Valgrind will be running on. This means entire part of the program needs to be rewritten for macOS arm64.

Here is a (most likely non-exhaustive) list of outstanding issues related to that:

aspacem: I can't find references on how XNU allocates memory and how to divide the memory space between client and Valgrind. It looks like on arm64, macOS will often (always?) allocate memory at very low addresses, which is pretty much the reverse of what amd64 does. Without fixing this, the whole internal memory advisor of Valgrind is completely broken.
dispatch/signals: copied it from the Linux arm64 implementation, probably completely wrong
mach_traps/syswrap/syscall/pub_tool_machine: adapted from the macOS amd64 implementation, not sure if right
sigframe: copied the logic from macOS amd64 and the assembly-specific from Linux arm64, no idea if it works
GET_STARTREGS/get_StackTrace_wrk/N_CFI_REGS: assumed to be the same as Linux arm64, no idea if that's correct
arm64_darwin_REDIR_FOR_strcat/arm64_darwin_REDIR_FOR_strcpy/arm64_darwin_REDIR_FOR_strlcat/arm64_darwin_REDIR_FOR_strchr/fixup_guest_state_after_syscall_interrupted: unimplemented

The main issue with those assumptions is that I have no way to check them and correct them because I still can't run a program in Valgrind.

Running a program

After I recently got a running version of memcheck, I started trying to get a simple program (e.g. ls) to run. Despite everything building, I still had a lot of issues at runtime, here is a few examples:

macOS arm64 doesn't allow read-write-exec memory maps, this is (as many things with Apple) for security reasons: https://developer.apple.com/documentation/apple-silicon/porting-just-in-time-compilers-to-apple-silicon Valgrind makes extensive uses of those RWX mmaps (for similar reasons as well) and that means we need to adapt to those new requirements: we need to manually change if we are allowed to write or exec on each thread, which should be straight-forward but I need to find where to change all those permissions. Moreover, the function to change the permissions is not documented by Apple and we cannot use it due to our whole static executable thing (no system dylib allowed!). So I had to reimplement it in assembly.
Signal handling works differently on arm64 that amd64, with sigreturn having to be called differently (super hard to debug)
Had to reimplement the longjmp mechanism a bunch of times until I found something that worked
machine_get_hwcaps was absolutely not ready for how restrictive Apple is with system registers
Had to split __start_in_C_darwin in 2 as the arguments are passed completely differently between arm64 and x86
Had to reimplement do_syscall_unix_WRK a few times in arm64

Each of those issues can take a few hours to diagnose, debug and fix, which makes for very slow (and depressing) progress.

Currently, Valgrind will run until it starts to map the client binary (e.g. ls) in memory, which fails for various reasons (which also depend how you compile it static vs dynamic base).

So, where are now?

I am still committed to get Valgrind working on arm64. I might not update frequently because it's very slow progress and I often don't have anything to share apart from "it still doesn't work". However I do work regularly on it, sinking hours and hours trying to get this port to work. I wish I could work on Valgrind full-time or 5 days/week, unfortunately it is not possible as it doesn't pay my bills.

I struggle to see how it could be ready before end of the year (but I do hope we will be much closer to a working version then).

Contributing

I would love for people to contribute, and if you are interested (e.g. @JoonasMykkanen), definitely reach out, I just created a Discord server for ease of communication: https://discord.gg/mU9FG3T5jF

However, Valgrind is a complex program with massive codebase, it does very complicated things without any external libraries. But I would be happy to take some extra time to train people so they can help.

Do understand that I just don't know what is broken and what isn't, porting to arm64 is levels of magnitude above any past fixes I had to do (even the dyld cache support for macOS 11 and later). It will take a lot of time and effort.

LouisBrunner commented 11 months ago

Quick update on arm64 support.

I solved a bunch of issues (fixed memory management, fixed mach_traps, finished and tested the arm64 assembly redirs, added support for FEAT_PAuth to Vex) and Valgrind is now starting to run the guest binary (still crashing in dyld for now).

Overall, that's quite positive and we are getting close to an experimental version that might be able to run basic programs on Apple Silicon.

Here are the top priorities to fix (in descending order):

Compilation: still a massive blocker and the next thing to tackle as it now makes progress near impossible
Syscalls: arm64 seems to have a different way from amd64 to handle syscall classes which Valgrind doesn't like, the assembly needs reviewing
Signals/Fork/Threads (syswrap): not been hit yet and most likely broken
Not sure: GET_STARTREGS, get_StackTrace_wrk, N_CFI_REGS, pub_tool_machine

lycfr commented 7 months ago

any update for the M1/M2 compatibility?

LouisBrunner commented 7 months ago

Hi @lycfr,

There still hasn't been any new breakthrough on the compilation issue. So no changes in support so far unfortunately.

paulfloyd commented 6 months ago

I recently got a Raspberry Pi 5 and I've started work on adding support for FreeBSD arm64.

Initially that went well. It didn't take long to get things to compile and to be able to load guest binaries. However I'm now stuck early on in program startup, getting various crashes in the backend generated code. Not sure yet what is causing this - memory layout, TLS.

Anyway, if I get it to work fairly well and merge it upstream it may help a bit with Darwin arm64.

LouisBrunner commented 6 months ago

Quick update, I already posted this on Discord, but I found a nice workaround for the compilation issue and managed to load the guest binary quite reliably.

@paulfloyd Same thing for me. I am getting crashes from the is_imm64_to_ireg_EXACTLY4 assertion and random SIGILLs from the "event check" part of the VEX-translated source block. I don't have a clue where either of those issues comes from at the moment. Definitely keep me updated if you fix anything.

paulfloyd commented 6 months ago

@LouisBrunner OK, very interesting. I'm seeing exactly the same thing.

When the assert happens the guest code is

==11397== at 0x4016BD8: ??? (in /libexec/ld-elf.so.1)

Which is

0x15e38 + 0x1bd8 = 0x17a10

   179e8: aa1803e1      mov     x1, x24
   179ec: f9001118      str     x24, [x8, #0x20]
   179f0: 940030b9      bl      0x23cd4 <_rtld_atfork_post+0x2e98>
   179f4: f9430b57      ldr     x23, [x26, #0x610]
   179f8: 9108f2e9      add     x9, x23, #0x23c
   179fc: f9400128      ldr     x8, [x9]
   17a00: b240010a      orr     x10, x8, #0x1
   17a04: f94417e8      ldr     x8, [sp, #0x828]
   17a08: f900012a      str     x10, [x9]
   17a0c: b40000a8      cbz     x8, 0x17a20 <__tls_get_addr+0xee8>
   17a10: f9400508      ldr     x8, [x8, #0x8]

And the assert in VEX is

#0  chainXDirect_ARM64 (endness_host=VexEndnessLE, place_to_chain=0x10029911e8, 
    disp_cp_chain_me_EXPECTED=0x3817fc1c <vgPlain_disp_run_translations+76>, place_to_jump_to=0x1002991220) at priv/host_arm64_defs.c:6093

Initially I thought that I was getting some address range error and writing into and corrupting the generated code. If you're seeing the same thing that is less likely. I did try Linux arm64 with clang (but using libstdc++ not libc++ - don't think that makes a difference) and I didn't see any errors like this. I'll see if I can use ld.lld.

LouisBrunner commented 6 months ago

It's difficult to pin down because I have many different scenarios.

Running with vgdb

I get SIGILLs during the early stage of dyld setup (around this area https://github.com/apple-oss-distributions/dyld/blob/d1a0f6869ece370913a3f749617e457f3b4cd7c4/dyld/dyldMain.cpp#L1195). The exact guest RIP or instruction is never the same as the crash happens in the VEX-generated code. It's the evCheck which looks something like this:

->  0x700000fb9f30: ldur   w9, [x21, #0x8]
    0x700000fb9f34: subs   w9, w9, #0x1
    0x700000fb9f38: stur   w9, [x21, #0x8]
    0x700000fb9f3c: b.pl   0x700000fb9f48

IIRC the crash is always on the load.

Running with lldb

I get a crash on the is_imm64_to_ireg_EXACTLY4 assert in chainXDirect_ARM64 like you. I think this is because lldb is able to bypass the SIGILLs (sometimes it reports them, sometimes it silences them and sometimes you can bypass them, not sure how/why).

Running directly

A healthy mix of SIGILLs, asserts and sometimes mmap failure (not sure yet why this is so mercurial). What is so odd to me is that it's so consistent. As I am doing this testing, I only get SIGILLs but yesterday when I tried it was only asserts. This morning it was mmap failures.

Summary

While it's really likely we are encountering the same issue, I also have other problems which might be causing this. Or the evCheck probem and the assertion are related somehow.

paulfloyd commented 6 months ago

I also get SIGILLs, for instance if I single step using vgdb or --vex-guest-max-insns=1

This isn't code that I know at all well unfortunately.

This looks too similar for it to be a coincidence. I don't see much in the way of platform dependent stuff in any of the files, so it's a bit of a mystery why there is no problem on Linux. I tried adding an "vassert(False)" in chainXDirect_ARM64 on Linux this morning and it triggered straight away, so the problem isn't that macOS and FreeBSD use that function and not Linux.

paulfloyd commented 6 months ago

If I understand the code correctly it works as follows.

place_to_chain points to generated code that performs the 4 opcode load of x9 and the blr subroutine call to the address in x9.

(gdb) x /5i place_to_chain
   0x1002991468:        mov     x9, #0x1220                     // #4640
   0x100299146c:        movk    x9, #0x299, lsl #16
   0x1002991470:        movk    x9, #0x10, lsl #32
   0x1002991474:        movk    x9, #0x0, lsl #48
   0x1002991478:        br      x9

Then the assert is checking that the address above corresponds to disp_cp_chain_me_EXPECTED

I think that address is 0x1002991220 which contains

(gdb) x /i 0x1002991220
   0x1002991220:        ldur    w9, [x21, #8]

That address should match disp_cp_chain_me_EXPECTED, but that contains something completely different.

(gdb) p disp_cp_chain_me_EXPECTED
$5 = (const void *) 0x38180094 <vgPlain_disp_run_translations+76>

The function in the assert is a bit hard to follow as it doesn't extract the address and compare the two addresses, it generates opcodes from the address and compares the opcodes.

My thoughts at the moment are that this us a problem of matching chainXDirect_ARM64 and unchainXDirect_ARM64. It looks to me like "chain" is being called on instructions that have been already "chained" once but not "unchained".

paulfloyd commented 6 months ago

And there may be something in this bugzilla https://bugs.kde.org/show_bug.cgi?id=412377 since there is some connection between chaining and the icache:

   VexInvalRange vir
       = LibVEX_UnChain( arch_host, endness_host, place_to_patch, 
                         place_to_jump_to_EXPECTED, disp_cp_chain_me );
   VG_(invalidate_icache)( (void*)vir.start, vir.len );

paulfloyd commented 6 months ago

Hmm no Linux doesn't seem to use unchainXDirect_ARM64 but every place_to_chain address is unique. That's not the case on FreeBSD.

LouisBrunner commented 6 months ago

I also get SIGILLs, for instance if I single step using vgdb or --vex-guest-max-insns=1

Good shout, --vex-guest-max-insns=1 is very good at raising SIGILLs within the first few instructions.

This isn't code that I know at all well unfortunately.

Me neither, it's quite tough.

This looks too similar for it to be a coincidence. I don't see much in the way of platform dependent stuff in any of the files, so it's a bit of a mystery why there is no problem on Linux. I tried adding an "vassert(False)" in chainXDirect_ARM64 on Linux this morning and it triggered straight away, so the problem isn't that macOS and FreeBSD use that function and not Linux.

That's true but while using breakpoints in LLDB I was able to run dyld for a while before it crashed, quite far actually. It also crashed for a completely unrelated reason (EXC_BAD_ACCESS). Thus I think something very odd is afoot if slowing down execution/using breakpoints bypassed any such issue.

If I understand the code correctly it works as follows.

Thanks for the deep dive, I haven't looked at the assert much because the SIGILL is much more problematic for me right now.

And there may be something in this bugzilla https://bugs.kde.org/show_bug.cgi?id=412377 since there is some connection between chaining and the icache:

This is also where I have to do special JIT memory permission tricks due to changes in Apple Silicon (https://developer.apple.com/documentation/apple-silicon/porting-just-in-time-compilers-to-apple-silicon#Disable-Write-Protections-Before-You-Generate-Instructions). I wonder if this plays a factor as well.

Hmm no Linux doesn't seem to use unchainXDirect_ARM64 but every place_to_chain address is unique. That's not the case on FreeBSD.

I haven't checked in details but they looked pretty unique on macOS. What do they look like on FreeBSD?

LouisBrunner commented 6 months ago

Quick update: thanks to @paulfloyd we have found the source of the SIGILLs/assertions (which was that cache invalidation for JIT was only enabled for Linux). This particular problem is now fixed on both macOS and FreeBSD. I also fixed some syscall issues and added even more support for new arm64 instructions.

While this is good news as Apple Silicon support is definitely moving forward given that Valgrind is now reliably running part of the guest binaries, it is not yet running them completely without crash. However we have never been this close to getting it working and I am very hopeful for the near future!

rogerburtonpatel commented 6 months ago

I don't have the bandwidth to contribute to this right now-- can I buy you a coffee or some takeout? You're holding all our hopes and dreams.

paulfloyd commented 6 months ago

FreeBSD arm64 is now working reasonably well. No signals or threads just yet.

== 709 tests, 235 stderr failures, 47 stdout failures, 5 stderrB failures, 6 stdoutB failures, 0 post failures ==

julienhouyet commented 6 months ago

Hi, any news for Mac OS ARM ? :)

LouisBrunner commented 6 months ago

Great news, I have managed to run a guest binary on M1 through Valgrind for the first time. But while I am very pleased, let me be absolutely clear: this is in no way stable and can at best be described as an experimental prototype. I just thought it was too much progress not to post it.

If you'd like to test it for yourself, you can clone this repo and build the feature/m1 branch directly. Do not expect it to work, as it probably won't (at the moment only simple Unix utilities, most likely without threads/forks work). If it crashes, do not open a new issue or post a comment here for now (I will close/delete them). Posting on Discord is fine however.

Summary of the roadmap ahead:

~Most binaries failing: need to investigate where those SIGSEGV come from (most likely initializers related) -> seemingly an issue when objc is being used~
~Many mmaps still failing with "Operation not permitted": no clue yet, very inconsistent (maybe some kind of kernel heuristic?)~
~Clearing initializers at startup: currently extremely hardcoded, might be doable using dyld (because Valgrind has this now)~
~mprotect + most errors: need to identify malloc metadata (would love to replicate the vmmap logic)~
~Other errors: combination of dyld shenanigans and other unknown issues~
Other errors: still a lot left, esp with libxpc, libobjC, malloc shenanigans, etc etc etc
VEX PAUTH support: need actual support in place, or maybe passing directly to host
~Malloc/free tracking is broken: just like on macOS 11 x86 when it came out~
~sys_icache_invalidate/___chkstk_darwin: not GPL compatible, will need a complete rewrite~
~MDEP Syscall class: clean up as it doesn't seem to exist in arm~
Threads: currently broken, crash with NSThreadPoisoned
~Fork: needs reviewing, most likely broken~
Signals (syswrap/fixup_guest_state_after_syscall_interrupted): needs reviewing, most likely broken
~GET_STARTREGS, get_StackTrace_wrk, N_CFI_REGS, pub_tool_machine: needs reviewing~
~ALLOW_RWX_WRITE: arguments are probably wrong in place~
Tests: they don't even build yet

Miljoen commented 5 months ago

Thanks for all of the hard work on this @LouisBrunner, getting valgrind to work for Apple Silicon is going to be so great for anyone trying to do development in C/C++ on a Mac.

As it stands I switched to Ubuntu for my tasks that require memchecks, but just know that many of us cannot wait to jump back and check out any first stable version.

paulfloyd commented 5 months ago

And a bit of news from upstream. This morning I pushed the code for FreeBSD arm64. Other than the occasional failure related to setting up memory for new threads (I think) it works pretty well.

Might help a bit with stuff like signal resumption where Darwin and FreeBSD have very similar code.

rdoeffinger commented 4 months ago

Not sure if helpful, but FYI I got something semiworking on a complex app with 2 patches, one that adds a (hackish) DC_ZVA instruction implementation and another that avoids crashes trying to load debug info for a lot a system libraries. mypatch.diff.txt EDIT: Also needs this, as some things assume 64-byte cache lines it seems:

--- a/coregrind/m_machine.c
+++ b/coregrind/m_machine.c
@@ -1859,7 +1859,8 @@ Bool VG_(machine_get_hwcaps)( void )
      vai.hwcaps |= VEX_HWCAPS_ARM64_LRCPC;
      vai.hwcaps |= VEX_HWCAPS_ARM64_DIT;

-     ULong ctr_el0 = 0;
+     // 64 byte cachelines
+     ULong ctr_el0 = 0x00040004;
 #else
      r = VG_(sigprocmask)(VKI_SIG_UNBLOCK, &tmp_set, &saved_set);
      vg_assert(r == 0);

With this, memcheck seems to work perfectly for my case. --tool=massif however does not...

paulfloyd commented 4 months ago

There are a few bugzilla items similar to this.

First we need a better implementation of the mrs instructions.

Are they documented for Darwin?

LouisBrunner commented 4 months ago

Not sure if helpful, but FYI I got something semiworking on a complex app with 2 patches

Oh, very interesting! Does your app uses threads at all? Do you interface with objc in any way?

one that adds a (hackish) DC_ZVA instruction implementation and another that avoids crashes trying to load debug info for a lot a system libraries. mypatch.diff.txt EDIT: Also needs this, as some things assume 64-byte cache lines it seems:
--- a/coregrind/m_machine.c
+++ b/coregrind/m_machine.c
@@ -1859,7 +1859,8 @@ Bool VG_(machine_get_hwcaps)( void )
      vai.hwcaps |= VEX_HWCAPS_ARM64_LRCPC;
      vai.hwcaps |= VEX_HWCAPS_ARM64_DIT;

-     ULong ctr_el0 = 0;
+     // 64 byte cachelines
+     ULong ctr_el0 = 0x00040004;
 #else
      r = VG_(sigprocmask)(VKI_SIG_UNBLOCK, &tmp_set, &saved_set);
      vg_assert(r == 0);

Thanks, I will look into those. If you have any extra background on the reasoning behind those fixes, that would be very nice.

Some specific questions I have:

Regarding the debugging fix: does your binary have no debug symbols or truncated ones?
Regarding the cache lines: any reference for this?

With this, memcheck seems to work perfectly for my case. --tool=massif however does not...

If you get it working, feel free to post here. I am focusing mostly on memcheck as it's the default tool and there is still so much that is broken.

Are they documented for Darwin?

Couldn't find much personally. The OSS source code have a few mentions of the Apple-specific registries and that's kind of it. I don't even think you can run mrs at all in user-space (at least Valgrind always got SIGILL'd).

First we need a better implementation of the mrs instructions.

Not sure what you mean by this. I added a few extra ones here, here and here.

paulfloyd commented 4 months ago

I've only started looking at this.

My understanding is that the MSR instructions are kind of like a 'soft' version of CPUID. Unlike CPUID which returns baked-in values MRS traps to the kernel. What I don't know yet is whether it is always safe to use a dirty helper or not. It will be a problem if the kernel reports capabilities that Valgrind doesn't support.

rdoeffinger commented 4 months ago

First we need a better implementation of the mrs instructions.

I don't think so, the OS parts of macOS at least seem to not bother and just assume things, like the granularity at which dc zva works As far as I can tell the libc usages don't care what the MSR contains (understandable, the loop would be far more complex and costly if it did).

Oh, very interesting! Does your app uses threads at all? Do you interface with objc in any way?

No, no threads at all (by default). Just a command-line application primarily targetting Linux, so no objc either. There is a python dependency but for other reasons we can run that in a separate process, so as long as fork works all that can be skipped (though fork does not seem to work in massif is one of the problems).

Regarding the debugging fix: does your binary have no debug symbols or truncated ones?

Sorry, I should indeed have written more: there is no issue with my binary. The problem is with in-memory system libraries where this triggered. First I tried to skip all offending binaries by name in img_from_memory (it ONLY happens with system libraries and when img_from_memory is used), but it got a bit much and went with this more general solution.

Regarding the cache lines: any reference for this?

Not really, mostly guessing/trial and error. But the DCZID_EL0 register specifies the proper value, so could use that to double-check. The current code sets this to an invalid value, hoping that generic code would not then use this instruction, but macOS OS-level code is not generic... Purely by arch-spec we'd just need to make sure that the DCZID_EL0 return value and the size used by DC_ZVA match, but as far as I can tell that will not work on macOS, unless it all matches what the actual M processors do. DISCLAIMER: it's all guesswork from my side.

If you get it working, feel free to post here. I am focusing mostly on memcheck as it's the default tool and there is still so much that is broken.

Yes, makes total sense. The massif issues, whatever they are, seem a bit beyond my skills. For now I'll just hope that someone else does some fixes that just happen to fix that, too!

paulfloyd commented 4 months ago

First we need a better implementation of the mrs instructions.

I don't think so, the OS parts of macOS at least seem to not bother and just assume things, like the granularity at which dc zva works As far as I can tell the libc usages don't care what the MSR contains (understandable, the loop would be far more complex and costly if it did).

macOS isn't my only concern. I also want to ensure that everything works correctly on FreeBSD and Linux.

rdoeffinger commented 4 months ago

None of my patches should change anything for FreeBSD or Linux. At most there is a question if on non-macOS either the "dc zva" instruction should keep producing an illegal instruction error, or alternatively if DCZID_EL0 should be changed to report a value indicating support for dc zva. But that seems at best a tiny improvement of questionable practical relevance that would come far down the priority list.

paulfloyd commented 4 months ago

https://bugs.kde.org/show_bug.cgi?id=377966

rdoeffinger commented 4 months ago

That ticket is old and lacks reproducers, but it sounds like Android might have the same issue (optimized libraries not checking if the instruction is available before using it). If the cache line size is correctly detected by the (existing) code on Android, my patch should also work for that case. A bit polish and/or finally updating the code for DCZID_EL0 to return the instruction as supported might make sense though.

rdoeffinger commented 4 months ago

I put my patch also on that ticket, maybe there is some user feedback on it. Anyway feel free to re-use any part of my patches in any way you want.

Yuri6037 commented 4 months ago

Hello,

I've just tried building the M1 branch on my M1 Max and I got the exact same issue than rdoeffinger related to debug info. This is the log I get on ANY application before applying rdoeffinger patch:

➜  valgrind-macos git:(feature/m1) ./vg-in-place ls
==18327== Memcheck, a memory error detector
==18327== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18327== Using Valgrind-3.23.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==18327== Apple Silicon support is currently experimental, see https://github.com/LouisBrunner/valgrind-macos/issues/56 in case of issues.
==18327== Command: ls
==18327== 
==18327== Valgrind: debuginfo reader: ensure_valid failed:
==18327== Valgrind:   during call to ML_(img_strdup)
==18327== Valgrind:   request for range [845575592, +1) exceeds
==18327== Valgrind:   valid image size of 777363456 for image:
==18327== Valgrind:   "/usr/lib/system/libsystem_darwindirectory.dylib"
==18327== 
==18327== Valgrind: debuginfo reader: Possibly corrupted debuginfo file.
==18327== Valgrind: I can't recover.  Giving up.  Sorry.
==18327==

After applying the patch and recompiling vg-in-place is now able to trace trivial applications such as ls. I have tried running a more advanced own app but it has 2 issues:

VG is unable to load a binary if it depends on a DYLD_LIBRARY_PATH. Environment seem completely ignored when launching vg.
When placing the dylib in a location the VG accepts it ends up throwing LOTs of memory errors.

This is the output of VG when attempting to trace a custom CLI tool using a Rust dylib (everything is built with debug info including the Rust dylib):

➜  valgrind-macos git:(feature/m1) ✗ DYLD_LIBRARY_PATH=../bpx-edit-core/target/debug/ ./vg-in-place ../bpx-edit-core/a.out
==19581== Memcheck, a memory error detector
==19581== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==19581== Using Valgrind-3.23.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==19581== Apple Silicon support is currently experimental, see https://github.com/LouisBrunner/valgrind-macos/issues/56 in case of issues.
==19581== Command: ../bpx-edit-core/a.out
==19581== 
--19581-- run: /usr/bin/dsymutil "/Users/xx/Projects/valgrind-macos/libBPXEditCore.dylib"
--19581-- UNKNOWN mach_msg2 unhandled MACH64_MSG_VECTOR option
--19581-- UNKNOWN mach_msg2 unhandled MACH64_MSG_VECTOR option (repeated 2 times)
==19581== Conditional jump or move depends on uninitialised value(s)
==19581==    at 0x18B0F2154: _ds_item (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1E87: ds_user_byuid (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1B83: search_item_bynumber (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F4287: getpwuid_r (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B11E813: _CFStringGetUserDefaultEncoding (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x18B11E233: __CFInitialize (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x1028B105B: invocation function for block in dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const::$_0::operator()() const (in /usr/lib/dyld)
==19581==    by 0x1028EF307: invocation function for block in dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void ( block_pointer)(unsigned int), void const*) const (in /usr/lib/dyld)
==19581==    by 0x1028E299B: invocation function for block in dyld3::MachOFile::forEachSection(void ( block_pointer)(dyld3::MachOFile::SectionInfo const&, bool, bool&)) const (in /usr/lib/dyld)
==19581==    by 0x1028922FB: dyld3::MachOFile::forEachLoadCommand(Diagnostics&, void ( block_pointer)(load_command const*, bool&)) const (in /usr/lib/dyld)
==19581==    by 0x1028E192F: dyld3::MachOFile::forEachSection(void ( block_pointer)(dyld3::MachOFile::SectionInfo const&, bool, bool&)) const (in /usr/lib/dyld)
==19581==    by 0x1028EEE1B: dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void ( block_pointer)(unsigned int), void const*) const (in /usr/lib/dyld)
==19581== 
==19581== Conditional jump or move depends on uninitialised value(s)
==19581==    at 0x18ACE9FE4: object_getClass (in /usr/lib/libobjc.A.dylib)
==19581==    by 0x18ADC2A8B: xpc_dictionary_get_int64 (in /usr/lib/system/libxpc.dylib)
==19581==    by 0x18B0F217F: _ds_item (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1E87: ds_user_byuid (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1B83: search_item_bynumber (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F4287: getpwuid_r (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B11E813: _CFStringGetUserDefaultEncoding (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x18B11E233: __CFInitialize (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x1028B105B: invocation function for block in dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const::$_0::operator()() const (in /usr/lib/dyld)
==19581==    by 0x1028EF307: invocation function for block in dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void ( block_pointer)(unsigned int), void const*) const (in /usr/lib/dyld)
==19581==    by 0x1028E299B: invocation function for block in dyld3::MachOFile::forEachSection(void ( block_pointer)(dyld3::MachOFile::SectionInfo const&, bool, bool&)) const (in /usr/lib/dyld)
==19581==    by 0x1028922FB: dyld3::MachOFile::forEachLoadCommand(Diagnostics&, void ( block_pointer)(load_command const*, bool&)) const (in /usr/lib/dyld)
==19581== 
==19581== Conditional jump or move depends on uninitialised value(s)
==19581==    at 0x18ACE9FE8: object_getClass (in /usr/lib/libobjc.A.dylib)
==19581==    by 0x18ADC2A8B: xpc_dictionary_get_int64 (in /usr/lib/system/libxpc.dylib)
==19581==    by 0x18B0F217F: _ds_item (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1E87: ds_user_byuid (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1B83: search_item_bynumber (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F4287: getpwuid_r (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B11E813: _CFStringGetUserDefaultEncoding (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x18B11E233: __CFInitialize (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x1028B105B: invocation function for block in dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const::$_0::operator()() const (in /usr/lib/dyld)
==19581==    by 0x1028EF307: invocation function for block in dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void ( block_pointer)(unsigned int), void const*) const (in /usr/lib/dyld)
==19581==    by 0x1028E299B: invocation function for block in dyld3::MachOFile::forEachSection(void ( block_pointer)(dyld3::MachOFile::SectionInfo const&, bool, bool&)) const (in /usr/lib/dyld)
==19581==    by 0x1028922FB: dyld3::MachOFile::forEachLoadCommand(Diagnostics&, void ( block_pointer)(load_command const*, bool&)) const (in /usr/lib/dyld)
==19581== 
==19581== Use of uninitialised value of size 8
==19581==    at 0x18ACE9FEC: object_getClass (in /usr/lib/libobjc.A.dylib)
==19581==    by 0x18ADC2A8B: xpc_dictionary_get_int64 (in /usr/lib/system/libxpc.dylib)
==19581==    by 0x18B0F217F: _ds_item (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1E87: ds_user_byuid (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F1B83: search_item_bynumber (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B0F4287: getpwuid_r (in /usr/lib/system/libsystem_info.dylib)
==19581==    by 0x18B11E813: _CFStringGetUserDefaultEncoding (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x18B11E233: __CFInitialize (in /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation)
==19581==    by 0x1028B105B: invocation function for block in dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const::$_0::operator()() const (in /usr/lib/dyld)
==19581==    by 0x1028EF307: invocation function for block in dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void ( block_pointer)(unsigned int), void const*) const (in /usr/lib/dyld)
==19581==    by 0x1028E299B: invocation function for block in dyld3::MachOFile::forEachSection(void ( block_pointer)(dyld3::MachOFile::SectionInfo const&, bool, bool&)) const (in /usr/lib/dyld)
==19581==    by 0x1028922FB: dyld3::MachOFile::forEachLoadCommand(Diagnostics&, void ( block_pointer)(load_command const*, bool&)) const (in /usr/lib/dyld)
==19581== 
[SNIP]
==19581== 
==19581== Process terminating with default action of signal 5 (SIGTRAP)
==19581==    at 0x18CA3D9E0: _NSThreadPoisoned (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==19581==    by 0x18C272D07: -[NSThread init] (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==19581==    by 0x18CA3E6CF: ____mainNSThread_block_invoke (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==19581==    by 0x18AF0A3E7: _dispatch_client_callout (in /usr/lib/system/libdispatch.dylib)
==19581==    by 0x18AF0BC67: _dispatch_once_callout (in /usr/lib/system/libdispatch.dylib)
==19581==    by 0x18C272C9F: _NSThreadGet0 (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==19581==    by 0x18C2728FF: _NSInitializePlatform (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==19581==    by 0x18ACE7CF3: load_images (in /usr/lib/libobjc.A.dylib)
==19581==    by 0x1028A478F: dyld4::RuntimeState::notifyObjCInit(dyld4::Loader const*) (in /usr/lib/dyld)
==19581==    by 0x1028AD44F: dyld4::Loader::runInitializersBottomUp(dyld4::RuntimeState&, dyld3::Array<dyld4::Loader const*>&) const (in /usr/lib/dyld)
==19581==    by 0x1028AD3FF: dyld4::Loader::runInitializersBottomUp(dyld4::RuntimeState&, dyld3::Array<dyld4::Loader const*>&) const (in /usr/lib/dyld)
==19581==    by 0x1028AD3FF: dyld4::Loader::runInitializersBottomUp(dyld4::RuntimeState&, dyld3::Array<dyld4::Loader const*>&) const (in /usr/lib/dyld)
==19581== 
==19581== HEAP SUMMARY:
==19581==     in use at exit: 33,966 bytes in 492 blocks
==19581==   total heap usage: 609 allocs, 117 frees, 42,575 bytes allocated
==19581== 
==19581== LEAK SUMMARY:
==19581==    definitely lost: 10,336 bytes in 323 blocks
==19581==    indirectly lost: 0 bytes in 0 blocks
==19581==      possibly lost: 2,536 bytes in 38 blocks
==19581==    still reachable: 21,094 bytes in 131 blocks
==19581==         suppressed: 0 bytes in 0 blocks
==19581== Rerun with --leak-check=full to see details of leaked memory
==19581== 
==19581== Use --track-origins=yes to see where uninitialised values come from
==19581== For lists of detected and suppressed errors, rerun with: -s
==19581== ERROR SUMMARY: 1285 errors from 161 contexts (suppressed: 2 from 2)
[1]    19581 trace trap  DYLD_LIBRARY_PATH=../bpx-edit-core/target/debug/ ./vg-in-place

I wonder if these errors are due to VG internal issues or if they are real and Apply really needs to fix their dyld... Are these errors expected?

EDIT: I also wonder why it's calling _NSThreadPoisoned as in my test I have not used any threads.

LouisBrunner commented 4 months ago

@Yuri6037 FYI, I edited your comment to remove the bulk of the errors (as they aren't particularly useful for this discussion and can still be accessed through the Github edit history otherwise) because your message made it difficult to scroll through this issue.

Regarding the debugging fix: does your binary have no debug symbols or truncated ones?

Sorry, I should indeed have written more: there is no issue with my binary. The problem is with in-memory system libraries where this triggered. First I tried to skip all offending binaries by name in img_from_memory (it ONLY happens with system libraries and when img_from_memory is used), but it got a bit much and went with this more general solution.

I've just tried building the M1 branch on my M1 Max and I got the exact same issue than rdoeffinger related to debug info. This is the log I get on ANY application before applying rdoeffinger patch:

Good to know that it isn't an isolated issue. I have yet to reproduce it, I will have to look into it a bit more.

VG is unable to load a binary if it depends on a DYLD_LIBRARY_PATH. Environment seem completely ignored when launching vg.

Valgrind should provide all DYLD env vars to the real dyld when running your binary. If you have a MRE, I would gladly look into that issue.

When placing the dylib in a location the VG accepts it ends up throwing LOTs of memory errors.

I wonder if these errors are due to VG internal issues or if they are real and Apply really needs to fix their dyld... Are these errors expected?

This is usually due to Valgrind not knowing about how the memory is laid out on macOS. In the past, they have been hidden through a copious amount of suppressions (a special Valgrind file). I haven't seen those specific one yet, so I don't know what's causing it but the general rule I found is: if you dig down to why an issue like _NSThreadPoisoned happens, you usually fix a lot of seemingly unrelated warnings (by mapping memory correctly).

EDIT: I also wonder why it's calling _NSThreadPoisoned as in my test I have not used any threads.

As stated in my latest update, this is currently expected. I haven't had time to investigate why that happens apart that objc is being loaded at some point. You don't have to use threads especially but if you use any macOS capability which rely on objc, you will have this crash.

Regarding the cache lines: any reference for this?

Not really, mostly guessing/trial and error. But the DCZID_EL0 register specifies the proper value, so could use that to double-check. The current code sets this to an invalid value, hoping that generic code would not then use this instruction, but macOS OS-level code is not generic... Purely by arch-spec we'd just need to make sure that the DCZID_EL0 return value and the size used by DC_ZVA match, but as far as I can tell that will not work on macOS, unless it all matches what the actual M processors do. DISCLAIMER: it's all guesswork from my side.

I am not very knowledgeable about this so thanks for this background as it will make easier to research and fix.

LouisBrunner / valgrind-macos

M1/M2 Compatibility #56

Compilation: static vs dynamic executables

Build statically and make dynamic

Build dynamically and adapt to our purposes

Summary

Code: reimplementing massive parts

Running a program

So, where are now?

Contributing

Running with vgdb

Running with lldb

Running directly

Summary