bloomberg / memray

Memray is a memory profiler for Python
https://bloomberg.github.io/memray/
Apache License 2.0
13.17k stars 392 forks source link

MacOS Sonoma: SIGBUS errors #476

Closed tsiq-karold closed 11 months ago

tsiq-karold commented 11 months ago

Is there an existing issue for this?

Current Behavior

I've been using memray this week to profile one of our apps and it has been tremendously helpful. Earlier today, I upgraded my Mac to Sonoma and I'm no longer able to profile this project. It crashes with a SIGBUS error:

Process finished with exit code 138 (interrupted by signal 10: SIGBUS)

Nothing has changed between it working and crashing, other than the MacOS upgrade. II tried just plain run and also live, both crash. I have also tried profiling a test Python script but that isn't crashing. So clearly something doesn't like something our app is doing but I'm not sure what the best way would be to try and debug this. Any suggestions welcome.

Expected Behavior

I'm able to successfully profile on MacOS Sonoma.

Steps To Reproduce

I tried creating a small reproducible example but I've not yet managed it.

Memray Version

1.9.1

Python Version

3.9

Operating System

macOS

Anything else?

No response

pablogsal commented 11 months ago

I tried in macOS Sonoma and I can run the test suite without problems. Do you have a reproducer I can execute?

pablogsal commented 11 months ago

Maybe can you also provide a stack trace with lldb?

tsiq-karold commented 11 months ago

I tried in macOS Sonoma and I can run the test suite without problems. Do you have a reproducer I can execute?

Unfortunately that's where I'm struggling. It's a private project and I've been trying to reproduce the problem in a small test but nothing I can share just yet. So far I've managed to narrow it down to two specific imports from an internal library. I'll keep digging in.

pablogsal commented 11 months ago

Without a reproducer we won't be able to help unfortunately. Maybe in the meantime you can provide a lldb backtrace?

tsiq-karold commented 11 months ago

Maybe can you also provide a stack trace with lldb?

Not sure it's giving me anything useful :)

Process 94883 launched: '/Users/karol/Library/Caches/pypoetry/virtualenvs/pic-submissions-zPmegehU-py3.9/bin/python' (arm64)
Writing profile results into memray-test.py.94883.bin
Process 94883 stopped
* thread #12, queue = 'com.apple.root.user-interactive-qos', stop reason = EXC_BAD_ACCESS (code=2, address=0x17037bff0)
    frame #0: 0x0000000181c52cbc libdyld.dylib`tlv_get_addr + 76
libdyld.dylib`tlv_get_addr:
->  0x181c52cbc <+76>: stp    x15, x16, [sp, #-0x10]!
    0x181c52cc0 <+80>: stp    q0, q1, [sp, #-0x20]!
    0x181c52cc4 <+84>: stp    q2, q3, [sp, #-0x20]!
    0x181c52cc8 <+88>: stp    q4, q5, [sp, #-0x20]!
Target 0: (python) stopped.
pablogsal commented 11 months ago

That seems to be only one frame, can you run "bt" to show all frames in the stack?

tsiq-karold commented 11 months ago

Ah, yes, sorry been a while since I used lldb...

(lldb) bt
* thread #12, queue = 'com.apple.root.user-interactive-qos', stop reason = EXC_BAD_ACCESS (code=2, address=0x17037bff0)
  * frame #0: 0x0000000181c52cbc libdyld.dylib`tlv_get_addr + 76
    frame #1: 0x00000001008a7088 _memray.cpython-39-darwin.so`memray::intercept::malloc(unsigned long) + 52
    frame #2: 0x00000001818df418 dyld`dyld4::RuntimeState::_instantiateTLVs(unsigned long) + 220
    frame #3: 0x0000000181c52cdc libdyld.dylib`tlv_get_addr + 108
    frame #4: 0x00000001008a7088 _memray.cpython-39-darwin.so`memray::intercept::malloc(unsigned long) + 52
    frame #5: 0x00000001818df418 dyld`dyld4::RuntimeState::_instantiateTLVs(unsigned long) + 220
    frame #6: 0x0000000181c52cdc libdyld.dylib`tlv_get_addr + 108
    frame #7: 0x00000001008a7088 _memray.cpython-39-darwin.so`memray::intercept::malloc(unsigned long) + 52
    frame #8: 0x00000001818df418 dyld`dyld4::RuntimeState::_instantiateTLVs(unsigned long) + 220
    frame #9: 0x0000000181c52cdc libdyld.dylib`tlv_get_addr + 108
    frame #10: 0x00000001008a7088 _memray.cpython-39-darwin.so`memray::intercept::malloc(unsigned long) + 52
    frame #11: 0x00000001818df418 dyld`dyld4::RuntimeState::_instantiateTLVs(unsigned long) + 220
// pattern above repeats many many times
    frame #1855: 0x00000001008a7088 _memray.cpython-39-darwin.so`memray::intercept::malloc(unsigned long) + 52
    frame #1856: 0x00000001818df418 dyld`dyld4::RuntimeState::_instantiateTLVs(unsigned long) + 220
    frame #1857: 0x0000000181c52cdc libdyld.dylib`tlv_get_addr + 108
    frame #1858: 0x00000001008a7598 _memray.cpython-39-darwin.so`memray::intercept::calloc(unsigned long, unsigned long) + 56
    frame #1859: 0x00000001818836dc libobjc.A.dylib`class_createInstance + 64
    frame #1860: 0x0000000181a9cd28 libdispatch.dylib`_os_object_alloc_realized + 32
    frame #1861: 0x000000018197f674 libxpc.dylib`_xpc_connection_create + 136
    frame #1862: 0x000000018196efe8 libxpc.dylib`xpc_connection_create_mach_service + 40
    frame #1863: 0x0000000181e82048 CoreFoundation`-[_CFXPreferences _copyDaemonConnectionSettingUpIfNecessaryForRole:andUserIdentifier:] + 452
    frame #1864: 0x0000000181e7f39c CoreFoundation`-[_CFXPreferences withConnectionForRole:andUserIdentifier:performBlock:] + 28
    frame #1865: 0x0000000181e55d4c CoreFoundation`__103-[CFPrefsSearchListSource synchronouslySendSystemMessage:andUserMessage:andDirectMessage:replyHandler:]_block_invoke + 148
    frame #1866: 0x0000000181a9d950 libdispatch.dylib`_dispatch_client_callout2 + 20
    frame #1867: 0x0000000181ab0ba0 libdispatch.dylib`_dispatch_apply_invoke + 176
    frame #1868: 0x0000000181a9d910 libdispatch.dylib`_dispatch_client_callout + 20
    frame #1869: 0x0000000181aaf3cc libdispatch.dylib`_dispatch_root_queue_drain + 864
    frame #1870: 0x0000000181aafa04 libdispatch.dylib`_dispatch_worker_thread2 + 156
    frame #1871: 0x0000000181c470d8 libsystem_pthread.dylib`_pthread_wqthread + 228

(edited to remove repeated sections)

pablogsal commented 11 months ago

Ugh, that looks like a stackoverflow :(

How are you getting Python 3.9?

tsiq-karold commented 11 months ago

With pyenv

pablogsal commented 11 months ago

I can try to give you a version of memray to check if it fixes this.

Can you confirm you can compile memray in your machine? (Build instructions are in the README)

tsiq-karold commented 11 months ago

I can try to give you a version of memray to check if it fixes this.

Can you confirm you can compile memray in your machine? (Build instructions are in the README)

I've never built it but I'll try it out this evening.

Thanks for the help and fast responses!

pablogsal commented 11 months ago

In any case I think we still need a reproducer to be able to understand how this is happening.

pablogsal commented 11 months ago

Can you check that memray from this branch crashes or works?

https://github.com/pablogsal/memray/tree/dlyd4

To build:

git clone https://github.com/pablogsal/memray.git
cd memray
git checkout dlyd4
brew install lz4
export CFLAGS="-I$(brew --prefix lz4)/include" LDFLAGS="-L$(brew --prefix lz4)/lib"
# Ensure you uninstall any memray in your venv first!
python -m pip install .

Then run memray over your application

pablogsal commented 11 months ago

@tsiq-karold are you by any chance using the pyobjc library in your app?

tsiq-karold commented 11 months ago

I'm still trying to narrow things down to a reproducible test that I can share.

I tried that branch and it still crashes with an identical looking stack. I also tried with a slightly modified version of your branch so I could be 100% it was using that version; my modification was simply to print the image_name and right before the crash it's definitely not showing anything resembling the names in that if statement.

We do not use pyobjc, no.

pablogsal commented 11 months ago

I'm still trying to narrow things down to a reproducible test that I can share.

I tried that branch and it still crashes with an identical looking stack. I also tried with a slightly modified version of your branch so I could be 100% it was using that version; my modification was simply to print the image_name and right before the crash it's definitely not showing anything resembling the names in that if statement.

We do not use pyobjc, no.

Could you paste here all the image_name values that happen before it crashes?

pablogsal commented 11 months ago

Meanwhile a way to narrow that is the following:

For every image_name, add a similar line as the one I added and check that doesn't crash. Then uncomment line by line until it crashes again. This can be done as binary search. Then tell us the library that makes it crash.

Notice that many libraries could remove the crash but the one we are looking for is a system library that's not the c or the cpp runtime.

pablogsal commented 11 months ago

Can you try also the last commit in:

https://github.com/pablogsal/memray/tree/dlyd4

(https://github.com/pablogsal/memray/commit/dd30b6571283129d27409a821c7912adb8b5a470)

tsiq-karold commented 11 months ago

Could you paste here all the image_name values that happen before it crashes?

This is the list as it gets printed out. There's a lot of duplication in there but I thought I'd leave it as is in case the order matters.

image_names.log

tsiq-karold commented 11 months ago

Can you try also the last commit in:

https://github.com/pablogsal/memray/tree/dlyd4

(pablogsal@dd30b65)

No crashes with this commit!!

pablogsal commented 11 months ago

Ok that's fantastic to know. We can try to trim that a bit. Can you try to change the conditional to:

strstr(image_name, "/usr/lib/") || strstr(image_name, "/System/")

And check if it crashes?

tsiq-karold commented 11 months ago

Do you mean the conditional that you modified in your first commit or the one from the second commit?

tsiq-karold commented 11 months ago

I did this:

                if (strstr(image_name, "/usr/lib/") || strstr(image_name, "/System/")) {
                    patch_stubs(section, slide, dyninfo_table, restore_original);
                }

and it started crashing again.

pablogsal commented 11 months ago

Can you try the following to see if it crashes (each separately as an addition to what is already there in vanilla memray):

strstr(image_name, "System/Library/PrivateFrameworks")

strstr(image_name, "System/Library/Frameworks")

strstr(image_name, "Services")

strstr(image_name, "MallocStack")

To be clear, you need to change it so it looks this way:


                if (strstr(image_name, "/usr/lib/") || strstr(image_name, NEW_OPTION_HERE)) {
                    patch_stubs(section, slide, dyninfo_table, restore_original);
                }
tsiq-karold commented 11 months ago

I've been doing something along the lines to narrow it down, will report once I find the culprit

tsiq-karold commented 11 months ago

OK, with this line in:

                if (strstr(image_name, "/usr/lib/") || strstr(image_name, "/System/Library/Frameworks/")
            ...
            || strstr(image_name, "/System/Library/PrivateFrameworks/MallocStackLogging.framework")
            ...
            ) {
                    patch_stubs(section, slide, dyninfo_table, restore_original);
                }

It crashes. Commenting out the /System/Library/PrivateFrameworks/MallocStackLogging.framework runs without crashing.

tsiq-karold commented 11 months ago

I should add that from everything in /System/Library, only PrivateFrameworks/MallocStackLogging.framework causes trouble. I'll try the other things you suggested.

tsiq-karold commented 11 months ago

No crashes with this:

if (!strstr(image_name, "/System/Library/PrivateFrameworks/MallocStackLogging.framework")) {
    patch_stubs(section, slide, dyninfo_table, restore_original);
}
pablogsal commented 11 months ago

Hummm I am confused, shouldn't my first commit in the branch cover that:

https://github.com/pablogsal/memray/commit/a7e1ad6b8a640ea8abf11bd6dd15d0e23b5b9d8e

pablogsal commented 11 months ago

That is skipping

if ( strstr(image_name, "/System/Library/PrivateFrameworks") 

Which should include that library no?

tsiq-karold commented 11 months ago

Apologies, I suspect this may have been user error... I just updated my local clone of your branch to effectively be your first commit and it's not crashing. It's also not crashing when making it more tightly scoped, i.e.:

        if (strstr(image_name, "/System/Library/PrivateFrameworks/MallocStackLogging.framework") || strstr(image_name, "libdyld.dylib") || strstr(image_name, "/usr/lib/system/")) {
            LOG(DEBUG) << "Skipping patching image: " << image_name;
            continue;
        }
pablogsal commented 11 months ago

Fantastic! This is enough for us to do a fix. We will release a new version with the fix shortly 😉

In the meantime you can hopefully use the custom memray build you have with the fix to profile your applications in the meantime :)

Btw, if you have some time it would be fantastic if you can leave a comment with your experience using memray in here:

https://github.com/bloomberg/memray/discussions/226

It really helps!

pablogsal commented 11 months ago

For the record, the problem is that apple has a debug malloc library that enables hooks that take snapshots of the stack in demand in that library:

https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html

These custom hooks make the system libraries interfere with memray because both are trying to patch malloc in incompatible ways. As apple has an advantage over memray, they can patch some system pointers that we cannot and forces us to not be able to stop recursion when the memray thread local variables are initialised.

The solution is basically not mess with that library from memray to avoid interfering.

The crash only happens if you are using something that ends needing to link to the system framework libraries that can use that capability, that's why you are only seeing the crash in your application and not on other python scripts.

tsiq-karold commented 11 months ago

Awesome, thanks. Yes, I can work with this for now.

Thank you for the timely help.