Catalina enable-hdmi20 CoreDisplay patch leads to Code Signing crash of WindowServer

lambdaupb commented 3 years ago

see https://github.com/csrutil/DeskMini/issues/10

DeskMini 310, i5-8500 UHD630, Catalina 10.15.7, Opencore 0.6.3

enable-hdmi20 patches CoreDisplay at runtime. Under high memory pressure, the CoreDisplay library memory apparently gets moved to swap.

When the library memory is loaded back into RAM, a code signing check runs and fails, which crashes WindowServer.

I am able to reproduce this by running Prime95 > Torture Test > Large FFTs, which allocates almost all of system memory, and then doing some UI work involving animations etc. (~1 min).

Possible fixes

document that users need to disable code signing (SIP ?), not sure how to do that
maybe add MAP_RESILIENT_CODESIGN flag to mmap of library/dyld_cache (https://github.com/VirusTotal/yara/issues/1309) - I have 0 clue if that works for executable regions

logs

Process:               WindowServer [5465]
Path:                  /System/Library/PrivateFrameworks/SkyLight.framework/Versions/A/Resources/WindowServer
Identifier:            WindowServer
Version:               600.00 (451.4)
Code Type:             X86-64 (Native)
Parent Process:        launchd [1]
Responsible:           WindowServer [5465]
User ID:               88

PlugIn Path:             /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
PlugIn Identifier:       com.apple.CoreDisplay
PlugIn Version:          1.0 (186.6.15)

Date/Time:             2020-11-16 19:09:29.410 +0100
OS Version:            Mac OS X 10.15.7 (19H15)
Report Version:        12
Anonymous UUID:        066D0EDF-3DB8-4976-B736-5BD0416F165D

Sleep/Wake UUID:       E94190B2-19CB-47AB-B1AE-97DCA13B6988

Time Awake Since Boot: 150000 seconds
Time Since Wake:       100000 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00007fff347d72d9
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2

kernel messages:

VM Regions Near 0x7fff347d72d9:
    __TEXT                 00007fff347b8000-00007fff347d7000 [  124K] r-x/r-x SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
--> __TEXT                 00007fff347d7000-00007fff347d8000 [    4K] r-x/rwx SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
    Submap                 00007fff347d8000-00007fff40000000 [184.2M] r--/rwx SM=PRV  process-only VM submap

Application Specific Information:
StartTime:2020-11-16 18:31:50
GPU:IG
MetalDevice for accelerator(0x312b): 0x7ff210d29038 (MTLDevice: 0x7ff1e8048000)
IOService:/AppleACPIPlatformExpert/PCI0@0/AppleACPIPCI/IGPU@2/AppleIntelFramebuffer@0

2020-11-17 01:00:58.772582+0100  localhost kernel[0]: CODE SIGNING: process 241[WindowServer]: rejecting invalid page at address 0x7fff330bf000 from offset 0xcfb7000 in file "/private/var/db/dyld/dyld_shared_cache_x86_64h" (cs_mtime:1605366281.472771946 == mtime:1605366281.472771946) (signed:0 validated:0 tainted:0 nx:0 wpmapped:0 dirty:1 depth:2)

lambdaupb commented 3 years ago

/*
 * The MAP_RESILIENT_* flags can be used when the caller wants to map some
 * possibly unreliable memory and be able to access it safely, possibly
 * getting the wrong contents rather than raising any exception.
 * For safety reasons, such mappings have to be read-only (PROT_READ access
 * only).
 *
 * MAP_RESILIENT_CODESIGN:
 *  accessing this mapping will not generate code-signing violations,
 *  even if the contents are tainted.
 * MAP_RESILIENT_MEDIA:
 *  accessing this mapping will not generate an exception if the contents
 *  are not available (unreachable removable or remote media, access beyond
 *  end-of-file, ...).  Missing contents will be replaced with zeroes.
 */
#define MAP_RESILIENT_CODESIGN  0x2000 /* no code-signing failures */
#define MAP_RESILIENT_MEDIA 0x4000 /* no backing-store failures */

It seems this only works for read-only mappings.
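For reference, a minimal sketch of what requesting this flag looks like from C, assuming a plain file mapped with PROT_READ only. `map_file_resilient` is a hypothetical helper name, and the fallback that defines the flag as 0 on non-Darwin systems is only there so the sketch compiles everywhere. It says nothing about executable regions such as the dyld shared cache, which is the case that matters in this issue.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* MAP_RESILIENT_CODESIGN is Darwin-specific. Assumption for this sketch:
 * on other systems it is acceptable to fall back to no extra flag. */
#ifndef MAP_RESILIENT_CODESIGN
#define MAP_RESILIENT_CODESIGN 0
#endif

/* Hypothetical helper: map a whole file read-only.
 * Returns NULL on failure; on success stores the mapped size in *len. */
const unsigned char *map_file_resilient(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ only: per the header comment, resilient mappings
     * have to be read-only. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_RESILIENT_CODESIGN, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return (const unsigned char *)p;
}
```

The caller is expected to `munmap` the returned pointer with the stored length when done.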

vit9696 commented 3 years ago

That's very interesting, but I believe we cannot quite remap things here. Instead we should adjust the codesign flags as we already do, but perhaps in a slightly different manner. It may be possible that I missed some for the latest 10.15 version. Could you play with it and try setting/dropping different flags?

CC @usr-sse2 @osy86 @lvs1974 @07151129

al3xtjames commented 3 years ago

Can easily reproduce on 10.14.6 here: run P95 large FFTs until some swapping occurs, and then try to open About This Mac. This should cause WindowServer to crash.

sudo sysctl vm.cs_debug=255 adds some more info:

2020-12-11 19:35:59.509 Df kernel[0:1f4918] vm_fault: signed: no validate: no tainted: no wpmapped: no prot: 0x5
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODE SIGNING: cs_invalid_page(0x7fff3ad17000): p=38037[WindowServer]
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODE SIGNING: cs_invalid_page(0x7fff3ad17000): p=38037[WindowServer] final status 0x23007b01, denying page sending SIGKILL
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODE SIGNING: process 38037[WindowServer]: rejecting invalid page at address 0x7fff3ad17000 from offset 0xb89e000 in file "/private/var/db/dyld/dyld_shared_cache_x86_64h" (cs_mtime:1605723499.64038983 == mtime:1605723499.64038983) (signed:0 validated:0 tainted:0 nx:0 wpmapped:0 dirty:1 depth:2)
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODESIGNING: vm_fault_enter(0x7fff3ad17000): *** INVALID PAGE ***

sending SIGKILL means that CS_KILL was set (note that cs_invalid_page hasn't changed in 10.15).
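To make the status word easier to read, here is a small sketch that names a few bits of the final status 0x23007b01 from the log above. The flag values are the ones I recall from XNU's cs_blobs.h and are an assumption for any specific kernel version; only four flags are listed, and `cs_describe` is a hypothetical helper, not part of Lilu or XNU.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct cs_flag_name {
    uint32_t bit;
    const char *name;
};

/* Subset of code signing status flags (values assumed from XNU's
 * cs_blobs.h; verify against the kernel source you are targeting). */
static const struct cs_flag_name cs_flag_names[] = {
    { 0x00000001u, "CS_VALID"  }, /* dynamically valid */
    { 0x00000100u, "CS_HARD"   }, /* don't load invalid pages */
    { 0x00000200u, "CS_KILL"   }, /* kill process if it becomes invalid */
    { 0x01000000u, "CS_KILLED" }, /* was killed by the kernel for invalidity */
};

/* Write the names of the known flags set in `status` into buf,
 * separated by single spaces. Unknown bits are ignored. */
void cs_describe(uint32_t status, char *buf, size_t buflen)
{
    if (buflen == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof cs_flag_names / sizeof cs_flag_names[0]; i++) {
        if (status & cs_flag_names[i].bit) {
            size_t used = strlen(buf);
            snprintf(buf + used, buflen - used, "%s%s",
                     used ? " " : "", cs_flag_names[i].name);
        }
    }
}
```

Calling `cs_describe(0x23007b01u, buf, sizeof buf)` with a 64-byte buffer reports CS_KILL among the set flags, which matches the SIGKILL seen in the log.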

lvs1974 commented 3 years ago

@al3xtjames: try adding the boot-arg -liluuseroff.

vit9696 commented 3 years ago

@al3xtjames @lambdaupb could you check whether the offset found by UserPatcher::vmProtect is correct? Because it clearly strips CS_KILL from the process.

lambdaupb commented 3 years ago

I'm not a C programmer and have no real idea how to do that. If I'm provided with step-by-step instructions, I can repro this though.

This machine is my daily driver at the moment so I'm reluctant to dive into it since my issue was solved by removing the enable-hdmi20 setting.

vit9696 commented 3 years ago

The easiest test is to enable Lilu debug logging and create a debug log in /var/log/Lilu_x.x.x.txt via -liludbgall liludump=60 boot arguments. Upload it here, and perhaps it sheds some light on the issue.

al3xtjames commented 3 years ago

Lilu is using 308 as the offset for p_csflags. Lilu_1.5.1_18.7.txt

stevezhengshiqi commented 3 years ago

@al3xtjames thanks a lot for the CoreDisplay fix in WEG. Would you mind providing some more information about the max-pixel-clock-frequency value? If you have time to update the Manual in WEG, that would be great.

zearp commented 3 years ago

I tried to reproduce on my NUC but couldn't. The system becomes laggy but not unresponsive, and it doesn't crash or even overheat. CPU usage went up and down; I guess that's part of the Large FFT torture test? I left it running for about 10 minutes whilst browsing GitHub and opening/closing the About This Mac dialog every now and then. My config can be found here.

As I mentioned here, I believe these forced logouts on NUC 8th gens are due to missing ACPI patches and/or the OpenCore configuration used. But that's just my guess, since I have no issues and run multiple NUCs. I stress tested them quite heavily with stress-ng a few months ago. No problems whatsoever; these Kaby Lake NUCs are rock solid with OpenCore for me.

I'm running the latest versions of OpenCore/Lilu/etc and compiling everything from source now but also had no problems when I didn't do that and just used the release versions. Are there any other ways for me to try and reproduce this?

lambdaupb commented 3 years ago

@zearp thank you for your attempt at reproducing this issue!

I think you have SIP disabled with

<key>csr-active-config</key>
<data>/wcAAA==</data>

where /wcAAA== (base64) is equal to ff 07 00 00 in hex, which according to Dortania (https://dortania.github.io/OpenCore-Install-Guide/troubleshooting/extended/post-issues.html#disabling-sip) disables all of SIP on Mojave / Catalina.

So code signing enforcement would be disabled and would not kill WindowServer.
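The base64 arithmetic can be checked with a short sketch. `b64_decode` and `csr_le32` are hypothetical helper names; reading the four bytes as a little-endian 32-bit value is an assumption about how csr-active-config is interpreted, and the decoder handles only the standard alphabet with trailing `=` padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not
 * part of the standard alphabet. */
static int b64_value(char c)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p = c ? strchr(alphabet, c) : NULL;
    return p ? (int)(p - alphabet) : -1;
}

/* Decode base64 text into out. Stops at the first '=' or at the end
 * of the string. Returns the number of bytes written, or -1 on an
 * invalid character or if out is too small. */
int b64_decode(const char *in, uint8_t *out, size_t outcap)
{
    uint32_t acc = 0; /* bit accumulator; only the low bits are used */
    int bits = 0;     /* number of not yet emitted bits in acc */
    size_t n = 0;

    for (; *in && *in != '='; in++) {
        int v = b64_value(*in);
        if (v < 0)
            return -1;
        acc = (acc << 6) | (uint32_t)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (n == outcap)
                return -1;
            out[n++] = (uint8_t)(acc >> bits);
        }
    }
    return (int)n;
}

/* Assemble four bytes into a 32-bit value, least significant byte first
 * (assumed layout of csr-active-config). */
uint32_t csr_le32(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

Decoding "/wcAAA==" this way yields the bytes ff 07 00 00, and under the little-endian assumption that is the bitmask 0x000007ff.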

zearp commented 3 years ago

@lambdaupb Good point! I have it disabled cuz I use VoltageShift. I just repeated the test with SIP enabled. It did run a little hotter, but after ~10 minutes of running Prime95 and opening About This Mac and Launchpad/Notification Centre a bunch of times I didn't get any crash. The fading animation varies from smooth to choppy but nothing grinds to a halt.

I'm thinking that the logouts people experienced on the NUC may have nothing to do with this, which is why I can't reproduce. Unless it also happens to you on a NUC but it seems you're using a different mini computer. I'm only here cuz you mentioned this in a NUC issue I was still subscribed to haha. But I can't seem to reproduce it on my NUCs.

lambdaupb commented 3 years ago

@zearp I have little experience with that setting, but could you check if SIP is really enabled? The Dortania guide mentions that old values in NVRAM will not be overwritten unless the property is listed in the Delete section as well.

Note: Disabling SIP with OpenCore is quite a bit different compared to Clover, specifically that NVRAM variables will not be overwritten unless explicitly told so under the Delete section. So if you've already set SIP once either via OpenCore or in macOS, you must override the variable:
NVRAM -> Block -> 7C436110-AB2A-4BBB-A880-FE41995C9F82 -> csr-active-config

zearp commented 3 years ago

@lambdaupb Yes, it was really enabled. I checked with csrutil status after rebooting and reset NVRAM in between boots for good measure. I was also prompted with a bunch of security warnings; those are due to VoltageShift, Intel Power Gadget and some other kexts I use. So my guess is that it's really turned on. Does this happen to you on a Kaby Lake NUC too or only on your DeskMini?

lambdaupb commented 3 years ago

My DeskMini has a Coffee Lake R (I think) i5-8500 CPU.

There might be something else going on as well. The crash report of WindowServer clearly shows a code signing crash on the NUC:

https://github.com/appleserial/NUC8I5BEH/issues/13

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00007fff37028253
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2

kernel messages:

VM Regions Near 0x7fff37028253:
    __TEXT                 00007fff37009000-00007fff37028000 [  124K] r-x/r-x SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
--> __TEXT                 00007fff37028000-00007fff37029000 [    4K] r-x/rwx SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
    Submap                 00007fff37029000-00007fff40000000 [143.8M] r--/rwx SM=PRV  process-only VM submap

So the issue exists and is fixed by removing enable-hdmi20 for me on 10.15 and @al3xtjames on 10.14.

It might very well be a combination with another setting or ACPI patch that triggers it though.

zearp commented 3 years ago

It might very well be a combination with another setting or ACPI patch that triggers it though.

@lambdaupb Yeah, that's my guess too. What I will do is try the EFI from the repo you linked and report back in a bit. When I wrote Kaby Lake I meant Coffee Lake of course. I'm a pro at messing up those Intel codenames, sorry for any confusion it may have caused.

lambdaupb commented 3 years ago

Thanks for the help. I will try to reproduce this issue with OpenCore updated to 0.6.4 and all other modules updated as well.

vit9696 commented 3 years ago

Let me be clear:

The issue does exist and is specific to Lilu user patcher
Disabling SIP may hide the issue, but is not recommended
@al3xtjames provided an alternative to CDF patches
Lilu user patcher is not supported on 11.x, and that is unlikely to change (so the issue is unlikely to be fixed)

zearp commented 3 years ago

@lambdaupb Just ran the same tests using the EFI from the repo you linked and again no crashes, SIP is enabled and the hdmi setting too. I'm thinking these random logouts people experienced on the NUC have nothing to do with this issue, which would explain my failure to reproduce it. But it doesn't mean there is no issue of course. I don't have a DeskMini 310 to play with but it looks like a fun little machine so I hope you can get this sorted.

The issue with the WindowServer crash you linked seems to be solved by a comment on a blog that's linked, but I can't read the comment because the comments are not loading for me for some reason. I've not done any upgrading from 10.14.x to 10.15.x and have only ever used Catalina and Big Sur on my NUCs. Maybe those crashes were related to the upgrade or something else in their setup? I think this specific issue isn't present on the NUC Coffee Lake models, but do let me know if there's anything else I can try.

likaci commented 3 years ago

(quotes zearp's previous comment in full)

@zearp Hi, I can reproduce the WindowServer crash with your EFI and with the EFI from https://github.com/appleserial/NUC8I5BEH by running "Large FFTs". My NUC was upgraded from 10.14. Can you post the blog link? Thank you.

zearp commented 3 years ago

@likaci You can’t follow the link I referred to and find the blog post yourself? Please don't quote an entire post to only add a sentence.

Check whether you can also reproduce it on a system that wasn’t upgraded from 10.14.x, because no matter how long I let it run I get no crashes, and I installed Catalina directly on mine.

I don’t have a 10.14.x installer lying around to do a clean install with and then upgrade to Catalina, but I might try for the fun of it and see if I get crashes that way.

likaci commented 3 years ago

@zearp Sorry for the disturbance and my bad English. I have read the entire page but can't find the link mentioning that upgrading from 10.14 may cause the problem.

I have only one NUC, which runs some services, so I can't reinstall it. I confirmed that disabling SIP or disabling HDMI 2.0 (enable-hdmi20) avoids the problem.

Thank you for your help. Happy new year.

Sher1ocks commented 3 years ago

스크린샷 2021-03-21 오후 11 50 30 I also had this problem in Big Sur. In the Skylake laptop, only the freq of 1.5ghz or more was maintained, and the overheating phenomenon was constantly maintained, leading to poor performance. It was resolved by turning off the enable-hdmi20 option. thank you for tip!

vit9696 commented 3 years ago

enable-hdmi20 is deprecated in favour of the max-pixel-clock feature (https://github.com/acidanthera/WhateverGreen/pull/79). Although the issue is not exclusive to the CDF side of WEG, userspace patching is implemented differently on Big Sur and above and is not affected by this issue. I no longer use Catalina or older, and thus decided not to address this issue. Closing.

zearp commented 3 years ago

Does this mean that enable-max-pixel-clock-override replaces the enable-hdmi20 option? Will the option stay or will it be removed in future builds?

Because at the moment, removing enable-hdmi20 and replacing it with enable-max-pixel-clock-override breaks 4K on Catalina and earlier.

It seems it's not doing the same thing the hdmi20 option did. But I may have misunderstood and/or not implemented it properly.

vit9696 commented 3 years ago

You may need a higher max-pixel-clock-frequency (in Hz, defaults to 675000000). https://github.com/acidanthera/WhateverGreen/blob/master/Manual/FAQ.IntelHD.en.md#hdmi-in-uhd-resolution-with-60fps

acidanthera / bugtracker

Catalina enable-hdmi20 CoreDisplay patch leads to Code Signing crash of WindowServer #1335