Closed 0xfeedface-turbo closed 4 years ago
I forgot to mention that I spent a lot of time troubleshooting this before discovering it.
I tried different NVMe cards, different motherboards, NVMe heatsinks, built-in M.2 slots vs PCI adapter cards, UEFI PCI power settings, enabling/disabling ASPM, etc., and the kernel panic always recurred. Sometimes the VID/PID would read as 0xffff.
Onboard PCH IGE, AHCI, USB never had an issue at all, only NVMe. I'm guessing it's some kind of UEFI firmware bug?
That's an extremely curious bug, thanks for suggesting a fix. I think force disabling RC6 by default in the FeatureControl dict of the framebuffer IORegistryEntry is a good immediate solution.
Were you able to isolate the issue just to a single key of this dictionary?
It's worth mentioning that you can also disable render standby by passing the boot-arg forceRenderStandby=0.
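For anyone setting this through a bootloader config rather than typing it at boot, here is a minimal Python sketch of adding the argument to a boot-args string. It assumes an OpenCore-style config.plist layout (NVRAM > Add > 7C436110-AB2A-4BBB-A880-FE41995C9F82 > boot-args), which is not stated in this thread; the helper name add_boot_arg is made up for illustration.

```python
import plistlib

# Assumption: OpenCore-style layout, where boot-args lives under this GUID
# in the NVRAM > Add section of config.plist.
APPLE_BOOT_GUID = "7C436110-AB2A-4BBB-A880-FE41995C9F82"


def add_boot_arg(config: dict, arg: str) -> dict:
    """Set `arg` in boot-args, replacing any earlier value for the same key."""
    nvram_add = config.setdefault("NVRAM", {}).setdefault("Add", {})
    section = nvram_add.setdefault(APPLE_BOOT_GUID, {})
    current = section.get("boot-args", "").split()
    key = arg.split("=", 1)[0]
    # Drop an older setting of the same key so the new value wins.
    current = [a for a in current if a.split("=", 1)[0] != key]
    current.append(arg)
    section["boot-args"] = " ".join(current)
    return config


if __name__ == "__main__":
    # Hypothetical starting config with RenderStandby forced on.
    config = {"NVRAM": {"Add": {APPLE_BOOT_GUID: {"boot-args": "-v forceRenderStandby=1"}}}}
    add_boot_arg(config, "forceRenderStandby=0")
    print(plistlib.dumps(config).decode())
```

The helper replaces an existing forceRenderStandby entry instead of appending a second one, so the resulting string never carries two conflicting values.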
Thanks for the tip on the bootarg. I am pretty sure that's it.
It can take hours for the panic to happen, but I set RenderStandby back to 1 and got a panic almost immediately. I have reverted the previous changes and am testing with just forceRenderStandby=0 right now, and it hasn't panicked so far.
I am not sure about the power impact of this change. This is a desktop system, but the same problem could be happening with laptops. One of the Linux posts mentions disabling coarse power gating as the better option. There is a key CoarsePowerGatingSelect, but I haven't deduced what the values mean yet.
RenderStandby refers to RC6, the lowest-power idle render state. It has been notoriously buggy and required workarounds, both in Linux and Windows.
Coarse power gating is another mechanism used in GEN9 to transition Render and Media engines to sleep. The two appear to be independent in principle. The CoarsePowerGatingSelect bits 0 and 1 are used to enable Render and Media CPG, respectively. An older version of i915 used to disable Render CPG (https://patchwork.kernel.org/patch/6193051/), but apparently it is now enabled along with RC6.
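The bit layout described above can be sketched as a small decoder. This is only an illustration of the description in this comment: the function and constant names are hypothetical, and any bits beyond 0 and 1 are reported as-is without interpretation.

```python
RENDER_CPG_BIT = 0x1  # bit 0: Render coarse power gating
MEDIA_CPG_BIT = 0x2   # bit 1: Media coarse power gating


def decode_cpg_select(value: int) -> dict:
    """Split a CoarsePowerGatingSelect value into the two documented bits."""
    return {
        "render_cpg": bool(value & RENDER_CPG_BIT),
        "media_cpg": bool(value & MEDIA_CPG_BIT),
        # Anything outside bits 0 and 1 is passed through uninterpreted.
        "other_bits": value & ~(RENDER_CPG_BIT | MEDIA_CPG_BIT),
    }


if __name__ == "__main__":
    for value in (0, 1, 2, 3):
        print(value, decode_cpg_select(value))
```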
Thanks for the info, it has saved me a lot of time!
I did some testing with RenderStandby=1 and CoarsePowerGatingSelect=0, and I was actually able to get the same NVMe crash with the display ON for the first time. Do you know what bit 2 is used for? The default in the CFL FB kext is 4, and disabling that bit seems to make a difference.
Setting forceRenderStandby=0 in boot-args solves the crashes completely.
Intel Power Gadget reports that the IGP frequency never drops below 350 MHz and total power consumption is approximately 1 W higher than with RenderStandby enabled.
I'm still at a loss as to why RC6 on the IGP would be affecting the NVMe at all, though.
CoarsePowerGatingSelect=4 uses the value from the platform info struct at offset 0x58 (gPlatformInformationList, see IntelFramebuffer.bt) to configure CPG:
AppleIntelFramebufferController::getCPGControl
...
cpgsel = OSMetaClassBase::safeMetaCast(v3, OSNumber::metaClass);
if ( cpgsel )
{
    cpgsel = (cpgsel->vtbl->unsigned32BitValue)(cpgsel);
    if ( cpgsel != 4 )
        goto LABEL_7;                       // any other value is stored unchanged
    this->CoarsePowerGatingSelect = 0;
    v4 = this->platformInfo->member22;
    cpgsel = (&dword_0 + 2);                // decompiler rendering of the constant 2
    if ( _bittest(&v4, 0x10u) )             // bit 16 of member22
    {
        this->CoarsePowerGatingSelect = 1;
        cpgsel = (&dword_0 + 3);            // decompiler rendering of the constant 3
    }
    if ( _bittest(&v4, 0x11u) )             // bit 17 of member22
LABEL_7:
        this->CoarsePowerGatingSelect = cpgsel;
}
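Read literally, the decompiled excerpt above stores any value other than 4 unchanged, and for 4 it builds the result from bits 16 and 17 of the platform info field. Below is a minimal Python sketch of that reading. It covers only the lines shown (the excerpt starts with an elision), and the function and parameter names are hypothetical.

```python
PLATFORM_RENDER_CPG_BIT = 16  # 0x10 in the first _bittest call
PLATFORM_MEDIA_CPG_BIT = 17   # 0x11 in the second _bittest call
USE_PLATFORM_INFO = 4         # value that defers to the platform info flags


def resolve_cpg_select(requested, platform_flags, current=0):
    """Mirror the control flow of the decompiled snippet.

    requested      -- the property value, or None if it is missing or not a number
    platform_flags -- platformInfo->member22 from the snippet
    current        -- the value already stored, kept when nothing is written
    """
    if requested is None:
        return current  # the outer `if ( cpgsel )` is skipped
    if requested != USE_PLATFORM_INFO:
        return requested  # goto LABEL_7: stored as-is
    render = (platform_flags >> PLATFORM_RENDER_CPG_BIT) & 1
    media = (platform_flags >> PLATFORM_MEDIA_CPG_BIT) & 1
    # Bit 16 sets result bit 0 (Render CPG), bit 17 sets result bit 1 (Media CPG).
    return render | (media << 1)


if __name__ == "__main__":
    for flags in (0, 1 << 16, 1 << 17, 3 << 16):
        print(hex(flags), resolve_cpg_select(4, flags))
```

Under this reading, 4 acts as a "take it from the platform info" selector rather than as a third enable bit, but that interpretation rests on the excerpt alone.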
It's a complete mystery why there is interference between GPU and PCI. If you can reproduce it on Linux with i915, then this could be reported to Intel.
By the way, the value CSTS=0xffffffff also looks suspicious according to the spec.
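To make the CSTS remark concrete, here is a small Python sketch that decodes the low bits of the NVMe Controller Status register (RDY in bit 0, CFS in bit 1, SHST in bits 3:2) and treats an all-ones read as a sign that the device did not answer at all. PCIe reads from a device that has dropped off the bus typically come back as all ones, which would also fit the VID/PID reading as 0xffff mentioned earlier in the thread. The function name and the returned dictionary layout are made up for illustration.

```python
ALL_ONES = 0xFFFFFFFF


def decode_csts(value: int) -> dict:
    """Decode the low bits of an NVMe CSTS register read."""
    if value == ALL_ONES:
        # Not a plausible status: every bit set, including reserved ones.
        # Usually this means the read itself failed (device unreachable).
        return {"unreachable": True}
    return {
        "unreachable": False,
        "ready": bool(value & 0x1),                    # RDY, bit 0
        "controller_fatal_status": bool(value & 0x2),  # CFS, bit 1
        "shutdown_status": (value >> 2) & 0x3,         # SHST, bits 3:2
    }


if __name__ == "__main__":
    for raw in (0x00000001, 0x00000002, ALL_ONES):
        print(hex(raw), decode_csts(raw))
```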
A similar bug in Linux: https://bugs.freedesktop.org/show_bug.cgi?id=108546. Apparently, it is a BIOS issue, although in that case intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0 did not help.
Thanks for your help! Added a comment to WhateverGreen FAQ. Other FAQs will also need to be updated.
CC @Andrey1970AppleLife @khronokernel @PMheart
I added the forceRenderStandby=0 boot arg as well, and the IGPU is stuck at 0.3 GHz.
Maybe this state is when TRIM runs and it is crashing? Try sudo trimforce disable and reboot. If re-enabling, it is recommended to run Disk First Aid.
It's back doing it again on my machine after a month or so of no issues
Getting more consistent too
I haven't had this panic since I disabled TRIM
Will try that - thank you!
How did you disable TRIM? I tried your command, but after a reboot NVMe TRIM is still enabled. Maybe this command is only for SATA3 SSDs?
@Mateo1234454545
ThirdPartyDrives kernel patch is set to False
sudo trimforce disable
SetApfsTrimTimeout to 999, which is the minimal timeout
I ended up having to do a fresh Big Sur install and restore my install from Time Machine.
That all went great and I'm back up and running with no freezes again. I've used @1alessandro1's tips/settings above in the hope that they might cure it long term.
I don't think I will really know for a month or so, as that's how long the freezing issue took to reappear after the last time I did all this.
I'll report back in hopes of helping anyone else down the line.
Thank you all
Let me start with the fact that this is not a bug in NVMeFix or WhateverGreen, but this seems like the best place to document the issue.
I have an Intel 9600K/H370 system that experiences kernel panics in IONVMeController that manifest as a generic timeout:
I have tried to debug this timeout, which always happens at random times but there is a commonality - it only happens when using the IGP and the display is sleeping.
The IGP going into a low-power mode seems to disrupt power to the NVMe, causing it to crash/reset, and thus causing the timeout. The NVMe keeps smart statistics on power offs, and I have recorded this anomaly:
I have not been able to figure out exactly how the IGP is causing the NVMe to lose power, but I suspect it may be related to this issue (RC6)
I modified the CFL FB kext with these changes, which seem to completely solve the KP issue:
<key>RenderStandby</key><integer>0</integer>
<key>SetRC6Voltage</key><integer>1</integer>
<key>SupportPSRwithExternalDisplay</key><integer>0</integer>
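As a rough illustration of the change above, the following Python sketch merges the three overrides into a FeatureControl-style dictionary and serializes the result with plistlib. Only the three keys and values come from this post; the starting dictionary in the usage example and the function name are hypothetical, and where exactly the dictionary sits inside the kext's Info.plist is not shown here.

```python
import plistlib

# The three values changed in the CFL framebuffer kext, as listed above.
FEATURE_CONTROL_OVERRIDES = {
    "RenderStandby": 0,
    "SetRC6Voltage": 1,
    "SupportPSRwithExternalDisplay": 0,
}


def apply_overrides(feature_control: dict, overrides: dict = FEATURE_CONTROL_OVERRIDES) -> dict:
    """Return a copy of feature_control with the overrides applied."""
    merged = dict(feature_control)
    merged.update(overrides)
    return merged


if __name__ == "__main__":
    # Hypothetical existing values, for illustration only.
    existing = {"RenderStandby": 1, "CoarsePowerGatingSelect": 4}
    merged = apply_overrides(existing)
    print(plistlib.dumps(merged).decode())
```

Keys that are not overridden, such as CoarsePowerGatingSelect in the example, are carried over untouched.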
Have you guys seen issues relating to IGP power saving causing any similar problems? I'm thinking there might be a way to work around this in WhateverGreen or NVMeFix to avoid having to create a plist-only kext to change these settings.