Acidanthera Bugtracker

IGP causes NVMe Kernel Panic CSTS=0xffffffff #1193

Closed 0xfeedface-turbo closed 4 years ago

0xfeedface-turbo commented 4 years ago

To be clear up front: this is not a bug in NVMeFix or WhateverGreen, but this seems like the best place to document the issue.

I have an Intel 9600K/H370 system that experiences kernel panics in IONVMeController that manifest as a generic timeout:

void AppleNVMeRequestTimer::PrintPending()::243:QID=1 Deadline=4390442285091 DW0=00140001 DW10=00F04593 DW11=00000000 DW12=0000001F DW13=00000000 DW14=00000000 DW15=00000000
void AppleNVMeRequestTimer::PrintPending()::243:QID=1 Deadline=4390442285091 DW0=00140001 DW10=00F04593 DW11=00000000 DW12=0000001F DW13=00000000 DW14=00000000 DW15=00000000
Debugger called: IOPlatformPanicAction -> IONVMeController
IOPlatformPanicAction -> AppleSMC
: panic(cpu 0 caller 0xffffff7f865edb30): nvme: "Fatal error occurred. CSTS=0xffffffff US[1]=0x0 US[0]=0x5a1 VID/DID=0x500215b7 . FW Revision=102000WD\n"@/BuildRoot/Library/Caches/com.apple.xbs/Sources/IONVMeFamily/IONVMeFamily-387.270.1/IONVMeController.cpp:5334
Backtrace (CPU 0), Frame : Return Address
0xffffff873a6f3a10 : 0xffffff8003fad58d mach_kernel : _handle_debugger_trap + 0x47d
0xffffff873a6f3a60 : 0xffffff80040e9145 mach_kernel : _kdp_i386_trap + 0x155
0xffffff873a6f3aa0 : 0xffffff80040da87a mach_kernel : _kernel_trap + 0x50a
0xffffff873a6f3b10 : 0xffffff8003f5a9d0 mach_kernel : _return_from_trap + 0xe0
0xffffff873a6f3b30 : 0xffffff8003facfa7 mach_kernel : _panic_trap_to_debugger + 0x197
0xffffff873a6f3c50 : 0xffffff8003facdf3 mach_kernel : _panic + 0x63
0xffffff873a6f3cc0 : 0xffffff7f865edb30 com.apple.iokit.IONVMeFamily : __ZN16IONVMeController13FatalHandlingEv + 0x10e
0xffffff873a6f3e20 : 0xffffff800465d407 mach_kernel : __ZN18IOTimerEventSource15timeoutSignaledEPvS0_ + 0x87
0xffffff873a6f3e90 : 0xffffff800465d329 mach_kernel : __ZN18IOTimerEventSource17timeoutAndReleaseEPvS0_ + 0x99
0xffffff873a6f3ec0 : 0xffffff8003fec7a5 mach_kernel : _thread_call_delayed_timer + 0xef5
0xffffff873a6f3f40 : 0xffffff8003fec345 mach_kernel : _thread_call_delayed_timer + 0xa95
0xffffff873a6f3fa0 : 0xffffff8003f5a0ce mach_kernel : _call_continuation + 0x2e
Kernel Extensions in backtrace:
com.apple.iokit.IONVMeFamily(2.1)[E109699D-6257-3176-B081-4CC8B1C181AB]@0xffffff7f865e0000->0xffffff7f8661ffff
    dependency: com.apple.driver.AppleMobileFileIntegrity(1.0.5)[1AD7D9F4-24B5-354F-BD01-C301F58FAA52]@0xffffff7f84d8d000
    dependency: com.apple.iokit.IOPCIFamily(2.9)[EF12A360-E92B-3407-8080-E4889F8AAC97]@0xffffff7f84895000
    dependency: com.apple.driver.AppleEFINVRAM(2.1)[32B99D26-4CD1-3CE5-8856-D2659CCA4861]@0xffffff7f84f67000
    dependency: com.apple.iokit.IOStorageFamily(2.1)[DFD9596C-E596-376A-8A00-3B74A06C2D02]@0xffffff7f84b83000
    dependency: com.apple.iokit.IOReportFamily(47)[769D4408-2D1B-3B65-89D1-4C3C547099E3]@0xffffff7f85407000
BSD process name corresponding to current thread: kernel_task

I have tried to debug this timeout, which happens at random times, but there is one commonality: it only happens when the IGP is in use and the display is sleeping.

The IGP going into a low-power mode seems to disrupt power to the NVMe, causing it to crash/reset, and thus causing the timeout. The NVMe keeps SMART statistics on power-offs, and I have recorded this anomaly:

Power Cycles: 3,814
Power On Hours: 202
Unsafe Shutdowns: 3,794
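As a quick sanity check on these counters, the proportion of unsafe shutdowns can be worked out directly. This is a minimal sketch that uses only the three values reported above; the variable names are illustrative.

```python
# SMART counters reported by the drive (values copied from the post above).
power_cycles = 3814
power_on_hours = 202
unsafe_shutdowns = 3794

# Fraction of power cycles that the drive recorded as unsafe shutdowns.
unsafe_ratio = unsafe_shutdowns / power_cycles

# Average number of unsafe shutdowns per powered-on hour.
unsafe_per_hour = unsafe_shutdowns / power_on_hours

print(f"unsafe shutdowns: {unsafe_ratio:.1%} of power cycles")
print(f"about {unsafe_per_hour:.1f} unsafe shutdowns per power-on hour")
```

Nearly every recorded power cycle is an unsafe one, which fits the theory that the drive is losing power unexpectedly rather than being shut down cleanly.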

I have not been able to figure out exactly how the IGP is causing the NVMe to lose power, but I suspect it may be related to this issue (RC6).

I modified the CFL FB kext with these changes, which seem to completely solve the KP issue:

<key>RenderStandby</key><integer>0</integer>
<key>SetRC6Voltage</key><integer>1</integer>
<key>SupportPSRwithExternalDisplay</key><integer>0</integer>
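For anyone who wants to generate the same three overrides programmatically, here is a rough sketch using Python's standard plistlib module. It only serializes the three key/value pairs listed above. Where the resulting dictionary belongs inside the framebuffer kext's Info.plist is not shown in this post, so the placement is left out, and the name feature_control_overrides is purely illustrative.

```python
import plistlib

# The three overrides from the post. The surrounding Info.plist structure
# is not described here, so only the key/value pairs are modelled.
feature_control_overrides = {
    "RenderStandby": 0,
    "SetRC6Voltage": 1,
    "SupportPSRwithExternalDisplay": 0,
}

# Serialize to XML plist format so the fragment can be inspected or merged by hand.
xml_bytes = plistlib.dumps(feature_control_overrides, fmt=plistlib.FMT_XML)
print(xml_bytes.decode("utf-8"))
```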

Have you guys seen IGP power saving cause any similar problems? I'm thinking there might be a way to work around this in WhateverGreen or NVMeFix to avoid having to create a plist-only kext to change these settings.

0xfeedface-turbo commented 4 years ago

I forgot to mention that I spent a lot of time troubleshooting this before discovering the cause.

I tried different NVMe cards, different motherboards, NVMe heatsinks, built-in M.2 slots vs PCI adapter cards, UEFI PCI power settings, enabling/disabling ASPM, etc., but the kernel panic always recurred. Sometimes the VID/PID would read as 0xffff.

Onboard PCH IGE, AHCI, USB never had an issue at all, only NVMe. I'm guessing it's some kind of UEFI firmware bug?

07151129 commented 4 years ago

That's an extremely curious bug, thanks for suggesting a fix. I think force disabling RC6 by default in the FeatureControl dict of the framebuffer IORegistryEntry is a good immediate solution.

Were you able to isolate the issue just to a single key of this dictionary?

It is worth mentioning that you can also disable render standby by passing the boot argument forceRenderStandby=0.

0xfeedface-turbo commented 4 years ago

Thanks for the tip on the bootarg. I am pretty sure that's it.

It can take hours for the panic to happen, but I set RenderStandby back to 1 and got a panic almost immediately. I have reverted the previous changes and am testing with just forceRenderStandby=0 right now, and it hasn't panicked so far.

I am not sure what the power impact of this change is. This is a desktop system, but the same problem could be happening with laptops. One of the Linux posts mentions disabling coarse power gating as the better option. There is a key CoarsePowerGatingSelect, but I haven't deduced what the values mean yet.

07151129 commented 4 years ago

RenderStandby refers to RC6, the lowest-power idle render state. It has been notoriously buggy and required workarounds, both in Linux and Windows.

Coarse power gating is another mechanism used in GEN9 to transition Render and Media engines to sleep. The two appear to be independent in principle. The CoarsePowerGatingSelect bits 0 and 1 are used to enable Render and Media CPG, respectively. An older version of i915 used to disable Render CPG (https://patchwork.kernel.org/patch/6193051/), but apparently it is now enabled along with RC6.
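To make the bit layout concrete, here is a small sketch that decodes a CoarsePowerGatingSelect value using only the two bits described above (bit 0 for Render CPG, bit 1 for Media CPG). The function and constant names are made up for illustration, and any higher bits are simply ignored here.

```python
RENDER_CPG_BIT = 1 << 0  # bit 0: Render coarse power gating
MEDIA_CPG_BIT = 1 << 1   # bit 1: Media coarse power gating


def decode_cpg_select(value: int) -> dict:
    """Report which engines have coarse power gating enabled for a given value."""
    return {
        "render_cpg": bool(value & RENDER_CPG_BIT),
        "media_cpg": bool(value & MEDIA_CPG_BIT),
    }


for value in range(4):
    print(value, decode_cpg_select(value))
```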

0xfeedface-turbo commented 4 years ago

Thanks for the info, it has saved me a lot of time!

I did some testing with RenderStandby=1 and CoarsePowerGatingSelect=0 and I was actually able to get the same NVMe crash with the display ON for the first time. Do you know what bit 2 is used for? The default in the CFL FB kext is 4, and disabling that bit seems to make a difference.

Setting forceRenderStandby=0 in boot-args solves the crashes completely.
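If you manage the boot-args string with a script, a helper along these lines can add the flag without duplicating an existing entry. It is only a sketch of the string handling: with_boot_arg is a hypothetical helper, the "-v keepsyms=1" input is just an example of arguments that might already be present, and writing the result back (via NVRAM or the bootloader configuration) is left out.

```python
def with_boot_arg(boot_args: str, key: str, value: str) -> str:
    """Return boot_args with key=value set, replacing any existing entry for key."""
    kept = [
        token
        for token in boot_args.split()
        if token != key and not token.startswith(key + "=")
    ]
    kept.append(f"{key}={value}")
    return " ".join(kept)


# Example: add the flag to an existing argument string.
print(with_boot_arg("-v keepsyms=1", "forceRenderStandby", "0"))
```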

Intel Power Gadget reports that the IGP frequency never drops below 350 MHz and total power consumption is approximately 1 W higher than with RenderStandby enabled.

I'm still at a loss as to why RC6 on the IGP would be affecting the NVMe at all, though.

07151129 commented 4 years ago

CoarsePowerGatingSelect=4 uses the value from the platform info struct at offset 0x58 (gPlatformInformationList, see IntelFramebuffer.bt) to configure CPG:

AppleIntelFramebufferController::getCPGControl
...
    cpgsel = OSMetaClassBase::safeMetaCast(v3, OSNumber::metaClass);
    if ( cpgsel )
    {
      cpgsel = (cpgsel->vtbl->unsigned32BitValue)(cpgsel);
      if ( cpgsel != 4 )
        goto LABEL_7;
      this->CoarsePowerGatingSelect = 0;
      v4 = this->platformInfo->member22;
      cpgsel = (&dword_0 + 2);
      if ( _bittest(&v4, 0x10u) )
      {
        this->CoarsePowerGatingSelect = 1;
        cpgsel = (&dword_0 + 3);
      }
      if ( _bittest(&v4, 0x11u) )
LABEL_7:
        this->CoarsePowerGatingSelect = cpgsel;
    }
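The control flow above is easier to follow when written without the goto. The following is one reading of the decompiled output, not Apple's source: a value other than 4 is stored as-is, while 4 is replaced by a value built from bits 16 and 17 (0x10 and 0x11) of the platform info field. The function name and constants are illustrative, and the case where the property is missing is not modelled.

```python
USE_PLATFORM_INFO = 4     # special value: derive the setting from platform info
PLATFORM_BIT_FIRST = 0x10   # tested first in the decompiled code, yields result bit 0
PLATFORM_BIT_SECOND = 0x11  # tested second, yields result bit 1


def resolve_cpg_select(requested: int, platform_flags: int) -> int:
    """Mirror the getCPGControl logic shown above as plain branches."""
    if requested != USE_PLATFORM_INFO:
        # The LABEL_7 path: the requested value is stored unchanged.
        return requested
    result = 0
    if platform_flags & (1 << PLATFORM_BIT_FIRST):
        result |= 1
    if platform_flags & (1 << PLATFORM_BIT_SECOND):
        result |= 2
    return result


# With both platform bits set, the default of 4 resolves to 3.
print(resolve_cpg_select(4, (1 << 0x10) | (1 << 0x11)))
```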

It's a complete mystery why there is interference between GPU and PCI. If you can reproduce it on Linux with i915, then this could be reported to Intel.

07151129 commented 4 years ago

By the way, the value CSTS=0xffffffff also looks suspicious according to the spec.
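A register that reads back as all ones is typically what a PCI read returns when the device has stopped responding, which would fit the power-loss theory. The sketch below pulls the CSTS and VID/DID fields out of the panic string and flags that case. The parse_nvme_panic helper is hypothetical, and splitting VID/DID into a vendor ID (low 16 bits) and a device ID (high 16 bits) assumes the field is the raw first dword of PCI configuration space.

```python
import re

# Matches the two hexadecimal fields of interest in the panic message.
PANIC_FIELD = re.compile(r"(CSTS|VID/DID)=0x([0-9a-fA-F]+)")


def parse_nvme_panic(message: str) -> dict:
    """Extract CSTS and VID/DID from an IONVMeController panic string."""
    fields = {name: int(digits, 16) for name, digits in PANIC_FIELD.findall(message)}
    info = {}
    if "CSTS" in fields:
        info["csts"] = fields["CSTS"]
        # All ones suggests the read never reached a working controller.
        info["csts_all_ones"] = fields["CSTS"] == 0xFFFFFFFF
    if "VID/DID" in fields:
        # Assumption: vendor ID in the low half, device ID in the high half.
        info["vendor_id"] = fields["VID/DID"] & 0xFFFF
        info["device_id"] = fields["VID/DID"] >> 16
    return info


message = (
    'nvme: "Fatal error occurred. CSTS=0xffffffff US[1]=0x0 '
    'US[0]=0x5a1 VID/DID=0x500215b7 . FW Revision=102000WD"'
)
info = parse_nvme_panic(message)
print({key: hex(val) if isinstance(val, int) and not isinstance(val, bool) else val
       for key, val in info.items()})
```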

A similar bug in Linux: https://bugs.freedesktop.org/show_bug.cgi?id=108546. Apparently, it is a BIOS issue, although in that case intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0 did not help.

vit9696 commented 4 years ago

Thanks for your help! Added a comment to WhateverGreen FAQ. Other FAQs will also need to be updated.

CC @Andrey1970AppleLife @khronokernel @PMheart

Mateo1234454545 commented 4 years ago

I added the forceRenderStandby=0 boot arg as well, and the IGPU is stuck at 0.3 GHz.

malhal commented 4 years ago

Maybe this state is when TRIM runs, and that is what is crashing? Try sudo trimforce disable and reboot. If you re-enable it later, it is recommended to run Disk Utility's First Aid.

blodt commented 3 years ago

It's back doing it again on my machine after a month or so of no issues

Getting more consistent too

malhal commented 3 years ago

I haven't had this panic since I disabled TRIM

blodt commented 3 years ago

> I haven't had this panic since I disabled TRIM

Will try that - thank you!

Mateo1234454545 commented 3 years ago

> I haven't had this panic since I disabled TRIM

How did you disable TRIM? I tried your command, but after a reboot NVMe TRIM is still enabled. Maybe this command is only for SATA SSDs?

1alessandro1 commented 3 years ago

@Mateo1234454545

blodt commented 3 years ago

I ended up having to do a fresh Big Sur install and restore my install from Time Machine.

That all went great and I'm back up and running with no freezes again, and I've used @1alessandro1's tips/settings above in the hope that they might cure it long term.

I don't think I will really know for a month or so, as that's how long the freezing issue took to reappear after the last time I did all this.

I'll report back in hopes of helping anyone else down the line.

Thank you all