FoldingAtHome / fah-issues


FAH Core Process Causing OS X Kernel Panic #1543

Open guywmartin opened 4 years ago

guywmartin commented 4 years ago


Your Environment


Expected Behavior

The FAH core should not cause kernel panics that crash my entire iMac Pro.


Current Behavior

The system kernel panics and reboots.

Kernel panic log upon reboot:

mp_kdp_enter() timed-out on cpu 10, NMI-ing
mp_kdp_enter() NMI pending on cpus: 0 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
mp_kdp_enter() timed-out during locked wait after NMI;expected 36 acks but received 1 after 7854836 loops in 1152000000 ticks
panic(cpu 10 caller 0xffffff800c533b08): "Ticket spinlock timeout; start: 0x2c6930c36360, end: 0x2c69756d8360, current: 0x2c69756dc716, lock: 0xffffff800cea1780, *lock: 0xb7, waiting for 0xc4, pre-NMI owner: 0, current owner: 0, owner CPU: 0xffffffff"@/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/xnu/xnu-6153.121.2/osfmk/kern/tlock.c:162
Backtrace (CPU 10), Frame : Return Address
0xffffffd40572baf0 : 0xffffff800c51f5cd
0xffffffd40572bb40 : 0xffffff800c658b05
0xffffffd40572bb80 : 0xffffff800c64a68e
0xffffffd40572bbd0 : 0xffffff800c4c5a40
0xffffffd40572bbf0 : 0xffffff800c51ec97
0xffffffd40572bcf0 : 0xffffff800c51f087
0xffffffd40572bd40 : 0xffffff800ccc27cc
0xffffffd40572bdb0 : 0xffffff800c533b08
0xffffffd40572be20 : 0xffffff800c540588
0xffffffd40572be90 : 0xffffff800c53fc16
0xffffffd40572bee0 : 0xffffff800c54c43e
0xffffffd40572bef0 : 0xffffff800c54c4cd
0xffffffd40572bf00 : 0xffffff800c630d05
0xffffffd40572bfa0 : 0xffffff800c4c6226

BSD process name corresponding to current thread: FahCore_a7
Boot args: chunklist-security-epoch=0 -chunklist-no-rev2-dev

Mac OS version: 19F101
Kernel version: Darwin Kernel Version 19.5.0: Tue May 26 20:41:44 PDT 2020; root:xnu-6153.121.2~2/RELEASE_X86_64
Kernel UUID: 54F1A78D-6F41-32BD-BFED-4381F9F6E2EF
Kernel slide: 0x000000000c200000
Kernel text base: 0xffffff800c400000
__HIB text base: 0xffffff800c300000
System model name: iMacPro1,1 (Mac-7BA5B2D9E42DDD94)
System shutdown begun: NO
System uptime in nanoseconds: 21127524394543
last loaded kext at 19938280017523: com.paragon-software.lvm-for-mac 1 (addr 0xffffff7f906d1000, size 159744)
last unloaded kext at 20077439290647: com.paragon-software.lvm-for-mac 1 (addr 0xffffff7f906d1000, size 94208)
loaded kexts:
com.vidyo.driver.VidyoCamera 1.0.0d1 com.paragon-software.filesystems.extfs 30.3.11 @filesystems.afpfs 11.2 @nke.asp-tcp 8.1 @kext.AMDRadeonServiceManager 3.0.9 >!AUpstreamUserClient 3.6.8 @kext.AMDFramebuffer 3.0.9 @kext.AMDRadeonX5000 3.0.9 >!AGraphicsDevicePolicy 5.2.4 @fileutil 20.036.15 @AGDCPluginDisplayMetrics 5.2.4 >!ATopCaseHIDEventDriver 3430.1 >!AHV 1 |IOUserEthernet 1.0.1 |IO!BSerialManager 7.0.5f6 >AGPM 111.4.4 >!APlatformEnabler 2.7.0d0 >X86PlatformShim 1.0.0 >pmtelemetry 1 @Dont_Steal_Mac_OS_X 7.0.0 @kext.AMD10000!C 3.0.9 >BridgeAudioCommunication 6.70.7 >!AThunderboltIP 3.1.4 >!AHIDALSService 1 >!AGFXHDA 100.1.428 >!ABridgeAudio!C 6.70.7 >!A!ISlowAdaptiveClocking 4.0.0 >!A!IPCHPMC 2.0.1 >!AAVEBridge 6.1 >!AMCCSControl 1.14 >!A!IMCEReporter 115 @filesystems.autofs 3.0 >!UCardReader 489.120.1 >BCMWLANFirmware4355.Hashstore 1 >BCMWLANFirmware4364.Hashstore 1 >BCMWLANFirmware4377.Hashstore 1 @filesystems.apfs 1412.120.2 @filesystems.hfs.kext 522.100.5 @BootCache 40 @!AFSCompression.!AFSCompressionTypeDataless 1.0.0d1 @!AFSCompression.!AFSCompressionTypeZlib 1.0.0 >!AVirtIO 1.0 >!ABCMWLANBusInterfacePCIe 1 @private.KextAudit 1.0 >!AACPIButtons 6.1 >!ASMBIOS 2.1 >!AACPIEC 6.1 >!AAPIC 1.7 $!AImage4 1 @nke.applicationfirewall 303 $TMSafetyNet 8 @!ASystemPolicy 2.0.0 |EndpointSecurity 1 $SecureRemotePassword 1.0 @kext.AMDRadeonX5100HWLibs 1.0 |IOAccelerator!F2 438.5.4 @kext.AMDRadeonX5000HWServices 3.0.9 >!AHIDKeyboard 209 |IOVideo!F 1.2.1 |IOStream!F 1.1.0 >!AMultitouchDriver 3440.1 >!AInputDeviceSupport 3440.8 >!AHS!BDriver 3430.1 >IO!BHIDDriver 7.0.5f6 |IOAVB!F 850.1 >!ASSE 1.0 @!AGPUWrangler 5.2.4 >!UAudio 323.1 |IONDRVSupport 576.1 @kext.AMDSupport 3.0.9 >X86PlatformPlugin 1.0.0 |IO!BHost!CUARTTransport 7.0.5f6 |IO!BHost!CTransport 7.0.5f6 >!A!ILpssUARTv1 3.0.60 >!A!ILpssUARTCommon 3.0.60 >!AOnboardSerial 1.0 |IOSlowAdaptiveClocking!F 1.0.0 >IOPlatformPlugin!F 6.0.0d8 >!ASMBus!C 1.0.18d1
@!AGraphicsDeviceControl 5.2.4 |IOGraphics!F 576.1 >!AGraphicsControl 5.2.4 @plugin.IOgPTPPlugin 840.3 |IOEthernetAVB!C 1.1.0 @kext.triggers 1.0 >usb.cdc.ncm 5.0.0 >usb.cdc 5.0.0 >usb.networking 5.0.0 >usb.!UHostCompositeDevice 1.2 >usb.!UVHCIBCE 1.2 >usb.!UVHCI 1.2 >usb.!UVHCICommonBCE 1.0 >usb.!UVHCICommon 1.0 >!AEffaceableNOR 1.0 |IOSurface 269.11 |IOBufferCopy!C 1.1.0 |IOBufferCopyEngine!F 1 @filesystems.hfs.encodings.kext 1 |IONVMe!F 2.1.0 |IOAudio!F 300.2 @vecLib.kext 1.2.0 >!ABCMWLANCore 1.0.0 >IOImageLoader 1.0.0 |IOSerial!F 11 |IO80211!FV2 1200.12.2b1 >corecapture 1.0.4 |IOSkywalk!F 1 >!AThunderboltDPInAdapter 6.2.6 >!AThunderboltDPAdapter!F 6.2.6 >!AHPM 3.4.4 >!A!ILpssI2C!C 3.0.60 >!A!ILpssDmac 3.0.60 >!A!ILpssI2C 3.0.60 >!AThunderboltPCIDownAdapter 2.5.4 >!AThunderboltNHI 5.8.6 |IOThunderbolt!F 7.6.1 |IOUSB!F 900.4.2 >!AEthernetAquantiaAqtion 1.0.64 >mDNSOffloadUserClient 1.0.1b8 >usb.!UXHCIPCI 1.2 >usb.!UXHCI 1.2 >!AEFINVRAM 2.1 >!AEFIRuntime 2.1 >!ASMCRTC 1.0 |IOSMBus!F 1.1 |IOHID!F 2.0.0 $quarantine 4 $sandbox 300.0 @kext.!AMatch 1.0.0d1 >!AKeyStore 2 >!UTDM 489.120.1 |IOSCSIBlockCommandsDevice 422.120.3 >!ACredentialManager 1.0 >KernelRelayHost 1 >!ASEPManager 1.0.1 >IOSlaveProcessor 1 >!AFDEKeyStore 28.30 >!AEffaceable!S 1.0 >!AMobileFileIntegrity 1.0.5 @kext.CoreTrust 1 |CoreAnalytics!F 1 |IOTimeSync!F 840.3 |IONetworking!F 3.4 >DiskImages 493.0.0 |IO!B!F 7.0.5f6 |IO!BPacketLogger 7.0.5f6 |IOUSBMass!SDriver 157.121.1 |IOSCSIArchitectureModel!F 422.120.3 |IO!S!F 2.1 |IOUSBHost!F 1.2 >usb.!UCommon 1.0 >!UHostMergeProperties 1.2 >!ABusPower!C 1.0 |IOReport!F 47 >!AACPIPlatform 6.1 >!ASMC 3.1.9 >watchdog 1 |IOPCI!F 2.9 |IOACPI!F 1.4 @kec.pthread 1 @kec.corecrypto 1.0 @kec.Libm 1
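Two figures in the panic header can be checked with a little arithmetic. The following is a minimal Python sketch. It assumes the `start`/`end` values in the "Ticket spinlock timeout" message are TSC ticks running at the Xeon W-2191B's nominal 2.3 GHz; the log does not state the timebase. Under that assumption, `end - start` equals exactly the 1152000000 ticks that also appear in the `mp_kdp_enter()` line, roughly half a second. The pending list also shows that all 35 other CPUs (every CPU except 10, the one that panicked) never acknowledged the NMI. The helper names are illustrative only.

```python
import re

# Excerpts copied verbatim from the panic report above.
PENDING = ("mp_kdp_enter() NMI pending on cpus: 0 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 "
           "18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35")
SPINLOCK = ('"Ticket spinlock timeout; start: 0x2c6930c36360, end: 0x2c69756d8360, '
            'current: 0x2c69756dc716, lock: 0xffffff800cea1780, *lock: 0xb7, '
            'waiting for 0xc4, pre-NMI owner: 0, current owner: 0, owner CPU: 0xffffffff"')

# Assumption: timestamps are TSC ticks at the CPU's nominal 2.3 GHz clock.
ASSUMED_TICKS_PER_SECOND = 2.3e9


def pending_cpus(line: str) -> list[int]:
    """Return the CPU numbers listed after 'pending on cpus:'."""
    _, _, tail = line.partition("pending on cpus:")
    return [int(token) for token in tail.split()]


def spinlock_window(message: str) -> int:
    """Return end - start, in ticks, from a 'Ticket spinlock timeout' message."""
    fields = dict(re.findall(r"(start|end): (0x[0-9a-f]+)", message))
    return int(fields["end"], 16) - int(fields["start"], 16)


cpus = pending_cpus(PENDING)
ticks = spinlock_window(SPINLOCK)
print(f"{len(cpus)} CPUs still pending the NMI; CPU 10 absent: {10 not in cpus}")
print(f"spin window: {ticks} ticks, about {ticks / ASSUMED_TICKS_PER_SECOND:.2f} s")
```

Changing the assumed tick rate only changes the seconds figure; the tick count and the CPU count come straight from the log.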

Possible Solution (Optional)

I have no immediate solution, but I will pause folding on this node (36-core iMac Pro) until this is resolved. I'm still folding on a separate 16-core iMac and a 4-core MacBook Air.


Steps To Reproduce

I don't know how to reproduce this yet.


Context

Context from FAH Core logs:

*********************** Log Started 2020-07-16T00:01:18Z ***********************
00:01:18:Trying to access database...
00:01:18:Successfully acquired database lock
00:01:18:Read GPUs.txt
00:01:18:Enabled folding slot 00: READY cpu:36
00:01:18:ERROR:Failed to register for display power changes
00:01:18:****************************** FAHClient ******************************
00:01:18: Version: 7.6.13
00:01:18: Author: Joseph Coffland
00:01:18: Copyright: 2020 foldingathome.org
00:01:18: Homepage: https://foldingathome.org/
00:01:18: Date: Apr 27 2020
00:01:18: Time: 21:20:45
00:01:18: Revision: 5a652817f46116b6e135503af97f18e094414e3b
00:01:18: Branch: master
00:01:18: Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
00:01:18: Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
00:01:18: -Wno-unused-local-typedefs -stdlib=libc++
00:01:18: Platform: darwin 19.2.0
00:01:18: Bits: 64
00:01:18: Mode: Release
00:01:18: Config: /Library/Application Support/FAHClient/config.xml
00:01:18:******************************** CBang ********************************
00:01:18: Date: Apr 24 2020
00:01:18: Time: 17:07:50
00:01:18: Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
00:01:18: Branch: master
00:01:18: Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
00:01:18: Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
00:01:18: -Wno-unused-local-typedefs -stdlib=libc++ -fPIC
00:01:18: Platform: darwin 19.2.0
00:01:18: Bits: 64
00:01:18: Mode: Release
00:01:18:******************************* System ********************************
00:01:18: CPU: Intel(R) Xeon(R) W-2191B CPU @ 2.30GHz
00:01:18: CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
00:01:18: CPUs: 36
00:01:18: Memory: 256.00GiB
00:01:18:Free Memory: 248.33GiB
00:01:18: Threads: POSIX_THREADS
00:01:18: OS Version: 10.15
00:01:18:Has Battery: false
00:01:18: On Battery: false
00:01:18: UTC Offset: -7
00:01:18: PID: 120
00:01:18: CWD: /Library/Application Support/FAHClient
00:01:18: OS: Darwin 19.5.0 x86_64
00:01:18: OS Arch: AMD64
00:01:18: GPUs: 0
00:01:18: CUDA: Not detected: Failed to open dynamic library 'libcuda.dylib':
00:01:18: dlopen(libcuda.dylib, 1): image not found
00:01:18: OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.dylib':
00:01:18: dlopen(libOpenCL.dylib, 1): image not found
00:01:18:******************************* libFAH ********************************
00:01:18: Date: Apr 15 2020
00:01:18: Time: 14:43:28
00:01:18: Revision: 216968bc7025029c841ed6e36e81a03a316890d3
00:01:18: Branch: master
00:01:18: Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
00:01:18: Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
00:01:18: -Wno-unused-local-typedefs -stdlib=libc++
00:01:18: Platform: darwin 19.2.0
00:01:18: Bits: 64
00:01:18: Mode: Release
00:01:18:***********************************************************************
00:01:18:
00:01:18:WU01:FS00:Starting
00:01:18:WU01:FS00:Running FahCore: /usr/local/bin/FAHCoreWrapper "/Library/Application Support/FAHClient/cores/cores.foldingathome.org/osx/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7" -dir 01 -suffix 01 -version 706 -lifeline 120 -checkpoint 15 -np 36
00:01:18:WU01:FS00:Started FahCore on PID 278
00:01:18:WU01:FS00:Core PID:279
00:01:18:WU01:FS00:FahCore 0xa7 started
00:01:18:WU01:FS00:0xa7:*********************** Log Started 2020-07-16T00:01:18Z ***********************
00:01:18:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
00:01:18:WU01:FS00:0xa7: Type: 0xa7
00:01:18:WU01:FS00:0xa7: Core: Gromacs
00:01:18:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 706 -lifeline 278 -checkpoint 15 -np 36
00:01:18:WU01:FS00:0xa7:************************************ CBang *************************************
00:01:18:WU01:FS00:0xa7: Date: Nov 27 2019
00:01:18:WU01:FS00:0xa7: Time: 03:27:01
00:01:18:WU01:FS00:0xa7: Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
00:01:18:WU01:FS00:0xa7: Branch: master
00:01:18:WU01:FS00:0xa7: Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
00:01:18:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
00:01:18:WU01:FS00:0xa7: -Wno-unused-local-typedefs -stdlib=libc++ -fPIC
00:01:18:WU01:FS00:0xa7: Platform: darwin 19.0.0
00:01:18:WU01:FS00:0xa7: Bits: 64
00:01:18:WU01:FS00:0xa7: Mode: Release
00:01:18:WU01:FS00:0xa7:************************************ System ************************************
00:01:18:WU01:FS00:0xa7: CPU: Intel(R) Xeon(R) W-2191B CPU @ 2.30GHz
00:01:18:WU01:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
00:01:18:WU01:FS00:0xa7: CPUs: 36
00:01:18:WU01:FS00:0xa7: Memory: 256.00GiB
00:01:18:WU01:FS00:0xa7:Free Memory: 248.30GiB
00:01:18:WU01:FS00:0xa7: Threads: POSIX_THREADS
00:01:18:WU01:FS00:0xa7: OS Version: 10.15
00:01:18:WU01:FS00:0xa7:Has Battery: false
00:01:18:WU01:FS00:0xa7: On Battery: false
00:01:18:WU01:FS00:0xa7: UTC Offset: -7
00:01:18:WU01:FS00:0xa7: PID: 279
00:01:18:WU01:FS00:0xa7: CWD: /Library/Application Support/FAHClient/work
00:01:18:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
00:01:18:WU01:FS00:0xa7: Version: 0.0.19
00:01:18:WU01:FS00:0xa7: Author: Joseph Coffland
00:01:18:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
00:01:18:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
00:01:18:WU01:FS00:0xa7: Date: Nov 25 2019
00:01:18:WU01:FS00:0xa7: Time: 16:41:59
00:01:18:WU01:FS00:0xa7: Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
00:01:18:WU01:FS00:0xa7: Branch: master
00:01:18:WU01:FS00:0xa7: Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
00:01:18:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
00:01:18:WU01:FS00:0xa7: -Wno-unused-local-typedefs -stdlib=libc++
00:01:18:WU01:FS00:0xa7: Platform: darwin 19.0.0
00:01:18:WU01:FS00:0xa7: Bits: 64
00:01:18:WU01:FS00:0xa7: Mode: Release
00:01:18:WU01:FS00:0xa7:************************************ Build *************************************
00:01:18:WU01:FS00:0xa7: SIMD: avx_256
00:01:18:WU01:FS00:0xa7:********************************************************************************
00:01:18:WU01:FS00:0xa7:Project: 14811 (Run 1924, Clone 0, Gen 126)
00:01:18:WU01:FS00:0xa7:Unit: 0x00000092455e42065ec19b57d65be96f
00:01:18:WU01:FS00:0xa7:Digital signatures verified
00:01:18:WU01:FS00:0xa7:Calling: mdrun -s frame126.tpr -o frame126.trr -cpt 15 -nt 36
00:01:18:WU01:FS00:0xa7:Steps: first=0 total=250000
00:01:21:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
00:01:46:WU01:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
00:02:15:WU01:FS00:0xa7:Completed 5000 out of 250000 steps (2%)
00:02:42:WU01:FS00:0xa7:Completed 7500 out of 250000 steps (3%)
00:03:06:WU01:FS00:0xa7:Completed 10000 out of 250000 steps (4%)
00:03:40:WU01:FS00:0xa7:Completed 12500 out of 250000 steps (5%)
00:04:09:WU01:FS00:0xa7:Completed 15000 out of 250000 steps (6%)
00:04:35:WU01:FS00:0xa7:Completed 17500 out of 250000 steps (7%)
00:05:01:WU01:FS00:0xa7:Completed 20000 out of 250000 steps (8%)
00:05:29:WU01:FS00:0xa7:Completed 22500 out of 250000 steps (9%)
00:05:56:WU01:FS00:0xa7:Completed 25000 out of 250000 steps (10%)
00:06:21:WU01:FS00:0xa7:Completed 27500 out of 250000 steps (11%)
00:06:46:WU01:FS00:0xa7:Completed 30000 out of 250000 steps (12%)
00:07:11:WU01:FS00:0xa7:Completed 32500 out of 250000 steps (13%)
00:07:35:WU01:FS00:0xa7:Completed 35000 out of 250000 steps (14%)
00:07:59:WU01:FS00:0xa7:Completed 37500 out of 250000 steps (15%)
00:08:16:FS00:Paused
00:08:16:FS00:Shutting core down
00:08:17:WU01:FS00:0xa7:Caught signal SIGINT(2) on PID 279
00:08:17:WU01:FS00:0xa7:Exiting, please wait. . .
00:08:22:WU01:FS00:0xa7:Folding@home Core Shutdown: INTERRUPTED
00:08:22:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
00:08:25:Removing old file 'configs/config-20200504-210829.xml'
00:08:25:Saving configuration to config.xml
00:08:25:
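The progress lines give a rough throughput figure for this slot before it was paused. The following is a small Python sketch. It assumes all lines fall on the same day (no wrap past midnight) and treats one "frame" as 1% of the work unit. It averages the wall-clock time between the 1% and 15% lines: 373 s over 14 frames, about 26.6 s per frame, or roughly 44 minutes per work unit at that pace. The function names are illustrative, not part of any FAH tool.

```python
import re

# Progress lines copied from the FahCore log above.
LOG = """\
00:01:21:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
00:01:46:WU01:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
00:02:15:WU01:FS00:0xa7:Completed 5000 out of 250000 steps (2%)
00:02:42:WU01:FS00:0xa7:Completed 7500 out of 250000 steps (3%)
00:03:06:WU01:FS00:0xa7:Completed 10000 out of 250000 steps (4%)
00:03:40:WU01:FS00:0xa7:Completed 12500 out of 250000 steps (5%)
00:04:09:WU01:FS00:0xa7:Completed 15000 out of 250000 steps (6%)
00:04:35:WU01:FS00:0xa7:Completed 17500 out of 250000 steps (7%)
00:05:01:WU01:FS00:0xa7:Completed 20000 out of 250000 steps (8%)
00:05:29:WU01:FS00:0xa7:Completed 22500 out of 250000 steps (9%)
00:05:56:WU01:FS00:0xa7:Completed 25000 out of 250000 steps (10%)
00:06:21:WU01:FS00:0xa7:Completed 27500 out of 250000 steps (11%)
00:06:46:WU01:FS00:0xa7:Completed 30000 out of 250000 steps (12%)
00:07:11:WU01:FS00:0xa7:Completed 32500 out of 250000 steps (13%)
00:07:35:WU01:FS00:0xa7:Completed 35000 out of 250000 steps (14%)
00:07:59:WU01:FS00:0xa7:Completed 37500 out of 250000 steps (15%)
"""

PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):.*Completed (\d+) out of (\d+) steps \((\d+)%\)$"
)


def parse_progress(text: str) -> list[tuple[int, int]]:
    """Return (seconds since midnight, percent) for each progress line.

    Assumes every line falls on the same day (no wrap past midnight).
    """
    points = []
    for line in text.splitlines():
        match = PROGRESS.match(line)
        if match:
            hh, mm, ss, _done, _total, pct = map(int, match.groups())
            points.append((hh * 3600 + mm * 60 + ss, pct))
    return points


def seconds_per_frame(points: list[tuple[int, int]]) -> float:
    """Average seconds per 1% frame, ignoring the 0% startup line."""
    frames = [p for p in points if p[1] > 0]
    (t_first, f_first), (t_last, f_last) = frames[0], frames[-1]
    return (t_last - t_first) / (f_last - f_first)


points = parse_progress(LOG)
tpf = seconds_per_frame(points)
print(f"{tpf:.1f} s per frame, roughly {tpf * 100 / 60:.0f} min for the whole WU")
```

This excerpt ends with the slot being paused at 00:08:16 and the core returning INTERRUPTED, so the throughput figure covers only those first 15 frames.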

bb30994 commented 4 years ago

OS X does not support GPU folding (that's why no GPU slot was configured automatically). Apple has its share of policies that have blocked FAH, including its deprecation of OpenCL. You can, however, fold with your CPU.

Are you the Guy Martin I know from Ogden?

guywmartin commented 4 years ago

Hi Bruce,

I'm not sure I'm the same Guy Martin you know; I've only been in Oregon for a couple of years.

I'm a bit confused by your response, though. I understand that GPU folding isn't supported. In fact, the FAH client has been folding quite a bit on this machine (36-core iMac Pro) using only the CPU, and so have my wife's 16-core iMac and my 4-core MacBook Air.

I have a single FAHControl instance on this machine that remotely controls the clients on my wife's iMac and my MacBook Air, and those machines haven't had any issues. I don't know whether the cause is something in the work unit (WU) or something else, but the FAHCore process on my iMac Pro has triggered a kernel panic twice in the last two days, which is why I filed this bug (and paused folding on this machine).

Thanks.

kbernhagen commented 3 years ago

Does the kernel panic go away if you reduce the slot CPU count to 32?
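The log shows the core launched with `-np 36`, so this suggestion amounts to lowering the CPU slot's thread count. That is normally done from FAHControl. The following is a sketch of scripting the same change with Python's standard `xml.etree.ElementTree`. It assumes the FAHClient v7 `config.xml` (at `/Library/Application Support/FAHClient/config.xml` per the log) stores the count as a `<cpus v='N'/>` element inside a `<slot id='0' type='CPU'>` element; check your own file before relying on that. The sample config and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical FAHClient v7 config with one CPU slot (assumed layout).
SAMPLE_CONFIG = """<config>
  <slot id='0' type='CPU'>
    <cpus v='36'/>
  </slot>
</config>"""


def set_cpu_slot_threads(xml_text: str, threads: int, slot_id: str = "0") -> str:
    """Return the config XML with the given CPU slot's <cpus v=.../> set to `threads`."""
    root = ET.fromstring(xml_text)
    for slot in root.iter("slot"):
        if slot.get("id") == slot_id and slot.get("type") == "CPU":
            cpus = slot.find("cpus")
            if cpus is None:  # slot had no explicit count, so add one
                cpus = ET.SubElement(slot, "cpus")
            cpus.set("v", str(threads))
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no CPU slot with id {slot_id!r}")


updated = set_cpu_slot_threads(SAMPLE_CONFIG, 32)
print(updated)
```

ElementTree's default parser discards XML comments, so back up the real file first. Stop FAHClient before writing the result back, because the client saves `config.xml` itself (the log shows it doing so after the pause).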

guywmartin commented 3 years ago

No idea, but I can try that and see. What I've found is that if I pause folding while something like the Zoom client is running, I usually don't see the issue.
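The pause-while-Zoom-runs workaround could be automated. The following is a minimal Python sketch that rests on assumptions this thread does not confirm. It assumes FAHClient v7 accepts plain-text `pause` and `unpause` commands on its local command port 36330 with no password configured, and that Zoom's macOS process is named `zoom.us`. All function names are hypothetical.

```python
import socket
import subprocess
import time

# Assumptions: FAHClient v7 command port on localhost, no password set;
# "zoom.us" is the name of Zoom's executable on macOS.
FAH_ADDRESS = ("127.0.0.1", 36330)
WATCHED_APPS = frozenset({"zoom.us"})


def running_process_names() -> set[str]:
    """Basenames of all running executables, from `ps -A -o comm=`."""
    out = subprocess.run(
        ["ps", "-A", "-o", "comm="], capture_output=True, text=True, check=True
    ).stdout
    return {line.strip().rsplit("/", 1)[-1] for line in out.splitlines() if line.strip()}


def desired_command(names: set[str], watched: frozenset[str] = WATCHED_APPS) -> str:
    """Return 'pause' if any watched app is running, otherwise 'unpause'."""
    return "pause" if names & watched else "unpause"


def send_fah_command(command: str) -> None:
    """Send one newline-terminated command to the FAHClient command port."""
    with socket.create_connection(FAH_ADDRESS, timeout=5) as sock:
        sock.sendall(command.encode("ascii") + b"\n")


def watch(interval_s: float = 30.0) -> None:
    """Poll forever, sending a command only when the desired state changes."""
    last = None
    while True:
        command = desired_command(running_process_names())
        if command != last:
            send_fah_command(command)
            last = command
        time.sleep(interval_s)
```

Calling `watch()` always sends a command on its first poll, so starting it while Zoom is closed will unpause a slot you had paused by hand.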


Guy Martin guy.w.martin@gmail.com

On Nov 23, 2020, at 2:12 PM, Kevin Bernhagen notifications@github.com wrote:

Does the kernel panic go away if you reduce the slot cpu count to 32?
