Closed bubbleguuum closed 3 years ago
...
[ 183.705062] corefreqk: disagrees about version of symbol module_layout
[ 411.623529] corefreqk: loading out-of-tree module taints kernel.
[ 411.623632] corefreqk: module verification failed: signature and/or required key missing - tainting kernel
[ 411.676056] CoreFreq(4:10): Processor [ 06_9E] Architecture [Coffee Lake/H] SMT [12/12]
...
[ 411.676453] caller SKL_IMC+0xf4/0x120 [corefreqk] mapping multiple BARs
...
[ 496.489387] CoreFreq: Suspend
...
So "CoreFreq: Resume" is missing from the log, which means "CoreFreqK_Resume()" was either not invoked or crashed. https://github.com/cyring/CoreFreq/blob/1887c7b554288e3e9e1f359c657c94a68b832be7/corefreqk.c#L13937
But was the Kernel registration successful prior to entering an STR transition: what state do you get for "PCI enablement"?
"PCI enablement" is set to ON.
CoreFreqK_Resume is mentioned in the bt.txt backtrace, so I'm assuming it is called and the kernel is crashing before the printk line?
Yes, just noticed that. Currently testing on my Intel Westmere. But I am intrigued by this cold page.
I can give any additional info you may need as I have the kernel crash dump.
Cannot reproduce (tried 3 times) with my W3690.
In corefreqk.c, can you replace the following function, then rebuild, reload and test suspend?
(This will avoid PCI probing during the resume transition.)
static int CoreFreqK_Resume(struct device *dev)
{	/* Probe Processor again */
	if (Arch[PUBLIC(RO(Proc))->ArchID].Query != NULL) {
		Arch[PUBLIC(RO(Proc))->ArchID].Query(PUBLIC(RO(Proc))->Service.Core);
	}
	/* Probe PCI again
	if (PUBLIC(RO(Proc))->Registration.PCI) {
		PUBLIC(RO(Proc))->Registration.PCI = CoreFreqK_ProbePCI() == 0;
	} */
	Controller_Start(1);
#ifdef CONFIG_CPU_FREQ
	Policy_Aggregate_Turbo();
#endif /* CONFIG_CPU_FREQ */
	BITSET(BUS_LOCK, PUBLIC(RW(Proc))->OS.Signal, NTFY); /* Notify Daemon */
	printk(KERN_NOTICE "CoreFreq: Resume\n");
	return (0);
}
Confirming that it does not crash with these lines commented out.
Thanks, I have an idea where to look.
Meanwhile, what do you get from corefreq-cli -M ?
Cannon Point [3EC4]
Controller #0    Dual Channel
Bus Rate 8000 MT/s    Bus Speed 7974 MT/s    DRAM Speed 2667 MHz

Cha   CL  RCD   RP  RAS  RRD  RFC   WR RTPr WTPr  FAW  B2B  CWL  CMD   REFI
#0    19   19   19   43    0  467    0   10   42    0    0   17   2T  10400
#1    19   19   19   43    0  467    0   10   42    0    0   17   2T  10400

    ddWR drWR srWR ddRW drRW srRW ddRR drRR srRR ddWW drWW srWW  CKE  ECC
#0     0    0    0    0    0    0    0    0    0    0    0    0    4    0
#1     0    0    0    0    0    0    0    0    0    0    0    0    4    0

DIMM Geometry for channel #0
Slot  Bank  Rank   Rows  Columns  Memory Size (MB)
#0      16     1  65536     1024              8192

DIMM Geometry for channel #1
Slot  Bank  Rank   Rows  Columns  Memory Size (MB)
#0      16     1  65536     1024              8192
Also the output of lspci:
00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-9th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:16.3 Serial controller: Intel Corporation Cannon Lake PCH Active Management Technology - SOL (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P600 Mobile] (rev a1)
04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
07:00.0 PCI bridge: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013]
08:00.0 PCI bridge: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013]
08:01.0 PCI bridge: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013]
08:04.0 PCI bridge: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013]
08:05.0 PCI bridge: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013]
09:00.0 USB controller: Fresco Logic FL1100 USB 3.0 Host Controller (rev 10)
0a:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
3a:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
70:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
71:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
Damn, the initial probing looks good but not during the resume stage.
Some traces in your dump that I have never seen here:
[ 183.705062] corefreqk: disagrees about version of symbol module_layout
...
[ 411.676453] caller SKL_IMC+0xf4/0x120 [corefreqk] mapping multiple BARs
Especially the last one, which suggests that something is tracking the PCI access. I can't tell yet if there is a conflict, but I don't use kdump because of this old issue.
The "disagrees" line may be there because the dmesg I sent corresponds to a 5.9.1 debug kernel (see the first line of the dmesg), for which I recompiled the module. I do not get that "disagrees" line on the normal non-debug 5.9.1 kernel. In any case that should not matter, as it crashes the same way on both the debug and non-debug kernels. Here are the dmesg lines when the module is loaded on the regular kernel, at early startup:
[ 6.680347] corefreqk: module verification failed: signature and/or required key missing - tainting kernel
[ 6.740202] CoreFreq(1:7): Processor [ 06_9E] Architecture [Coffee Lake/H] SMT [12/12]
[ 6.740665] resource sanity check: requesting [mem 0xfed10000-0xfed17fff], which spans more than pnp 00:08 [mem 0xfed10000-0xfed13fff]
[ 6.740674] caller SKL_IMC+0xf4/0x120 [corefreqk] mapping multiple BARs
Do you get a device from:
lspci -nn | grep -i 3ec4
EDIT: I would rather get the output of the following command to check if another driver is conflicting on the same PCI ids
lspci -nn -k
00:00.0 Host bridge [0600]: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4] (rev 07)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: skl_uncore
00:01.0 PCI bridge [0604]: Intel Corporation 6th-9th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 07)
Kernel driver in use: pcieport
00:02.0 VGA compatible controller [0300]: Intel Corporation UHD Graphics 630 (Mobile) [8086:3e9b]
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: i915
Kernel modules: i915
00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 07)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: proc_thermal
Kernel modules: processor_thermal_device
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
Subsystem: Lenovo Device [17aa:2269]
00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: intel_pch_thermal
Kernel modules: intel_pch_thermal
00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
00:14.2 RAM memory [0500]: Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
00:14.3 Network controller [0280]: Intel Corporation Wireless-AC 9560 [Jefferson Peak] [8086:a370] (rev 10)
Subsystem: Intel Corporation Device [8086:0030]
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:16.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller [8086:a360] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:16.3 Serial controller [0700]: Intel Corporation Cannon Lake PCH Active Management Technology - SOL [8086:a363] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: serial
00:17.0 SATA controller [0106]: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: ahci
00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 [8086:a338] (rev f0)
Kernel driver in use: pcieport
00:1c.7 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 [8086:a33f] (rev f0)
Kernel driver in use: pcieport
00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 [8086:a330] (rev f0)
Kernel driver in use: pcieport
00:1e.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller [8086:a328] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:1f.0 ISA bridge [0601]: Intel Corporation Cannon Lake LPC Controller [8086:a30e] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
00:1f.3 Audio device [0403]: Intel Corporation Cannon Lake PCH cAVS [8086:a348] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel, snd_soc_skl, snd_sof_pci
00:1f.4 SMBus [0c05]: Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller [8086:a324] (rev 10)
Subsystem: Lenovo Device [17aa:2269]
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-LM [8086:15bb] (rev 10)
Subsystem: Lenovo Device [17aa:225f]
Kernel driver in use: e1000e
Kernel modules: e1000e
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GLM [Quadro P600 Mobile] [10de:1cbc] (rev a1)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
Kernel driver in use: pcieport
05:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
Kernel driver in use: pcieport
05:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
Kernel driver in use: pcieport
05:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
Kernel driver in use: pcieport
05:04.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
Kernel driver in use: pcieport
06:00.0 System peripheral [0880]: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] [8086:15eb] (rev 06)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: thunderbolt
Kernel modules: thunderbolt
07:00.0 PCI bridge [0604]: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013] [8086:156d]
Kernel driver in use: pcieport
08:00.0 PCI bridge [0604]: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013] [8086:156d]
Kernel driver in use: pcieport
08:01.0 PCI bridge [0604]: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013] [8086:156d]
Kernel driver in use: pcieport
08:04.0 PCI bridge [0604]: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013] [8086:156d]
Kernel driver in use: pcieport
08:05.0 PCI bridge [0604]: Intel Corporation DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C 2013] [8086:156d]
Kernel driver in use: pcieport
09:00.0 USB controller [0c03]: Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10)
Subsystem: Device [1cfa:0002]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
0a:00.0 Ethernet controller [0200]: Intel Corporation I210 Gigabit Network Connection [8086:1533] (rev 03)
Kernel driver in use: igb
Kernel modules: igb
3a:00.0 USB controller [0c03]: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] [8086:15ec] (rev 06)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
70:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a] (rev 01)
Subsystem: Lenovo Device [17aa:2269]
Kernel driver in use: rtsx_pci
Kernel modules: rtsx_pci
71:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
Kernel driver in use: nvme
Kernel modules: nvme
Reading PCI works fine in CoreFreq at load time but fails when resuming, and I wonder whether one of the above modules has changed the hardware context.
I'm still studying their code for a potential conflict, but it would help if you could boot with most of them blacklisted, then try suspend with CoreFreq. If no crash happens, it will help us narrow the list of conflicts.
Still crashing with this epic module exclusion list, passed to the kernel command line:
module_blacklist=skl_uncore,proc_thermal,intel_pch_thermal,iwlwifi,intel_lpss,thunderbolt,nvidia,i2c_i801,snd_hda_intel,igb,rtsx_pci,mei_me,e1000e,serial
This results in these drivers loaded:
Kernel driver in use: skl_uncore
Kernel driver in use: pcieport
Kernel driver in use: i915
Kernel driver in use: proc_thermal
Kernel driver in use: xhci_hcd
Kernel driver in use: serial
Kernel driver in use: ahci
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: pcieport
Kernel driver in use: xhci_hcd
Kernel driver in use: xhci_hcd
Kernel driver in use: nvme
Some internal drivers like serial and proc_thermal cannot be excluded. nvme and i915 are required to boot at all. Not sure about the remaining others but would not be surprised they are required (let me know).
Could finally blacklist i915 but still crashing.
I mentioned kernel 4.9.1 (in the title and other messages) by mistake: it is kernel 5.9.1 of course.
I noticed that and tomorrow I will upgrade my Intel PC because it might be a Kernel change I don't see with the AMD one.
Thank you for blacklisting most drivers out of the problem. I'm aware it's a pain to achieve that.
Coming back to you ASAP.
So I successfully suspended and resumed an i7-8850H.
Same signature as yours:
CoreFreq(0:6): Processor [ 06_9E] Architecture [Coffee Lake/H] SMT [12/12]
But that does not mean there is no bug, given that commenting out CoreFreqK_ProbePCI() allows you to do an STR.
I'm just booting a plain ArchLinux, kernel version 5.9.2, in which I have added the line below to the initramfs to resume video.
# /etc/mkinitcpio.conf
MODULES=(intel_lpss_pci)
My setup might be more permissive with the underlying issue, although I don't find any fault in the kernel log.
My suspicion goes to the function Query_Turbo_TDP_Config(), which I would like you to comment out in corefreqk.c
EDIT: starting from the original source file.
void Query_SKL_IMC(void __iomem *mchmap)
{	/*Source: 6th & 7th Generation Intel® Processor for S-Platforms Vol 2*/
	unsigned short cha;

	PUBLIC(RO(Proc))->Uncore.CtrlCount = 1;
	/* Intra channel configuration */
	PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADCH.value = readl(mchmap + 0x5000);
	if (PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADCH.CH_L_MAP)
	{
		PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADC0.value = readl(mchmap + 0x5008);
		PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADC1.value = readl(mchmap + 0x5004);
	} else {
		PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADC0.value = readl(mchmap + 0x5004);
		PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADC1.value = readl(mchmap + 0x5008);
	}
	/* DIMM parameters */
	PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADD0.value = readl(mchmap + 0x500c);
	PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADD1.value = readl(mchmap + 0x5010);
	/* Sum up any present DIMM per channel. */
	PUBLIC(RO(Proc))->Uncore.MC[0].ChannelCount = \
		  ((PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADD0.Dimm_L_Size != 0)
		|| (PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADD0.Dimm_S_Size != 0))
		+ ((PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADD1.Dimm_L_Size != 0)
		|| (PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADD1.Dimm_S_Size != 0));
	/* Max of populated DIMMs L and DIMMs S */
	PUBLIC(RO(Proc))->Uncore.MC[0].SlotCount = KMAX(
		(1 + PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADC0.Dimm_L_Map),
		(1 + PUBLIC(RO(Proc))->Uncore.MC[0].SKL.MADC1.Dimm_L_Map)
	);
	for (cha = 0; cha < PUBLIC(RO(Proc))->Uncore.MC[0].ChannelCount; cha++)
	{
		PUBLIC(RO(Proc))->Uncore.MC[0].Channel[cha].SKL.Timing.value = \
			readl(mchmap + 0x4000 + 0x400 * cha);
		PUBLIC(RO(Proc))->Uncore.MC[0].Channel[cha].SKL.Sched.value = \
			readl(mchmap + 0x401c + 0x400 * cha);
		PUBLIC(RO(Proc))->Uncore.MC[0].Channel[cha].SKL.ODT.value = \
			readl(mchmap + 0x4070 + 0x400 * cha);
		PUBLIC(RO(Proc))->Uncore.MC[0].Channel[cha].SKL.Refresh.value = \
			readl(mchmap + 0x423c + 0x400 * cha);
	}
	/*
	Query_Turbo_TDP_Config(mchmap);
	*/
}
systemctl suspend
Thank you
I tested with most modules blacklisted and Query_Turbo_TDP_Config commented out, but it still crashes.
I'm going to check if it also crashes on kernel 5.8.15, from a previous openSUSE TW snapshot.
In case it does not crash, I would be interested in the /proc/config of the Suse kernel 5.9.1.
It may include build directives to examine for compliance, especially paging management, regarding the backtrace you posted.
Confirming that 5.8.15 does not crash (using unmodified corefreqk.c). I wanted to try with 5.10rc1 but it crashes on boot on my system.
Attached the /proc/config of 5.9.1
I could get 5.10-rc1 booting (by blacklisting module iwlwifi, which was making it crash on boot) and it has the same problem as 5.9.1.
I am looking for Suse specifics (build directives, patches, security, selinux, and so on), because I wonder whether you would still encounter a crash if you built your own 5.9.1 with vanilla settings. ArchLinux does try to stick to the mainstream kernel.
I tried the 5.9.1 vanilla kernel provided in the TW repo (the description says it has no SUSE patches) and it also causes the crash. However, the RPM spec file for it is still really complex and who knows what happens there...
I'll probably ask SUSE kernel developers about this crash, as they may have an idea what causes it.
Damn, I was sure it would make a difference. I wonder if it would crash with another distribution on your PC. Using Arch + kernel 5.9.1 on a Dell laptop w/ Coffee Lake, I can't reproduce your crash. I have tried many times, with different sleep delays, and also keeping the Cli opened through ssh on another PC: Resume works as expected, and the monitoring continues correctly.
The issue could be the PCI devices. Sounds like devices can be addressed during driver init but not when transitioning back from S3 : locked, hidden by firmware ?
It's very well possible this is hardware specific. However, 5.8.15 works, so maybe a weird kernel bug was introduced in 5.9? I will try a live CD of a distro using 5.9.x to see how it goes and report. Given what we know at this stage, I don't think it is worth spending more time on this currently, unless other users report it.
More news on this.
On my desktop PC (Core i7-8700K), I ran a live USB stick with openSUSE TW with Kernel 5.9.1 and... it did not crash on resume (with the corefreqk module loaded of course). So it tells us this is likely a system-specific crash. The dmesg output on that PC did not have these suspicious lines that I have on my crashing Thinkpad P72, so it is very likely that these lines are related to the crash:
[ 411.676445] resource sanity check: requesting [mem 0xfed10000-0xfed17fff], which spans more than pnp 00:08 [mem 0xfed10000-0xfed13fff]
[ 411.676453] caller SKL_IMC+0xf4/0x120 [corefreqk] mapping multiple BARs
Thank you very much for these tests. I have also been puzzled by those messages: the kernel or one of its drivers has the answer, because something was trapped and warned about. This could also be the result of the kdump tracking. So it's a matter of finding the lines of code, and the reverse call flow that leads to them. I'll search ASAP.
Another thought is the PCI device hiding that some BIOSes provide as an option. But why would the device be hidden only during Resume? So either it is the wrong clue, or it is a BIOS or EC bug.
I would say we can leave Suse alone, but this version is unexpectedly triggering something different on your hardware.
Some time ago I tried with Kdump disabled and it crashed the same.
There isn't anything else I can do. The mainstream Kernel is doing OK with the CoreFreq resume function. Please check if Suse updates fix the issue. Regards, CyrIng
Running Kernel 5.9.1 on openSUSE Tumbleweed on a Core i7-8850H laptop. Version of CoreFreq: clone of the master branch on October 28th.
If (and only if) the corefreqk module is loaded, suspending and then resuming my laptop causes a kernel crash 100% of the time. This never happens if the module is not loaded.
Fortunately I use Kdump. I have attached the backtrace given by the bt command of crash (bt.txt) and the crash dmesg (dmesg.txt). corefreqk appears several times in the backtrace, especially CoreFreqK_Resume. Weirdly, when the kernel crashes and Kdump takes over, there is a different backtrace whose top function is intel_psr_enable, but this one is logged neither in the attached dmesg nor in the crash backtrace. I think this is kind of irrelevant though, and just a consequence of a problem caused by corefreqk.
Also, when inserting the module, there are some dmesg warnings (resource sanity check), which are also the last lines of the attached crash dmesg.txt:
[ 183.705062] corefreqk: disagrees about version of symbol module_layout
[ 411.623529] corefreqk: loading out-of-tree module taints kernel.
[ 411.623632] corefreqk: module verification failed: signature and/or required key missing - tainting kernel
[ 411.676056] CoreFreq(4:10): Processor [ 06_9E] Architecture [Coffee Lake/H] SMT [12/12]
[ 411.676445] resource sanity check: requesting [mem 0xfed10000-0xfed17fff], which spans more than pnp 00:08 [mem 0xfed10000-0xfed13fff]
[ 411.676453] caller SKL_IMC+0xf4/0x120 [corefreqk] mapping multiple BARs
...
Hello,
In the develop branch, I'm providing a programming fix related to Symbols.
Can you tell me if you still encounter the issue with it?
Victory, the develop branch does not crash anymore (kernel 5.9.12)!
I also tested the current master branch to make sure the fix is really in develop, and master still crashes as expected.
Congrats on fixing this one!