Closed chenkaigithub closed 2 years ago
One of these Intel_DomainPowerLimit() calls doesn't work with your Skylake/X.
From the develop branch, can you comment out the whole statement at:
https://github.com/cyring/CoreFreq/blob/6e1d7d414017a24cce69edb24c5586324d0c1fa4/corefreqk.c#L9661
like this:
if (Core->Bind == PUBLIC(RO(Proc))->Service.Core) {
    /*
    Intel_DomainPowerLimit( MSR_PKG_POWER_LIMIT,
                            PKG_POWER_LIMIT_LOCK_MASK,
                            PWR_DOMAIN(PKG) );
    Intel_DomainPowerLimit( MSR_PP0_POWER_LIMIT,
                            PPn_POWER_LIMIT_LOCK_MASK,
                            PWR_DOMAIN(CORES) );
    Intel_DomainPowerLimit( MSR_PLATFORM_POWER_LIMIT,
                            PKG_POWER_LIMIT_LOCK_MASK,
                            PWR_DOMAIN(PLATFORM) );
    Intel_DomainPowerLimit( MSR_DRAM_POWER_LIMIT,
                            PPn_POWER_LIMIT_LOCK_MASK,
                            PWR_DOMAIN(RAM) );
    */
    Intel_Watchdog(Core);
}
Next, rebuild everything with make clean all
Make sure the Daemon and Driver are fully removed.
Now test the newly built Driver.
If things get better, un-comment the Intel_DomainPowerLimit() calls one after the other until you encounter the crash again.
You should then be able to tell me which of MSR_DRAM_POWER_LIMIT, MSR_PLATFORM_POWER_LIMIT, MSR_PP0_POWER_LIMIT and MSR_PKG_POWER_LIMIT is forbidden on your processor.
EDIT: looking at your RCX dump, MSR_PLATFORM_POWER_LIMIT is the faulty one:
https://github.com/cyring/CoreFreq/blob/6e1d7d414017a24cce69edb24c5586324d0c1fa4/intelmsr.h#L207
I think you can just comment out its call.
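For context: RDMSR and WRMSR take the MSR index in ECX, which is why the RCX value in a crash dump can identify the register that faulted. A minimal sketch of that lookup follows. The addresses are the ones listed in the Intel SDM and in the Linux kernel's msr-index.h, not copied from intelmsr.h, so they should be checked against that header; power_limit_msr_name is a hypothetical helper and not part of CoreFreq.

```c
#include <stddef.h>
#include <stdint.h>

/* One power-limit MSR: architectural index and symbolic name. */
struct msr_name {
    uint32_t index;
    const char *name;
};

/* Addresses assumed from the Intel SDM / Linux msr-index.h;
 * verify against CoreFreq's intelmsr.h before relying on them. */
static const struct msr_name power_limit_msrs[] = {
    { 0x610, "MSR_PKG_POWER_LIMIT" },
    { 0x618, "MSR_DRAM_POWER_LIMIT" },
    { 0x638, "MSR_PP0_POWER_LIMIT" },
    { 0x65C, "MSR_PLATFORM_POWER_LIMIT" },
};

/* Hypothetical helper: map the RCX value of a dump to a power-limit MSR name.
 * Only the low 32 bits (ECX) select the MSR. Returns NULL when unknown. */
const char *power_limit_msr_name(uint64_t rcx)
{
    uint32_t index = (uint32_t)rcx;
    size_t count = sizeof(power_limit_msrs) / sizeof(power_limit_msrs[0]);

    for (size_t i = 0; i < count; i++) {
        if (power_limit_msrs[i].index == index)
            return power_limit_msrs[i].name;
    }
    return NULL;
}
```

Calling power_limit_msr_name() with the RCX value from a vmcore-dmesg register dump returns the matching name, or NULL if the faulting MSR is not one of the four power-limit registers.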
@cyring Thank you for your quick response. I followed your instructions.
MSR_PLATFORM_POWER_LIMIT and MSR_DRAM_POWER_LIMIT each cause a system crash. Please see the attachments.
vmcore-dmesg(MSR_DRAM_POWER_LIMIT).txt vmcore-dmesg(MSR_PLATFORM_POWER_LIMIT).txt
Well done!
Can you now fetch the develop branch, where the fix is supplied, and provide various CoreFreq outputs, for instance as a dedicated "Xeon Gold 6126" page in the Wiki support section?
See this page as an example. The output of corefreq-cli -s will especially help me chase any remaining bugs or mistakes.
Thank you.
Thank you for your quick fix. The wiki is updated.
Thank you for these additions.
This is the first time I have seen PPIN# show up!
Many other small things to fix or investigate:
9C Turbo AUTO < 255 > is odd, but it may come from the hardware.
The side effect is that the UI ruler is scaled to 255.
I/O MMU Version remains to be queried.
Lowest C-State LIMIT < C0> and Max C-State Inclusion RANGE < C0> are both without C-states.
I am not an expert on the CPU part. I can help you to verify something.
I first have to improve the code in light of the data provided by your Skylake/X. For example, base the UI on validated ratios and add workarounds for garbage or uninitialized values.
I'll be back with additional testing requests. Thanks for your help. CyrIng
Among the latest commits in the develop branch, the UI ruler is now limited to a Max Ratio, which is computed from the Makefile directive MAX_FREQ_HZ and the Base Clock.
Please give it a try with a single Turbo-boosted Core.
You can use the integrated tools to stress the CPU: press O and choose Turbo Round Robin to stress the CPUs one after another.
According to your screenshot, I expect the drawing to be limited to 37, your highest boosted ratio in 1C.
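The ruler limit described above can be sketched as follows. This is an illustration under assumptions, not the code from the commit: max_ratio and ruler_top are hypothetical names, and the sketch simply assumes the ceiling is MAX_FREQ_HZ divided by the Base Clock and that any reported ratio above it (such as the odd 255) is ignored when scaling the ruler.

```c
#include <stddef.h>

/* Hypothetical helper: ratio ceiling derived from a frequency cap (Hz)
 * and the base clock (Hz). Returns 0 if the base clock is unknown. */
unsigned int max_ratio(unsigned long long max_freq_hz,
                       unsigned long long base_clock_hz)
{
    if (base_clock_hz == 0)
        return 0;
    return (unsigned int)(max_freq_hz / base_clock_hz);
}

/* Hypothetical helper: highest reported ratio that does not exceed the
 * ceiling. Ratios above the ceiling are treated as invalid and skipped. */
unsigned int ruler_top(const unsigned int *ratios, size_t count,
                       unsigned int ceiling)
{
    unsigned int top = 0;

    for (size_t i = 0; i < count; i++) {
        if (ratios[i] <= ceiling && ratios[i] > top)
            top = ratios[i];
    }
    return top;
}
```

With an illustrative cap of 5 GHz and a 100 MHz base clock the ceiling is 50, so a ratio list containing 37 and a bogus 255 would scale the ruler to 37.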
I will check this after Oct 8; I am on holiday for now.
Feel free to come back. Regards
Sorry, I have been busy these days. The develop branch cannot be compiled.
I was too ambitious in using __auto_type as a holy grail.
It has been removed in the latest commit of the develop branch.
It now compiles, but it crashes. vmcore-dmesg.txt
A fix is available in the develop branch, although I wonder whether, during our previous tests, we failed while probing the TCO register.
EDIT: before pulling the latest commit ff41125ef7a6375668bb45b994125ae2b270dfa8, can you retry, but this time unloading or blacklisting the TCO drivers iTCO_wdt and iTCO_vendor_support (and module dependencies if needed) prior to starting the corefreqk.ko module?
The IRQ trapped in the Call Trace of your kernel dump is a hint that those drivers may have installed an interrupt handler which conflicts with the CoreFreq driver.
CyrIng, how do I unload or blacklist the TCO drivers?
Unloading modules: modprobe -r or rmmod, followed by the module name.
Blacklisting modules: this prevents drivers from being auto-loaded during boot. When you can't unload a module because the kernel forces it resident, blacklisting is your option. It is a boot command-line argument that you set in your boot loader configuration:
grub.cfg
syslinux.cfg
/EFI/loader/...
Here are my arguments for a Xeon W3690. This is a bare-metal setup where CoreFreq is the kernel governor (Idle-Freq, CPU-Freq):
nmi_watchdog=0 modprobe.blacklist=pcspkr,iTCO_wdt,acpi_cpufreq,pcc_cpufreq,intel_cstate,intel_uncore,intel_powerclamp,i7core_edac,i5500_temp,coretemp,asus_atk0110 idle=halt intel_pstate=disable cpufreq.off=0
None of the modules listed in the blacklist= should be listed by lsmod afterwards.
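Since lsmod formats the content of /proc/modules, the same check can be sketched in C. This is only an illustration: module_listed is a hypothetical helper, and it assumes the caller has already read /proc/modules into a string, where each line starts with the module name followed by a space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if `name` appears as a module name in
 * `proc_modules`, the text content of /proc/modules. Each line has the form
 * "name size refcount deps state address", so an exact match requires the
 * name at the start of a line, immediately followed by a space. */
bool module_listed(const char *proc_modules, const char *name)
{
    size_t len = strlen(name);
    const char *line = proc_modules;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return true;
        line = strchr(line, '\n');   /* jump to the end of this line */
        if (line != NULL)
            line++;                  /* and step to the start of the next */
    }
    return false;
}
```

After booting with the blacklist argument, module_listed() should return false for every blacklisted name, for example iTCO_wdt.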
To make CoreFreq the governor, you will ask for registration:
insmod corefreqk.ko RDPMC_Enable=1 Register_ClockSource=1 Register_Governor=1 Register_CPU_Freq=1 Register_CPU_Idle=1
From the UI, menu Settings, you can also do the registration.
In short:
modprobe.blacklist=iTCO_wdt
I warmly recommend adding a new entry to your boot loader config as a BLACKLIST menu choice. Keep the original entry as a safe startup option.
Thanks. I can see that there is no Vcore reading.
Vcore for SKL/X is stored at this line: https://github.com/cyring/CoreFreq/blob/220a0bd43508dbed60a51ebc527d7b6eb8db1873/corefreqk.c#L13609
... from register MSR_IA32_PERF_STATUS
at:
https://github.com/cyring/CoreFreq/blob/220a0bd43508dbed60a51ebc527d7b6eb8db1873/corefreqk.c#L13593
So far I have no clue which SKL/X register to read the voltage VID from.
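For reference, the Intel SDM describes bits 47:32 of the PERF_STATUS register on Sandy Bridge and later client parts as the core voltage in units of 1/8192 V. A minimal sketch of that decoding follows; perf_status_vcore is a hypothetical helper, and whether a Skylake/X Xeon reports anything meaningful in that field is exactly the open question here.

```c
#include <stdint.h>

/* Hypothetical helper: decode the core voltage from a raw 64-bit
 * PERF_STATUS value, assuming the client-part layout where
 * bits 47:32 hold the voltage in units of 1/8192 V (2^-13 V). */
double perf_status_vcore(uint64_t perf_status)
{
    uint64_t vid = (perf_status >> 32) & 0xFFFF;

    return (double)vid / 8192.0;
}
```

A field value of 0 decodes to 0 V, which would be consistent with no Vcore reading being shown.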
I am not familiar with the CPU part. You can try other solutions, and I will help you test them.
Also notice in the screenshot that the relative frequency and its ratio on the stressed CPU 28 are around zero!
This is the first time I have encountered this issue.
Can you show the Core view, which displays the UCC, URC and TSC counters?
Can you also show the Sensors and Custom views?
I once received a CoreFreq execution report from a Skylake/X, and it gave much more meaningful values than that.
EDIT:
This i9-9980XE also has CPUID 06_55 Stepping 4, but yours is a Xeon.
EDIT:
Looking at the top-left LCD, we can see a frequency of 3678 MHz, but no such frequency in the first column of the lower area (and the ratio is stuck at 0.00).
I believe you have to download a fresh copy of the source code (develop branch), then fully rebuild and reload everything before making the screenshots.
UI directives exist to remove some drawing areas, especially when facing a high core count processor.
Run make help to list them.
As an example, to remove the bars drawing:
make NO_UPPER=1 clean all
You should then be able to watch all 48 of your CPUs; your Terminal height may however still need to be adjusted.
Is this OK now?
Yes, much better. So no issue on this part. Thank you.
Core voltage is the enhancement needed for your processor family.
No clue for Vcore.
Closing the issue. Feel free to come back about the new version.
Crash when insmod ./corefreqk.ko