cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

[Solved] System Hard Locks Inserting corefreqk.ko (Intel Atom 330) #304

Closed svmlegacy closed 2 years ago

svmlegacy commented 2 years ago

Clean make of main branch. Inserting corefreqk.ko module results in hard lock of this system, even num lock frozen. Have also seen this issue on select Intel ES processors, on unreleased steppings.

cyring commented 2 years ago

> Clean make of main branch. Inserting corefreqk.ko module results in hard lock of this system, even num lock frozen.

Atom 330 of Diamondville has a CPUID of 06_1C

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.h#L1304

Was it running with older versions of CoreFreq ?

If not, comment out or remove those lines:

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L2316

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L7487

... then rebuild and try.

> Have also seen this issue on select Intel ES processors, on unreleased steppings.

ES, which CPUID and Brand strings are they ?

svmlegacy commented 2 years ago

Intel Atom 330, CPUID 106C2h (06_1C stepping 2) is correct.
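
For readers who want to check how 106C2h maps to 06_1C stepping 2, here is a minimal sketch of the standard CPUID leaf 1 EAX decoding. The helper name `decode_signature` and the `cpu_signature` struct are hypothetical and are not taken from CoreFreq.

```c
#include <stdint.h>

/* Display family/model/stepping derived from the CPUID leaf 1 EAX value. */
struct cpu_signature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* decode_signature() is a hypothetical helper, not a CoreFreq function. */
static struct cpu_signature decode_signature(uint32_t eax)
{
    unsigned stepping   = eax & 0xF;
    unsigned model      = (eax >> 4) & 0xF;
    unsigned family     = (eax >> 8) & 0xF;
    unsigned ext_model  = (eax >> 16) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;
    struct cpu_signature sig;

    /* The extended family is only added when the base family is 0xF. */
    sig.family = (family == 0xF) ? family + ext_family : family;
    /* The extended model only applies to base families 0x6 and 0xF. */
    sig.model = (family == 0x6 || family == 0xF)
              ? (ext_model << 4) | model
              : model;
    sig.stepping = stepping;
    return sig;
}
```

Calling `decode_signature(0x106C2)` yields family 0x6, model 0x1C and stepping 2, which is the 06_1C signature discussed here.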

> Was it running with older versions of CoreFreq ? If not, comment out or remove those lines: ... then rebuild and try.

Unfortunately still hard-locking. This is the first chance I've had to run this system. Do you have a suggested older version to try?

> ES, which CPUID and Brand strings are they ?

The two that I've tried are as follows:

Unsure if it's related; I always chalked it up to them being early ES parts. They hard-lock in exactly the same manner, so I added it as a piece of info.

cyring commented 2 years ago

> Unfortunately still hard-locking. This is the first chance I've had to run this system. Do you have a suggested older version to try?

Do you have any kernel log or screenshot of the backtracked functions and registers dump ?

> ES, which CPUID and Brand strings are they ?
>
> The two that I've tried are as follows:

CPUID signatures 06_1A and 06_1F are both implemented in CoreFreq, as _Nehalem_Bloomfield and _Nehalem_MB respectively.

Probably those zeros in the brand string Genuine Intel(R) CPU @ 0000 @ 1.87GHz lead the driver to a division error.

For testing, the line below can be commented out and replaced with a static value: https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L1017

/*
    iArg->Features->Factory.Freq = Intel_Brand( iArg->Features->Info.Brand,
                            iArg->Brand );
*/
    iArg->Features->Factory.Freq = 1870;
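
To illustrate the suspected failure mode, here is a minimal user-space sketch of pulling the frequency out of a brand string and returning 0 when nothing usable is found, so a caller can test the result before dividing by it. The helper `brand_freq_mhz` is hypothetical and is not CoreFreq's `Intel_Brand()`; it also uses floating point, which the real kernel driver would avoid.

```c
#include <stdlib.h>
#include <string.h>

/* brand_freq_mhz() is a hypothetical helper, not CoreFreq's Intel_Brand().
 * Returns the frequency in MHz, or 0 when no usable value is found. */
static unsigned int brand_freq_mhz(const char *brand)
{
    const char *at = strrchr(brand, '@');   /* frequency follows the last '@' */
    char *end;
    double value;

    if (at == NULL)
        return 0;
    value = strtod(at + 1, &end);           /* strtod skips leading spaces */
    if (end == at + 1 || value <= 0.0)
        return 0;                           /* no number, or an all-zero field */
    if (strncmp(end, "GHz", 3) == 0)
        return (unsigned int)(value * 1000.0 + 0.5);
    if (strncmp(end, "MHz", 3) == 0)
        return (unsigned int)(value + 0.5);
    return 0;                               /* unknown unit */
}
```

With the ES brand string quoted above, the last `@` field is `1.87GHz`, so this sketch returns 1870; a string with no `@` field returns 0 and should be treated as "unknown" rather than used as a divisor.
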

cyring commented 2 years ago

@svmlegacy : Please let me know about results with suggested code above and Atom 330 crash screen.

svmlegacy commented 2 years ago

> @svmlegacy : Please let me know about results with suggested code above and Atom 330 crash screen.

Still trying to get any kind of debugging info out. Hard lock occurs before any outputs. Trying to get debugging out to a secondary PC via the COM port, but so far only getting a garbled mess. Will let you know when I have something useful.

cyring commented 2 years ago

> @svmlegacy : Please let me know about results with suggested code above and Atom 330 crash screen.
>
> Still trying to get any kind of debugging info out. Hard lock occurs before any outputs. Trying to get debugging out to a secondary PC via the COM port, but so far only getting a garbled mess. Will let you know when I have something useful.

About the Atom 330, I would suggest reading the MSR registers that are accessed along the call flow.

Architecture entries are in these lines: https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.h#L6507

svmlegacy commented 2 years ago

Somewhat interesting update:

After a full, clean reinstall of Fedora 35 due to other unrelated troubles (nvidia 340 drivers breaking the system), I built CoreFreq for the shipped 5.14 kernel and got a segmentation fault on inserting corefreqk.ko. Rebooting the system without updating any packages resulted in the hard lock on loading again.

All suggested registers returned hex values without issue, matching on all cores. I'll submit the actual results tomorrow.

At this point, I'm debating on switching to another distro, even if Fedora 35 works on other platforms.

cyring commented 2 years ago

> At this point, I'm debating on switching to another distro, even if Fedora 35 works on other platforms.

My favorite is Arch Linux; in my Wiki I provide a CoreFreq live image based on Arch.

At the bottom of the page you'll also find the nightly build with the CoreFreq development branch embedded.

Those images also contain the full Arch installation scripts, including Network Manager and its nmtui for easy Network devices setup.

cyring commented 2 years ago

Just to be sure about Nehalem: here is the latest development using the bootable CoreFreq ISO CoreFreq_i7_920_20211219

svmlegacy commented 2 years ago

Update on the Atom 330: Corefreq Arch Linux build also has a kernel panic when loading the module.

Does this build push any information to ttyS0 by default? Still haven't gotten any meaningful information there from the machine at all, but curious if it's worth a try. Kernel panic didn't seem to have much valuable information, but I'll try to get a picture of it in the faulted state.

Will be trying the Nehalem chips after the Atom is sorted... They take up the same workbench :)

cyring commented 2 years ago

> Update on the Atom 330: Corefreq Arch Linux build also has a kernel panic when loading the module.
>
> Does this build push any information to ttyS0 by default? Still haven't gotten any meaningful information there from the machine at all, but curious if it's worth a try. Kernel panic didn't seem to have much valuable information, but I'll try to get a picture of it in the faulted state.
>
> Will be trying the Nehalem chips after the Atom is sorted... They take up the same workbench :)

Can you post here the output of the command lspci -nn for your Atom 330 and the ES processors ?

I would like to check their device IDs (DID) and the resulting driver call flow. Perhaps some DIDs are present but the Base Address and CSR registers are not. For example, the Atom 330 does not support VT-d.

svmlegacy commented 2 years ago

Atom 330 lspci: here

cyring commented 2 years ago

> Atom 330 lspci: here

OH! NVidia MCP79 is not implemented yet.

The vendor ID 10de (NVIDIA) is not part of the driver yet. The module may still start with this argument:

insmod corefreqk.ko ArchID=<N>

where <N> is taken from the generic architectures, 0 or 11

https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.h#L6096

We will have to program a new loop from scratch. This time, I recommend using the most transparent VM to test and enhance CoreFreq until we feel confident enough to run on bare metal.

As usual, the key to a good implementation is the NVidia MCP79 datasheet and its register specification. Googling turns up some documents; the kernel source code for that chip is also worth digging into.

cyring commented 2 years ago

2021-12-21-101849_766x674_scrot

Apparently MSR_PLATFORM_ID is available. The first change is to add _Atom_Bonnell to the Intel_MaxBusRatio() function: https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L2311

int Intel_MaxBusRatio(PLATFORM_ID *PfID)
{
    struct SIGNATURE whiteList[] = {
        _Core_Conroe,       /* 06_0F */
        _Core_Penryn,       /* 06_17 */
        _Atom_Bonnell,      /* 06_1C */
        _Atom_Silvermont,   /* 06_26 */
        _Atom_Lincroft,     /* 06_27 */
        _Atom_Clover_Trail, /* 06_35 */
        _Atom_Saltwell,     /* 06_36 */
        _Silvermont_Bay_Trail,  /* 06_37 */
    };
    int id, ids = sizeof(whiteList) / sizeof(whiteList[0]);
    for (id = 0; id < ids; id++) {
        if ((whiteList[id].ExtFamily \
            == PUBLIC(RO(Proc))->Features.Std.EAX.ExtFamily)
         && (whiteList[id].Family \
            == PUBLIC(RO(Proc))->Features.Std.EAX.Family)
         && (whiteList[id].ExtModel \
            == PUBLIC(RO(Proc))->Features.Std.EAX.ExtModel)
         && (whiteList[id].Model \
            == PUBLIC(RO(Proc))->Features.Std.EAX.Model))
        {
            RDMSR((*PfID), MSR_IA32_PLATFORM_ID);
            return 0;
        }
    }
    return -1;
}

Then rebuild, unload, and restart everything (bare-metal test).

cyring commented 2 years ago

Another request is to check whether MSR_PLATFORM_INFO is effectively not supported by Bonnell, because it is not listed among the architectural MSRs: 2021-12-21-114312_765x417_scrot

whereas we have a go for MSR_IA32_PERF_STATUS 2021-12-21-115213_773x170_scrot

If unsupported, please comment out its usage in the function Intel_Core_Platform_Info(): https://github.com/cyring/CoreFreq/blob/478eee81930e1c339f13787a17d7d0ffe2231e2d/corefreqk.c#L2341

Change the function as below:

void Intel_Core_Platform_Info(unsigned int cpu)
{
    PLATFORM_ID PfID = {.value = 0};
    PLATFORM_INFO PfInfo = {.value = 0};
    PERF_STATUS PerfStatus = {.value = 0};
    unsigned int ratio0 = 10, ratio1 = 10; /*Arbitrary values*/
/*
    RDMSR(PfInfo, MSR_PLATFORM_INFO);
    if (PfInfo.value != 0) {
        ratio0 = PfInfo.MaxNonTurboRatio;
    }
*/
    RDMSR(PerfStatus, MSR_IA32_PERF_STATUS);
    if (PerfStatus.value != 0) {                /* §18.18.3.4 */
        if (PerfStatus.CORE.XE_Enable) {
            ratio1 = PerfStatus.CORE.MaxBusRatio;
        } else {
            if (Intel_MaxBusRatio(&PfID) == 0) {
                if (PfID.value != 0)
                {
                    ratio1 = PfID.MaxBusRatio;
                }
            }
        }
    } else {
        if (Intel_MaxBusRatio(&PfID) == 0) {
            if (PfID.value != 0)
            {
                ratio1 = PfID.MaxBusRatio;
            }
        }
    }

    PUBLIC(RO(Core, AT(cpu)))->Boost[BOOST(MIN)] =  KMIN(ratio0, ratio1);
    PUBLIC(RO(Core, AT(cpu)))->Boost[BOOST(MAX)] =  KMAX(ratio0, ratio1);
}

cyring commented 2 years ago

@svmlegacy Hey! any progress with the debugging code requests above ?

cyring commented 2 years ago

@svmlegacy : Please let me know when you can contribute to this issue.

cyring commented 2 years ago

@svmlegacy Since commit b2f75c89332a1e0ffa517c22895c57c1b91ac812 what about Atom 330 ?

svmlegacy commented 2 years ago

Sorry about the inactivity lately, I'll give it a shot tomorrow and see what happens! Thanks for the poke.

svmlegacy commented 2 years ago

All my previous attempts were fruitless, just tried again with the dev version of the archlinux ISO and the current master branch. No luck. Haven't been able to get a serial connection outbound either. Screenshot from 2022-04-06 18-20-26

cyring commented 2 years ago

> All my previous attempts were fruitless, just tried again with the dev version of the archlinux ISO and the current master branch. No luck. Haven't been able to get a serial connection outbound either. Screenshot from 2022-04-06 18-20-26

Thanks for trying the develop branch. Don't you have any kernel log (dmesg) to see where the Atom has crashed in the driver callflow ?

svmlegacy commented 2 years ago

> Don't you have any kernel log (dmesg) to see where the Atom has crashed in the driver callflow ?

Great point! There is something that changed since last time I was working with this. Before, the system would hard lock, meaning I couldn't pull from dmesg. Now, it seems like it's not causing the system to lock (but still isn't working quite right.)

Here's the dmesg pulled from the system, with the attempted module insertion as the last entries: dmesg.txt .

cyring commented 2 years ago

> Don't you have any kernel log (dmesg) to see where the Atom has crashed in the driver callflow ?
>
> Great point! There is something that changed since last time I was working with this. Before, the system would hard lock, meaning I couldn't pull from dmesg. Now, it seems like it's not causing the system to lock (but still isn't working quite right.)
>
> Here's the dmesg pulled from the system, with the attempted module insertion as the last entries: dmesg.txt .

Yes, it started at:

CoreFreq(0:2:-1): Processor [ 06_1C] Architecture [Atom/Bonnell] SMT [4/4]

Can you read this register ?

## MSR_TEMPERATURE_TARGET
rdmsr -ax 0x1A2

If not, please comment out that line in the driver code, then rebuild and reload everything for testing: https://github.com/cyring/CoreFreq/blob/a1540153123db1b2614dcc2d8cddede1be3a42cb/corefreqk.c#L7737
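
The same check can be scripted. The sketch below reads one MSR through the Linux msr character device (the interface the rdmsr tool relies on) and reports failure instead of assuming the register exists. The helper name `read_msr_dev` is hypothetical; the device path is passed in by the caller.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* read_msr_dev() is a hypothetical helper for illustration.
 * Returns 0 and fills *value on success, -1 if the device cannot be
 * opened or if reading that MSR fails. */
static int read_msr_dev(const char *dev_path, uint32_t reg, uint64_t *value)
{
    int fd = open(dev_path, O_RDONLY);
    ssize_t got;

    if (fd < 0)
        return -1;
    /* The msr device uses the file offset as the MSR address. */
    if (lseek(fd, (off_t)reg, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    got = read(fd, value, sizeof(*value));
    close(fd);
    return (got == (ssize_t)sizeof(*value)) ? 0 : -1;
}
```

A call such as `read_msr_dev("/dev/cpu/0/msr", 0x1A2, &v)` needs root and the msr module loaded. A return value of -1 here would correspond to the rdmsr command failing for that register.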

svmlegacy commented 2 years ago

Screenshot from 2022-04-07 20-25-47

> Can you read this register ?
>
> ## MSR_TEMPERATURE_TARGET
> rdmsr -ax 0x1A2

Nope. Could not read that MSR.

Commenting out this line lets the module load with no issues: https://github.com/cyring/CoreFreq/blob/a1540153123db1b2614dcc2d8cddede1be3a42cb/corefreqk.c#L7737

Dumped a bunch of info here: https://gist.github.com/svmlegacy/9bd33c5b273e4310f20a3c6c2b288bfe

Wonderful to see progress!

cyring commented 2 years ago

Great to see that screenshot of Bonnell

The last register, MSR_TEMPERATURE_TARGET, really hurts this processor, and without it we are left without a TjMax, which is hard-coded to 100°C. We can fine-tune TjMax and also the temperature formula if you are aware of better values for your processor.

I'm wrapping up all the code changes: other Atom architectures are also affected by the same issue.
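
To show why the TjMax value matters, here is a minimal sketch of the usual Intel digital thermal sensor calculation: the Digital Readout field of IA32_THERM_STATUS (bits 22:16) counts degrees below TjMax. The helper `core_temp_celsius` is hypothetical and is not CoreFreq's own temperature formula.

```c
#include <stdint.h>

/* core_temp_celsius() is a hypothetical helper, not CoreFreq's formula.
 * therm_status is a raw IA32_THERM_STATUS value; tjmax is in degrees C.
 * Returns -1 when the Reading Valid bit (31) is clear. */
static int core_temp_celsius(uint64_t therm_status, int tjmax)
{
    unsigned int readout = (unsigned int)((therm_status >> 16) & 0x7F);

    if (((therm_status >> 31) & 1) == 0)
        return -1;
    return tjmax - (int)readout;    /* readout counts degrees below TjMax */
}
```

With a readout of 40, a TjMax of 100 gives 60°C while a TjMax of 85 gives 45°C, so any error in the hard-coded TjMax shifts every reported temperature by the same amount.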

cyring commented 2 years ago

@svmlegacy Code changes made so far are available in commit 0794238d5e9bdeae6252dff46f8dd001f5c12294

The monitoring loop for Bonnell is very basic and now needs to be refined with the architectural MSR registers listed in chapter 2.3 of the SDM specifications.

2022-04-08-084946_811x144_scrot

cyring commented 2 years ago

And this datasheet also -;)

EDIT: If temperature is not accurate, you can try the integer value of 85 at this code line:

https://github.com/cyring/CoreFreq/blob/0794238d5e9bdeae6252dff46f8dd001f5c12294/corefreqk.c#L8235

2022-04-08-100932_675x163_scrot

https://github.com/cyring/CoreFreq/blob/0794238d5e9bdeae6252dff46f8dd001f5c12294/corefreqk.h#L7115

with:

    .voltageFormula = VOLTAGE_FORMULA_INTEL_SOC,

or:

    .voltageFormula = VOLTAGE_FORMULA_INTEL_SNB,

1. Rebuild and run
2. Set the Voltage scope to <SMT> in the Settings menu
3. Switch to the Voltage view

svmlegacy commented 2 years ago

Good News! The develop branch now works as-is for the Atom 330.

Reported temperature looks good. Offsetting by another 15°C would put it sub-ambient. Tjmax of 85°C matches what is reported by other utilities.

I tried changing the .voltageFormula with the suggested statements:

https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/corefreqk.h#L7172

Neither produced a good result in the SMT scope. _SOC was locked at 0.38V, and _SNB was at 0.0033 V. Expected VID range per the datasheet is 0.7 - 1.2 V.

FYI, I have a couple of other Bonnell chips that we can use for testing: Intel Atom N270 (32-bit only, Diamondville) and Intel Atom N450 (64-bit capable, Pineview).

cyring commented 2 years ago

> Good News! The develop branch now works as-is for the Atom 330.
>
> Reported temperature looks good. Offsetting by another 15°C would put it sub-ambient. Tjmax of 85°C matches what is reported by other utilities.
>
> I tried changing the .voltageFormula with the suggested statements:
>
> https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/corefreqk.h#L7172
>
> Neither produced a good result in the SMT scope. _SOC was locked at 0.38V, and _SNB was at 0.0033 V. Expected VID range per the datasheet is 0.7 - 1.2 V.

Let's keep this voltage algorithm VOLTAGE_FORMULA_INTEL_SOC but we will adjust the formula here: https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/coretypes.h#L614

What we are interested in is this equation: https://github.com/cyring/CoreFreq/blob/ed94b48f4adaad30f8c4df7f7f83734f60f1cf03/coretypes.h#L629 which receives a voltage VID as input and outputs the Vcore.

In the datasheets, usually volume 1, we should find the table that maps one to the other, along with the step size and any offsets to apply in the Vcore formula.

Tbc.

> FYI, I have a couple of other Bonnell chips that we can use for testing: Intel Atom N270 (32-bit only, Diamondville) and Intel Atom N450 (64-bit capable, Pineview).

32-bit is not supported, but I will enjoy the N450.

cyring commented 2 years ago

From the datasheet, table 3-2:

2022-04-18-044350_632x473_scrot

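| VID (binary, decimal) | Formula                      | Vcore (V) |
|-----------------------|------------------------------|-----------|
| 1 0 0 1 0 0 1 (73)    | 0.7 + (73.0 - 73.0) * 0.0125 | 0.7000    |
| 1 0 0 1 0 0 0 (72)    | 0.7 + (73.0 - 72.0) * 0.0125 | 0.7125    |
| 0 1 1 0 1 1 0 (54)    | 0.7 + (73.0 - 54.0) * 0.0125 | 0.9375    |
| 0 1 0 0 0 0 1 (33)    | 0.7 + (73.0 - 33.0) * 0.0125 | 1.2000    |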
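
The rows above follow one linear rule, which can be sketched in integer arithmetic (kernel code generally avoids floating point). The helper `vid_to_microvolts` is hypothetical and is built only from the table values shown here: 0.7 V at VID 73, plus 12.5 mV for every step below 73.

```c
/* vid_to_microvolts() is a hypothetical helper derived from the rows above:
 * Vcore = 0.7 V + (73 - VID) * 12.5 mV, expressed in microvolts so that
 * no floating point is needed. */
static long vid_to_microvolts(unsigned int vid)
{
    return 700000L + (73L - (long)vid) * 12500L;
}
```

For example, VID 54 gives 937500 µV (0.9375 V) and VID 33 gives 1200000 µV (1.2 V). A raw VID of 27 would evaluate to 1.275 V, which is outside the 0.7 to 1.2 V range covered by these rows.
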

cyring commented 2 years ago

My notes

High-k and Metal Gate Transistor Research

HiK-MG-Fig 2 HiK-MG-Fig 1 HiK-MG-Fig 3

2022-04-18-055907_1119x385_scrot

svmlegacy commented 2 years ago

It seems to be pulling a VID value of 27, which according to the formula gives a higher voltage than expected for this CPU.

Will verify the VID MSR later tonight, along with potentially a measurement of Vcc at the VRM.

Screenshot from 2022-04-18 05-48-57

cyring commented 2 years ago

Callflow

https://github.com/cyring/CoreFreq/blob/bf8f6d2f1e51f12495358b3e818f8e3590ab9e4a/corefreqk.h#L1491

Atom Bonnell is routed to a compatible Core2 loop:

https://github.com/cyring/CoreFreq/blob/bf8f6d2f1e51f12495358b3e818f8e3590ab9e4a/corefreqk.c#L13601

where VID is read from MSR_IA32_PERF_CTL

https://github.com/cyring/CoreFreq/blob/bf8f6d2f1e51f12495358b3e818f8e3590ab9e4a/corefreqk.c#L13626

The MSR is specified per class of architecture

https://github.com/cyring/CoreFreq/blob/bf8f6d2f1e51f12495358b3e818f8e3590ab9e4a/intelmsr.h#L546

Probably Atom Bonnell has a different bit layout ... Or is there another MSR to query the VID from ?

svmlegacy commented 2 years ago

Intel Atom N450 CoreFreq, lspci, and /proc/cpuinfo

I haven't had luck tracing an appropriate MSR so far. Intel does not do a good job of describing MSR_IA32_PERF_CTL in the software developer's manual for these CPUs.

cyring commented 2 years ago

> Intel Atom N450 CoreFreq, lspci, and /proc/cpuinfo
>
> I haven't had luck tracing an appropriate MSR so far. Intel does not do a good job of describing MSR_IA32_PERF_CTL in the software developer's manual for these CPUs.

You must start the driver with a hard-coded BCLK, as below:

insmod corefreqk.ko AutoClock=0

Then you can monitor CPU frequencies again.

cyring commented 2 years ago

@svmlegacy

To avoid the side effect of the variant TSC on the Intel Atom N450, I recommend starting CoreFreq with the AutoClock=0 parameter. Please see the previous comment.
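
As a rough way to tell in advance whether a processor falls into the variant TSC case, one can test the Invariant TSC flag, which CPUID leaf 0x80000007 reports in EDX bit 8. The sketch below only interprets an EDX value supplied by the caller; both helper names are hypothetical, and treating "not invariant" as "use AutoClock=0" is an assumption drawn from this thread, not from the driver's logic.

```c
#include <stdint.h>

/* CPUID leaf 0x80000007, EDX bit 8: Invariant TSC. */
static int tsc_is_invariant(uint32_t edx_80000007)
{
    return (int)((edx_80000007 >> 8) & 1);
}

/* needs_autoclock_off() is hypothetical: it mirrors the advice in this
 * thread, suggesting AutoClock=0 when the TSC is not invariant. */
static int needs_autoclock_off(uint32_t edx_80000007)
{
    return !tsc_is_invariant(edx_80000007);
}
```

On GCC or Clang, the EDX value can be obtained with `__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`, which returns 0 when the leaf is not available.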

cyring commented 2 years ago

@svmlegacy : Thinking about TSC, I would like to enhance the driver to let it handle the variant case by itself. I just need your Atom N450 for future code testing; if it's ok for you ?

svmlegacy commented 2 years ago

> @svmlegacy : Thinking about TSC, I would like to enhance the driver to let it handle the variant case by itself. I just need your Atom N450 for future code testing; if it's ok for you ?

Atom N450 Results updated here with AutoClock=0 : Intel Atom N450

Yes, no problem to use this machine for future testing. I'll keep it available.

cyring commented 2 years ago

Please let me know, or show me, what is still missing from the latest develop branch.

svmlegacy commented 2 years ago

Hello, sorry about the delay;

Here's the current output for the Atom N450: here

I still need to start the kernel module with AutoClock=0.

cyring commented 2 years ago

> Hello, sorry about the delay;
>
> Here's the current output for the Atom N450: here
>
> I still need to start the kernel module with AutoClock=0.

Indeed, the AutoClock parameter is not programmed to switch automatically when facing the variant TSC case. Because those processors are less common, I let the user set it to OFF.

There is still some work to do, and not the easiest, if feasible at all:

cyring commented 2 years ago

Hello,

The attached version is an attempt to compute the DIMM geometry on your N450 (the bus rate and speed unit is, by the way, changed to MT/s).

Could you please show me the output of corefreq-cli -M ?

CoreFreq_develop.tar.gz

cyring commented 2 years ago

About Vcore, I would also need the following outputs from your N450 :

## MSR:IA32_PERF_STATUS
rdmsr -ax 0x198
## MSR:IA32_PERF_CTL
rdmsr -ax 0x199

cyring commented 2 years ago

New code is now available in the develop branch. Please confirm whether it works for you; thank you.