cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

DKMS compilation fails on 6.3 #422

Closed emansom closed 1 year ago

emansom commented 1 year ago

When using linux-cachyos-rc as kernel, DKMS compilation fails.

/var/lib/dkms/corefreq/1.95.5/build/corefreqk.c: In function ‘CoreFreqK_mmap’:
/var/lib/dkms/corefreq/1.95.5/build/corefreqk.c:22084:23: error: assignment of read-only member ‘vm_flags’
22084 |         vma->vm_flags = VM_READ;
      |                       ^
compilation terminated due to -Wfatal-errors.
make[2]: *** [scripts/Makefile.build:252: /var/lib/dkms/corefreq/1.95.5/build/corefreqk.o] Error 1
make[1]: *** [Makefile:2026: /var/lib/dkms/corefreq/1.95.5/build] Error 2
make[1]: Leaving directory '/usr/lib/modules/6.3.0-rc1-1-cachyos-rc/build'
make: *** [Makefile:86: all] Error 2
make: Leaving directory '/usr/src/corefreq-1.95.5'
cyring commented 1 year ago

Thanks

This is not specific to CachyOS; it is a general issue with Linux kernel 6.3-rc1.

Work in progress. Stay tuned.

cyring commented 1 year ago

CoreFreq branch linux_6_3 is available for testing.

It builds and runs OK on my 3950X.

Save and sync your file systems prior to launching the driver.

Code review from anyone is welcome: the change touches a security-related part of CoreFreq, the one that protects access to the driver pages as read-only.
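
For readers wondering why a plain assignment to `vm_flags` stopped compiling: as far as I know, kernel 6.3 exposes `vm_flags` as a `const` member overlaid in a union with a private writable twin, and expects drivers to go through helpers such as `vm_flags_reset_once()`. The user-space mock below is only a sketch of that pattern; `struct mock_vma`, `priv_vm_flags`, `MOCK_VM_READ` and `mock_vm_flags_reset()` are hypothetical names invented for illustration, not kernel or CoreFreq code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long vm_flags_t;

#define MOCK_VM_READ  0x1UL   /* stand-in for the kernel's VM_READ */
#define MOCK_VM_WRITE 0x2UL   /* stand-in for the kernel's VM_WRITE */

/* Hypothetical mock: readers see a const member, writers use the twin. */
struct mock_vma {
    union {
        const vm_flags_t vm_flags;   /* "vma->vm_flags = x" fails to compile */
        vm_flags_t priv_vm_flags;    /* only helpers are meant to touch this */
    };
};

/* Hypothetical helper mirroring the idea of a vm_flags reset function:
 * it replaces the whole flag set through the writable union member. */
static inline void mock_vm_flags_reset(struct mock_vma *vma, vm_flags_t flags)
{
    assert(vma != NULL);
    vma->priv_vm_flags = flags;
}
```

With this layout, `vma.vm_flags = MOCK_VM_READ;` is rejected with the same "assignment of read-only member" diagnostic as in the report, while `mock_vm_flags_reset(&vma, MOCK_VM_READ);` compiles. Writing through a union twin of a const member leans on compiler behavior rather than on a strict reading of the C standard, so treat this as an illustration only.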

emansom commented 1 year ago

Builds and works perfectly now!

emansom commented 1 year ago

When built with ARCH_PMC=UMC the corefreq-cli TUI errors out on launch:

Daemon connection error code 2
corefreq-ro-shm: 'No such file or directory' @ line 21307

Is there some setup required for UMC?

CPU is Zen 2 3500X, platform X570.

cyring commented 1 year ago

When built with ARCH_PMC=UMC the corefreq-cli TUI errors out on launch:

Daemon connection error code 2
corefreq-ro-shm: 'No such file or directory' @ line 21307

Is there some setup required for UMC?

Make sure to fully rebuild and reload from the working directory, with no previous instance still running.

make ARCH_PMC=UMC clean all

rmmod corefreqk
insmod ./corefreqk.ko
./corefreqd -d ## check threaded monitoring loops
./corefreq-cli
emansom commented 1 year ago

Daemon is not launching.

$ sudo corefreqd -d
Driver connection error code 13
Version 0.0.0: 'Permission denied' @ line 9213
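
The numeric codes in these messages look like standard errno values: 2 pairs with "No such file or directory" and 13 with "Permission denied". Assuming that is what the CLI and Daemon print (an assumption, not confirmed in this thread), a small sketch can decode them; `describe_error_code()` is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map a numeric error code to the C library's
 * message for the same errno value. The text is locale dependent. */
const char *describe_error_code(int code)
{
    assert(code >= 0);
    return strerror(code);
}
```

For example, `describe_error_code(ENOENT)` and `describe_error_code(EACCES)` return the library's messages for codes 2 and 13 on Linux. Because `strerror()` follows the locale, the same codes can show up with localized text.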
cyring commented 1 year ago

Daemon is not launching.

The corefreqk.ko driver has probably crashed with ARCH_PMC=UMC, even though you have a Zen/Matisse like I do (mine is a 3950X). That PMC build also works on my other processor, a 5300U.

If not done yet, please check your kernel log for any trace of a CoreFreq driver crash.

Can you also make sure again that the standard build (without that PMC) still runs OK?

cyring commented 1 year ago

EDIT: Am I right that the Daemon issue happens with branch linux_6_3, and not with Linux 6.2 plus the master branch?

cyring commented 1 year ago

I have installed CachyOS for ArchLinux and I can't reproduce the Daemon issue with CoreFreq branch linux_6_3

make ARCH_PMC=UMC clean all

2023-03-11-030730_644x1012_scrot

cyring commented 1 year ago

From the Daemon process, I tried to write to one of the read-only mmap pages of corefreqk.ko, which threw a segfault in this user-space process. So it works as expected in the 6.3 kernel as well. I can't reproduce your Daemon issue.
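
The read-only check described above can be illustrated in plain user space. The sketch below does not talk to corefreqk.ko; it assumes an anonymous `PROT_READ` mapping is a fair stand-in for a driver page mapped read-only, and `write_to_readonly_page_faults()` is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* MAP_ANONYMOUS under strict -std modes */
#endif
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* Jump back to the probe instead of letting the process die. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);
}

/* Returns 1 if writing to a read-only page faulted, 0 if the write
 * went through, and -1 if the probe could not be set up. */
int write_to_readonly_page_faults(void)
{
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa, old_segv, old_bus;
    int faulted = 0;
    void *map;

    if (page <= 0)
        return -1;
    map = mmap(NULL, (size_t)page, PROT_READ,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -1;

    sa.sa_handler = on_fault;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    /* Some systems report protection faults as SIGBUS, so catch both. */
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    if (sigsetjmp(fault_env, 1) == 0)
        *(volatile char *)map = 1;   /* expected to fault */
    else
        faulted = 1;                 /* reached via siglongjmp */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(map, (size_t)page);
    return faulted;
}
```

A return value of 1 corresponds to the segfault reported above: the mapping rejects writes from user space.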

cyring commented 1 year ago

I also need your acknowledgement that there is no remaining issue before I merge into the master branch. Thank you.

emansom commented 1 year ago

I also need your acknowledgement that there is no remaining issue before I merge into the master branch. Thank you.

The computer with the 3500X is my parents' HTPC. I don't have access to it all the time (remotely, reverse tunnel via site-to-site VPN) as it's not always powered on.

I'll be at their place today for some maintenance on the HTPC and try to debug further.

PBO and UEFI settings are configured for low power usage and low thermal envelope, including disablement of cores (2 of the 6 are enabled). I suspect the failures may have something to do with that.

I'll check what dmesg prints after loading the module. Are there any extra compilation flags or module options to increase its verbosity?

cyring commented 1 year ago

I'll check what dmesg prints after loading the module. Are there any extra compilation flags or module options to increase its verbosity?

emansom commented 1 year ago

dmesg will be enough to diagnose corefreqk.ko if there is any issue. Also have a look at the module list with lsmod to check that the driver matching the version is properly loaded.

Only thing it prints is some info, no error.

ewout@enthoo ~ % sudo dmesg | tail
[  230.475320] rfkill: input handler disabled
[  231.205749] input: solaar-keyboard as /devices/virtual/input/input31
[ 1162.036175] logitech-djreceiver 0003:046D:C52B.0004: device of type eQUAD step 4 DJ (0x04) connected on slot 2
[ 1162.058254] input: Logitech K270 as /devices/pci0000:00/0000:00:01.2/0000:02:00.0/0000:03:08.0/0000:07:00.3/usb3/3-1/3-1:1.2/0003:046D:C52B.0004/0003:046D:4003.0009/input/input32
[ 1162.062013] logitech-hidpp-device 0003:046D:4003.0009: input,hidraw3: USB HID v1.11 Keyboard [Logitech K270] on usb-0000:07:00.3-1/input2:2
[ 1162.176015] logitech-hidpp-device 0003:046D:4003.0009: HID++ 2.0 device connected.
[ 3171.121897] snd_hda_codec_hdmi hdaudioC0D0: HDMI: audio coding xtype 11 not expected
[ 5079.763522] overlayfs: "xino" feature enabled using 2 upper inode bits.
[ 5079.831385] overlayfs: "xino" feature enabled using 2 upper inode bits.
[ 6396.239435] CoreFreq(1:-1:-1): Processor [ 8F_71] Architecture [Zen2/Matisse] CPU [2/2]
ewout@enthoo ~ % sudo dmesg | grep -i corefreq
[ 6396.239435] CoreFreq(1:-1:-1): Processor [ 8F_71] Architecture [Zen2/Matisse] CPU [2/2]
ewout@enthoo ~ % lsmod | grep -i corefreqk
corefreqk             622592  0
ewout@enthoo ~ %

Starting Daemon in debug mode will help you check if one monitoring loop thread is running per CPU. Thus, start with corefreqd -d

ewout@enthoo ~ % sudo corefreqd -d
Driver connection error code 13
Version 0.0.0: 'Permission denied' @ line 9213
ewout@enthoo ~ %

🤷‍♂️

cyring commented 1 year ago

What is going on is linked to the two Cores.

CoreFreq(1:-1:-1): Processor [ 8F_71] Architecture [Zen2/Matisse] CPU [2/2]

To minimize power consumption, I would not touch the Core count and would leave those BIOS settings enabled. Same with SMT. I would make sure all C-States, P-States and DVFS are enabled, preferably on AUTO.

PL, the Power Limiter, is what you should tune. ASUS provides an easy ECO mode: down to 65W on my 3950X. But you lose horsepower, if you don't mind that.

You can also tune the Vcore, SoC, and DRAM voltage.

Also consider that the kernel will do its best to migrate all tasks onto just the two CPUs. That's poor balancing for my taste. I would rather enable all Cores, including SMT, and take into account the performance score of each CPU. See the CPPC ratios listed in CoreFreq. With that knowledge you can then efficiently map the task-to-CPU affinity. Containers and virtualization can also make use of power affinity.

So far, the Daemon issue is due to unsupported factors.

cyring commented 1 year ago

Meanwhile I have tested a similar Core count reduction:

Driver, Daemon and CLI are doing OK

CoreFreq(1:-1:-1): Processor [ 8F_71] Architecture [Zen2/Matisse] CPU [2/2]

2023-03-14-192131_642x427_scrot

2023-03-14-192140_644x284_scrot

emansom commented 1 year ago

SMT also appears to be disabled. That is not recommended because some features are deactivated, like S3 SuspendToRam, other registers...

The 3500X is a 3600 with SMT disabled from the factory; it can't be enabled. S3 works just fine.

To minimize power consumption, I would not touch the Core count and would leave those BIOS settings enabled. Same with SMT. I would make sure all C-States, P-States and DVFS are enabled, preferably on AUTO.

PL, the Power Limiter, is what you should tune. ASUS provides an easy ECO mode: down to 65W on my 3950X. But you lose horsepower, if you don't mind that.

The PC is used primarily as HTPC, just watching online content. The build uses a NH-L9x65 and in its stock configuration the cooler couldn't keep up. With a custom "35W Eco Mode" (PPT = 40W, TDC = 35A, EDC = 40A) all is fine again. The disablement of cores was done to ensure it never ramps up any fans.

You can also tune the Vcore, SoC, and DRAM voltage

XMP is disabled, that's about it on underclocking.

Also consider that the kernel will do its best to migrate all tasks onto just the two CPUs. That's poor balancing for my taste. I would rather enable all Cores, including SMT, and take into account the performance score of each CPU. See the CPPC ratios listed in CoreFreq. With that knowledge you can then efficiently map the task-to-CPU affinity. Containers and virtualization can also make use of power affinity.

In my testing C6 state wasn't toggled much in those scenarios, when playing DRM encrypted videos in Firefox (no hardware acceleration support in Widevine on Linux, so CPU decryption + CPU H264 decode). Having only two cores enabled is more energy efficient.

emansom commented 1 year ago

So far, the Daemon issue is due to unsupported factors.

Yes, it indeed seems unrelated to 6.3. This should probably move to a new issue. Thanks for trying to reproduce 👍🏻

cyring commented 1 year ago

Could you put the Processor back to running CoreFreq and provide CLI outputs and UI screenshots? There are certainly things I'm not aware of about the 3500X. As an example, see a page like the 3950X one. Thank you for this.

gel-crabs commented 1 year ago

journalctl.txt

Here's what occurs when trying to run CoreFreq on 6.3 with the latest commits

Edit: It is due to the daemon and not the kernel module itself; modprobing corefreqk is fine but the daemon triggers the kernel bug.

cyring commented 1 year ago

journalctl.txt

Here's what occurs when trying to run CoreFreq on 6.3 with the latest commits

Edit: It is due to the daemon and not the kernel module itself; modprobing corefreqk is fine but the daemon triggers the kernel bug.

Reading the journal, the issue is said to be linked to a memset.

If this only happens when built with ARCH_PMC=UMC, then I'm pointed to this line:

https://github.com/cyring/CoreFreq/blob/85d0b2e4940532c8d4cc88b8545ee6cfe6222cd1/corefreqk.c#L22858

Printing the computed and real allocation sizes in both cases:

with ARCH_PMC=UMC    : procSize[28672][26720]
without ARCH_PMC=UMC : procSize[28672][25088]

Rounded to page size, the allocation of 28672 is OK with kmalloc and memset.
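
To make the rounding concrete: assuming 4096-byte pages, and assuming the second figure in each procSize pair is the raw structure size, both sizes round up to the same seven pages. The helper below is a hypothetical stand-in for the driver's ROUND_TO_PAGES macro, not its actual definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round size up to the next multiple of page_size. */
size_t round_to_pages(size_t size, size_t page_size)
{
    assert(page_size > 0);
    return ((size + page_size - 1) / page_size) * page_size;
}
```

Both `round_to_pages(26720, 4096)` and `round_to_pages(25088, 4096)` give 28672, which matches the first figure printed in each case, so the UMC build does not push the allocation onto an extra page.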

cyring commented 1 year ago

@gel-crabs

But the memset appears to be a consequence of an mmap issue. In function CoreFreqK_mmap, can you comment out the lines about vm_flags_reset_once, as below:

static int CoreFreqK_mmap(struct file *pfile, struct vm_area_struct *vma)
{
    unsigned long reqSize = vma->vm_end - vma->vm_start;
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 3, 0)
/*  vm_flags_t vm_ro = VM_READ;*/
    #endif
    int rc = -EIO;
    UNUSED(pfile);

  if (vma->vm_pgoff == ID_RO_VMA_PROC) {
    if (PUBLIC(RO(Proc)) != NULL)
    {
    const unsigned long secSize = ROUND_TO_PAGES(sizeof(PROC_RO));
    if (reqSize != secSize) {
        rc = -EAGAIN;
        goto EXIT_PAGE;
    }

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 3, 0)
/*  vm_flags_reset_once(vma, vm_ro);*/
    #else
    vma->vm_flags = VM_READ;
    #endif
    vma->vm_page_prot = PAGE_READONLY;

    rc = remap_pfn_range(   vma,
                vma->vm_start,
            virt_to_phys((void *) PUBLIC(RO(Proc))) >> PAGE_SHIFT,
                reqSize,
                vma->vm_page_prot);
    }
  } else if (vma->vm_pgoff == ID_RW_VMA_PROC) {
    if (PUBLIC(RW(Proc)) != NULL)
    {
    const unsigned long secSize = ROUND_TO_PAGES(sizeof(PROC_RW));
    if (reqSize != secSize) {
        rc = -EAGAIN;
        goto EXIT_PAGE;
    }

    rc = remap_pfn_range(   vma,
                vma->vm_start,
            virt_to_phys((void *) PUBLIC(RW(Proc))) >> PAGE_SHIFT,
                reqSize,
                vma->vm_page_prot);
    }
  } else if (vma->vm_pgoff == ID_RO_VMA_GATE) {
    if (PUBLIC(RO(Proc)) != NULL)
    {
    switch (SysGate_OnDemand()) {
    default:
    case -1:
        break;
    case 1:
        fallthrough;
    case 0: {
        const unsigned long
        secSize = PAGE_SIZE << PUBLIC(RO(Proc))->Gate.ReqMem.Order;
        if (reqSize != secSize) {
            return -EAGAIN;
        }

        #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 3, 0)
/*      vm_flags_reset_once(vma, vm_ro);*/
        #else
        vma->vm_flags = VM_READ;
        #endif
        vma->vm_page_prot = PAGE_READONLY;

        rc = remap_pfn_range(   vma,
                    vma->vm_start,
            virt_to_phys((void *) PUBLIC(OF(Gate))) >> PAGE_SHIFT,
                    reqSize,
                    vma->vm_page_prot);
        }
        break;
    }
    }
  } else if ((vma->vm_pgoff >= ID_RO_VMA_CORE)
      && (vma->vm_pgoff < ID_RW_VMA_CORE))
  {
    signed int cpu = vma->vm_pgoff - ID_RO_VMA_CORE;

    if (PUBLIC(RO(Proc)) != NULL) {
      if ((cpu >= 0) && (cpu < PUBLIC(RO(Proc))->CPU.Count)) {
    if (PUBLIC(RO(Core, AT(cpu))) != NULL)
    {
        const unsigned long secSize = ROUND_TO_PAGES(sizeof(CORE_RO));
        if (reqSize != secSize) {
            rc = -EAGAIN;
            goto EXIT_PAGE;
        }

        #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 3, 0)
/*      vm_flags_reset_once(vma, vm_ro);*/
        #else
        vma->vm_flags = VM_READ;
        #endif
        vma->vm_page_prot = PAGE_READONLY;

        rc = remap_pfn_range(   vma,
                    vma->vm_start,
        virt_to_phys((void *) PUBLIC(RO(Core, AT(cpu)))) >> PAGE_SHIFT,
                    reqSize,
                    vma->vm_page_prot);
    }
      }
    }
  } else if ((vma->vm_pgoff >= ID_RW_VMA_CORE)
      && (vma->vm_pgoff < ID_ANY_VMA_JAIL))
  {
    signed int cpu = vma->vm_pgoff - ID_RW_VMA_CORE;

    if (PUBLIC(RO(Proc)) != NULL) {
      if ((cpu >= 0) && (cpu < PUBLIC(RO(Proc))->CPU.Count)) {
    if (PUBLIC(RW(Core, AT(cpu))) != NULL)
    {
        const unsigned long secSize = ROUND_TO_PAGES(sizeof(CORE_RW));
        if (reqSize != secSize) {
            rc = -EAGAIN;
            goto EXIT_PAGE;
        }

        rc = remap_pfn_range(   vma,
                    vma->vm_start,
        virt_to_phys((void *) PUBLIC(RW(Core, AT(cpu)))) >> PAGE_SHIFT,
                    reqSize,
                    vma->vm_page_prot);
    }
      }
    }
  }
EXIT_PAGE:
    return rc;
}

Next, please rebuild, reload and test the Daemon connection.
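
The `#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 3, 0)` guards in the function above choose between the helper call and the direct assignment at compile time. The sketch below mirrors, in user space, how the kernel's KERNEL_VERSION macro packs a version into one integer, to the best of my knowledge of recent kernel headers; `MOCK_KERNEL_VERSION` and `needs_vm_flags_helpers()` are hypothetical names.

```c
#include <assert.h>

/* User-space mirror of the packing used by KERNEL_VERSION(a, b, c):
 * major in bits 16+, minor in bits 8..15, patch clamped to 255. */
#define MOCK_KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Hypothetical helper: returns 1 when the given kernel version would
 * take the ">= 6.3.0" branch of the guards, 0 otherwise. */
int needs_vm_flags_helpers(int major, int minor, int patch)
{
    assert(major >= 0 && minor >= 0 && patch >= 0);
    return MOCK_KERNEL_VERSION(major, minor, patch)
        >= MOCK_KERNEL_VERSION(6, 3, 0);
}
```

A release candidate such as 6.3.0-rc1 already carries version 6.3.0 in its headers, so it takes the new branch. That is consistent with the build failure showing up on rc1.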

cyring commented 1 year ago

Starting Daemon in debug mode will help you check if one monitoring loop thread is running per CPU. Thus, start with corefreqd -d

ewout@enthoo ~ % sudo corefreqd -d
Driver connection error code 13
Version 0.0.0: 'Permission denied' @ line 9213
ewout@enthoo ~ %

Also make sure any previous installation of CoreFreq binaries is removed and please run from the build directory using insmod and ./corefreqd -d

gel-crabs commented 1 year ago

journalctl.txt

corefreqd -d freezes and hangs the console. This is with or without ARCH_PMC=UMC.

cyring commented 1 year ago

journalctl.txt

corefreqd -d freezes and hangs the console. This is with or without ARCH_PMC=UMC.

Can you confirm that you have commented out vm_flags_reset_once as I requested above?

gel-crabs commented 1 year ago

Yes, I commented them out as requested above and rebuilt it.

cyring commented 1 year ago

@gel-crabs

Yes, I commented them out as requested above and rebuilt it.

I still can't reproduce that bug with 6.3. My understanding is that a kernel change prevents the driver's memory pages from being accessed from userspace, aka the Daemon. @emansom's issue was about an access denied code. Now, whatever the vm_flags are, CoreFreq doesn't work as it did pre-6.3. As long as I can't reproduce it, it's hard to fix anything. I'm using Archlinux and I've built and installed Linux 6.3 from this AUR package: linux-mainline

I would like to know about your kernel environment, especially the .config, to find any macro directive(s) I'm not aware of.

gel-crabs commented 1 year ago

linux-cachyos-rc 6.3-rc2

config.txt

Link to the patch list: https://github.com/CachyOS/kernel-patches/tree/master/6.3

cyring commented 1 year ago

linux-cachyos-rc 6.3-rc2

config.txt

Link to the patch list: https://github.com/CachyOS/kernel-patches/tree/master/6.3

Thanks. So this is with CachyOS patches.

But I would like to have a working status using the mainline kernel version 6.3

gel-crabs commented 1 year ago

I'm currently compiling it without the per-VMA locks patch to see if that's the problem

cyring commented 1 year ago

I'm currently compiling it without the per-VMA locks patch to see if that's the problem

Thank you very much

gel-crabs commented 1 year ago

It is not the per-VMA locks, I'm going to compile it without any patches now

gel-crabs commented 1 year ago

I also noticed this line in my dmesg:

[ 2.298642] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'

cyring commented 1 year ago

I also noticed this line in my dmesg:

[ 2.298642] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'

Sorry no idea what this is.

I have built and booted CachyOS from AUR, and CoreFreq is running as expected.

gel-crabs commented 1 year ago

I think that's unrelated. I still get the same error without any patches; it seems to have something to do with vm_mmap_pgoff. I'm going to try it with GCC and O2 now

cyring commented 1 year ago

I think that's unrelated. I still get the same error without any patches; it seems to have something to do with vm_mmap_pgoff. I'm going to try it with GCC and O2 now

Which compiler and options were you using when the issue occurred?

gel-crabs commented 1 year ago

Yep, it was Clang. It's working now.

Clang, -flto=auto, and -O3

cyring commented 1 year ago

Yep, it was Clang. It's working now.

Clang, -flto=auto, and -O3

Thank you. This has to be documented in the Readme somehow.

@emansom: Were you also building the kernel with clang when the Daemon was being denied?

gel-crabs commented 1 year ago

Yep, it was Clang. It's working now. Clang, -flto=auto, and -O3

Thank you. This has to be documented in the Readme somehow.

@emansom: Were you also building the kernel with clang when the Daemon was being denied?

It was working up to this point. It most likely has to do with this being an RC kernel; these kinds of issues usually get fixed by the time the kernel gets released.

cyring commented 1 year ago

At this point I don't see any other reason to keep the issue open. Thank you for your contributions.

emansom commented 1 year ago

@emansom: Were you also building the kernel with clang when the Daemon was being denied?

Not sure which compiler is triggered by the PKGBUILD (modified locally to pull from the linux_6_3 branch).

I'd guess it defaulted to gcc, not clang/llvm.

cyring commented 1 year ago

Yep, it was Clang. It's working now.

Clang, -flto=auto, and -O3

no LTO (by default)

cyring commented 1 year ago

I have been able to reproduce the issue, and turning off CONFIG_FORTIFY_SOURCE with a clang/llvm-built kernel works.

Build CoreFreq as below when clang is in use:

make CC=clang LLVM=1 clean all

Make sure this prerequisite is met:

zcat /proc/config.gz | grep CONFIG_FORTIFY_SOURCE
# CONFIG_FORTIFY_SOURCE is not set
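
The same prerequisite can be checked programmatically on an already decompressed config text. This is a minimal sketch that only assumes the usual kernel config layout, `OPTION=value` for set options and `# OPTION is not set` otherwise; `kconfig_option_is_set()` is a hypothetical helper, not part of CoreFreq.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if some line of config_text starts
 * with "<option>=", and 0 otherwise (including "# ... is not set"). */
int kconfig_option_is_set(const char *config_text, const char *option)
{
    size_t len;
    const char *line = config_text;

    assert(config_text != NULL && option != NULL);
    len = strlen(option);

    while (line != NULL && *line != '\0') {
        if (strncmp(line, option, len) == 0 && line[len] == '=')
            return 1;
        line = strchr(line, '\n');   /* move to the start of the next line */
        if (line != NULL)
            line++;
    }
    return 0;
}
```

Fed the output shown above, the helper returns 0 for `CONFIG_FORTIFY_SOURCE`, which is the state required here for a clang/llvm-built kernel.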