cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

Possible cause for kernel hang #145

Closed by terencode 4 years ago

terencode commented 5 years ago

I installed the latest version from master and modprobed the module and after some time my machine became completely unresponsive. I could however ssh into it and retrieve the following:

Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 11.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 11.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 10.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
pam_unix(sudo:session): session closed for user root
 terence : TTY=pts/0 ; PWD=/home/terence ; USER=root ; COMMAND=/usr/bin/perf top
pam_unix(sudo:session): session opened for user root by terence(uid=0)
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
INFO: task Xorg:37121 blocked for more than 122 seconds.
      Tainted: G           OE     5.2.16-32-ck1-tkg-MuQSS #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Xorg            D    0 37121    916 0x00004004
Call Trace:
 ? __schedule+0x6d6/0xf30
 ? schedule_timeout+0x30b/0x500
 schedule+0x8c/0x220
 schedule_timeout+0x30b/0x500
 wait_for_common.constprop.0+0xcf/0x150
 ? wake_up_q+0x60/0x60
 __flush_work+0x156/0x210
 ? flush_workqueue_prep_pwqs+0x130/0x130
 drain_all_pages+0x14e/0x1a0
 __alloc_pages_nodemask+0x7ff/0x1210
 ttm_pool_populate+0x39e/0x5e0 [ttm]
 ttm_populate_and_map_pages+0x25/0x2b0 [ttm]
 ? kvmalloc_node+0x47/0x80
 ttm_tt_bind+0x3c/0xa0 [ttm]
 ttm_bo_handle_move_mem+0x2bd/0x580 [ttm]
 ? drm_mm_insert_node_in_range+0x335/0x480 [drm]
 ttm_bo_validate+0x26b/0x2d0 [ttm]
 ttm_bo_init_reserved+0x334/0x380 [ttm]
 amdgpu_bo_do_create+0x1a9/0x480 [amdgpu]
 ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu]
 amdgpu_bo_create+0x43/0x200 [amdgpu]
 ? skb_copy_datagram_from_iter+0x60/0x1c0
 amdgpu_gem_create_ioctl+0x14a/0x330 [amdgpu]
 ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
 drm_ioctl_kernel+0xb8/0x100 [drm]
 drm_ioctl+0x253/0x3f0 [drm]
 ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 do_vfs_ioctl+0x43d/0x7a0
 __x64_sys_ioctl+0x62/0x90
 do_syscall_64+0x4e/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fea89ba821b
Code: Bad RIP value.
RSP: 002b:00007ffe4cf5c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe4cf5c930 RCX: 00007fea89ba821b
RDX: 00007ffe4cf5c930 RSI: 00000000c0206440 RDI: 000000000000000e
RBP: 00000000c0206440 R08: 000055af123beec0 R09: 00007fea89c71b00
R10: 00007ffe4cfd3080 R11: 0000000000000246 R12: 000055af123beec0
R13: 000000000000000e R14: 00000000003ae000 R15: 000055af0fdcda40
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 10.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
pam_unix(sudo:session): session closed for user root
Uhhuh. NMI received for unknown reason 0c on CPU 11.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 11.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
INFO: task Xorg:37121 blocked for more than 245 seconds.
      Tainted: G           OE     5.2.16-32-ck1-tkg-MuQSS #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Xorg            D    0 37121    916 0x00004004
Call Trace:
 ? __schedule+0x6d6/0xf30
 ? schedule_timeout+0x30b/0x500
 schedule+0x8c/0x220
 schedule_timeout+0x30b/0x500
 wait_for_common.constprop.0+0xcf/0x150
 ? wake_up_q+0x60/0x60
 __flush_work+0x156/0x210
 ? flush_workqueue_prep_pwqs+0x130/0x130
 drain_all_pages+0x14e/0x1a0
 __alloc_pages_nodemask+0x7ff/0x1210
 ttm_pool_populate+0x39e/0x5e0 [ttm]
 ttm_populate_and_map_pages+0x25/0x2b0 [ttm]
 ? kvmalloc_node+0x47/0x80
 ttm_tt_bind+0x3c/0xa0 [ttm]
 ttm_bo_handle_move_mem+0x2bd/0x580 [ttm]
 ? drm_mm_insert_node_in_range+0x335/0x480 [drm]
 ttm_bo_validate+0x26b/0x2d0 [ttm]
 ttm_bo_init_reserved+0x334/0x380 [ttm]
 amdgpu_bo_do_create+0x1a9/0x480 [amdgpu]
 ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu]
 amdgpu_bo_create+0x43/0x200 [amdgpu]
 ? skb_copy_datagram_from_iter+0x60/0x1c0
 amdgpu_gem_create_ioctl+0x14a/0x330 [amdgpu]
 ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
 drm_ioctl_kernel+0xb8/0x100 [drm]
 drm_ioctl+0x253/0x3f0 [drm]
 ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 do_vfs_ioctl+0x43d/0x7a0
 __x64_sys_ioctl+0x62/0x90
 do_syscall_64+0x4e/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fea89ba821b
Code: Bad RIP value.
RSP: 002b:00007ffe4cf5c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe4cf5c930 RCX: 00007fea89ba821b
RDX: 00007ffe4cf5c930 RSI: 00000000c0206440 RDI: 000000000000000e
RBP: 00000000c0206440 R08: 000055af123beec0 R09: 00007fea89c71b00
R10: 00007ffe4cfd3080 R11: 0000000000000246 R12: 000055af123beec0
R13: 000000000000000e R14: 00000000003ae000 R15: 000055af0fdcda40
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 0c on CPU 9.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
Uhhuh. NMI received for unknown reason 1c on CPU 8.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
INFO: task Xorg:37121 blocked for more than 368 seconds.
      Tainted: G           OE     5.2.16-32-ck1-tkg-MuQSS #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Xorg            D    0 37121    916 0x00004004
Call Trace:
 ? __schedule+0x6d6/0xf30
 ? schedule_timeout+0x30b/0x500
 schedule+0x8c/0x220
 schedule_timeout+0x30b/0x500
 wait_for_common.constprop.0+0xcf/0x150
 ? wake_up_q+0x60/0x60
 __flush_work+0x156/0x210
 ? flush_workqueue_prep_pwqs+0x130/0x130
 drain_all_pages+0x14e/0x1a0
 __alloc_pages_nodemask+0x7ff/0x1210
 ttm_pool_populate+0x39e/0x5e0 [ttm]
 ttm_populate_and_map_pages+0x25/0x2b0 [ttm]
 ? kvmalloc_node+0x47/0x80
 ttm_tt_bind+0x3c/0xa0 [ttm]
 ttm_bo_handle_move_mem+0x2bd/0x580 [ttm]
 ? drm_mm_insert_node_in_range+0x335/0x480 [drm]
 ttm_bo_validate+0x26b/0x2d0 [ttm]
 ttm_bo_init_reserved+0x334/0x380 [ttm]
 amdgpu_bo_do_create+0x1a9/0x480 [amdgpu]
 ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu]
 amdgpu_bo_create+0x43/0x200 [amdgpu]
 ? skb_copy_datagram_from_iter+0x60/0x1c0
 amdgpu_gem_create_ioctl+0x14a/0x330 [amdgpu]
 ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
 drm_ioctl_kernel+0xb8/0x100 [drm]
 drm_ioctl+0x253/0x3f0 [drm]
 ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 do_vfs_ioctl+0x43d/0x7a0
 __x64_sys_ioctl+0x62/0x90
 do_syscall_64+0x4e/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fea89ba821b
Code: Bad RIP value.
RSP: 002b:00007ffe4cf5c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe4cf5c930 RCX: 00007fea89ba821b
RDX: 00007ffe4cf5c930 RSI: 00000000c0206440 RDI: 000000000000000e
RBP: 00000000c0206440 R08: 000055af123beec0 R09: 00007fea89c71b00
R10: 00007ffe4cfd3080 R11: 0000000000000246 R12: 000055af123beec0
R13: 000000000000000e R14: 00000000003ae000 R15: 000055af0fdcda40

Running perf top showed that 90% of CPU time was spent in collect_percpu_times.

Arch Linux with MuQSS kernel 5.2.16.
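
A log this repetitive is easier to compare across runs once it is tallied. The following is a minimal sketch, not part of CoreFreq, that counts the "unknown reason" NMI messages per CPU and per reason code; the function name `tally_nmis` and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter

# Matches kernel messages such as:
#   Uhhuh. NMI received for unknown reason 1c on CPU 9.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-f]+) on CPU (\d+)")


def tally_nmis(lines):
    """Count unknown-reason NMIs per CPU and per reason code."""
    per_cpu = Counter()
    per_reason = Counter()
    for line in lines:
        match = NMI_RE.search(line)
        if match:
            per_reason[match.group(1)] += 1
            per_cpu[int(match.group(2))] += 1
    return per_cpu, per_reason


# Short excerpt in the same shape as the log above (illustrative only).
sample = [
    "Uhhuh. NMI received for unknown reason 1c on CPU 9.",
    "Do you have a strange power saving mode enabled?",
    "Dazed and confused, but trying to continue",
    "Uhhuh. NMI received for unknown reason 0c on CPU 9.",
    "Uhhuh. NMI received for unknown reason 0c on CPU 8.",
]

per_cpu, per_reason = tally_nmis(sample)
for cpu, count in sorted(per_cpu.items()):
    print(f"CPU {cpu}: {count} NMI(s)")
for reason, count in sorted(per_reason.items()):
    print(f"reason {reason}: {count} NMI(s)")
```

The same function could be fed the lines of a saved kernel log to compare one boot configuration against another.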

cyring commented 5 years ago

It's hard to tell whether this is a hardware or a software interaction, but can you boot my ISO image from the Wiki? It's built with the mainstream Arch Linux kernel: no MuQSS flavor, no perf, and no software that CoreFreq may conflict with.

terencode commented 5 years ago

The problem is that it's hard to reproduce; it only happened after some hours.

cyring commented 5 years ago

I'm not familiar with the MuQSS scheduler, but I believe it is real-time based, and CoreFreq uses bus-locked assembly instructions to synchronize threads. In user space in particular, the daemon locks the bus to aggregate the per-CPU data. I believe that bus locking may disturb a real-time scheduler.

In the client, under the Settings menu, you can enable the NMI counters. Next, choose the System Interrupts view to monitor NMIs.

I also wonder whether the mainstream scheduler (i.e. the ISO image, or one of your other boot options without MuQSS) will show as many NMIs.

terencode commented 5 years ago

I'll do that and report back with MuQSS vs. CFS, which is the default.

terencode commented 5 years ago

Here you go. I hope it doesn't matter how many apps are started.

CFS: [Screenshot from 2019-09-21 19-03-01]

MuQSS: [Screenshot from 2019-09-21 18-52-58]

cyring commented 5 years ago

As soon as I start perf, such as perf top, CoreFreq counts several Local NMIs.

[Screenshot: 2019-09-21-190758_724x580_scrot]

Are you running perf at the same time as CoreFreq? They conflict on the PMC registers.

terencode commented 5 years ago

Not when I took the screenshots. I ran it after I noticed the freeze, to try to diagnose what was going on.

cyring commented 5 years ago

I see: this is an AMD processor, so no PMC is involved. My comment above is relevant for Intel CPUs only.

I assume that without the CoreFreq driver loaded, you don't encounter such NMIs?

terencode commented 5 years ago

Indeed, I don't.

cyring commented 5 years ago

Do you have lm-sensors running with a k10temp driver?

terencode commented 5 years ago

I'm using https://github.com/ocerman/zenpower

cyring commented 5 years ago

So there:

https://github.com/ocerman/zenpower/blob/d577d3b9b445e46ffc7fa5f49c38f3e4c1ddaf0e/zenpower.c#L290

and here:

https://github.com/cyring/CoreFreq/blob/e5f3ba5c356c9e2eae631dc876791e39928d1d6c/corefreqk.c#L6345

an SMU register usage conflict may happen, because both periodically write the offset 0x00059800 to read the Tctl temperature sensor.

Apparently Zenpower asks you to unload k10temp for the same reason. You will have to unload the Zenpower module before starting CoreFreq.

Does it run better?
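
The kind of conflict suspected here can be shown with a toy model, assuming the sensor is read through an index/data register pair (write the offset first, then read the value). The `SmuPort` class, the second offset and the stored values are hypothetical; only 0x00059800 comes from the discussion. This is a sketch of the hazard, not the code of either driver.

```python
class SmuPort:
    """Toy index/data register pair shared by every driver (assumed model)."""

    def __init__(self, registers):
        self.registers = registers  # offset -> value
        self.index = 0              # last offset written by anyone

    def write_index(self, offset):
        self.index = offset

    def read_data(self):
        # The data register returns whatever the index currently selects.
        return self.registers[self.index]


TEMP_OFFSET = 0x00059800   # offset quoted in this thread
OTHER_OFFSET = 0x0005A000  # hypothetical register used by a second driver

# Stand-in values, chosen only to make the mix-up visible.
port = SmuPort({TEMP_OFFSET: 53, OTHER_OFFSET: 999})

# Driver A selects the temperature register...
port.write_index(TEMP_OFFSET)

# ...but driver B is scheduled in between and selects its own register.
port.write_index(OTHER_OFFSET)
value_seen_by_b = port.read_data()

# Driver A resumes and reads the data register, unaware the index changed.
value_seen_by_a = port.read_data()

print(f"driver A expected 53, got {value_seen_by_a}")
```

With two unsynchronized writers, driver A silently receives the value that driver B selected. Unloading one of the two drivers removes the second writer.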

terencode commented 5 years ago

Running modprobe -r zenpower before probing the module: [image]

Is there a performance cost while this module is running? It's also a bummer you can't monitor your cpu temperature...

cyring commented 5 years ago

> Running modprobe -r zenpower before probing the module:

So you confirm the system is stable with CoreFreq only?

> Is there a performance cost while this module is running?

Minimizing the CPU overhead is my top priority.

> It's also a bummer you can't monitor your cpu temperature...

For Ryzen, I have so far implemented the only sensor which, according to the specs, is a socket-scope register. I presume that a two-socket setup may offer two sensors, but I don't have any Zen processor yet to test with.

terencode commented 5 years ago

> So you confirm the system is stable with CoreFreq only ?

I haven't tested long enough, but it seemed fine.

> For Ryzen, I have so far implemented the only sensor which according to specs is a socket scope register. I presume that a 2 sockets setup may offer 2 sensors but I don't have any Zen processor yet to test with.

Is this only obtainable when running the CLI, or will it register with sensors-detect like zenpower or k10temp?

cyring commented 5 years ago

> Is this only obtainable when running the cli or will it register it with sensors-detect like zenpower or k10temp?

It's a different approach: CoreFreq does not rely on other libs.
Talking about sensors-detect means lm_sensors, which CoreFreq does not use at all. That's why processor register usage conflicts may happen: corefreqk.ko competes with any other driver to claim exclusive access to processor resources such as MSR, PCI, and some control registers. The drawback is that every bit of my program is written from scratch.

terencode commented 5 years ago

OK, so the problem means I can't have temperature monitoring while I'm using it, then.

cyring commented 5 years ago

But you have the temperature; in your screenshot, it is written in the footer:

T[53]

You can also read it in the startup view, Frequency. It is written in the TMP column with its min and its max.

My point was that the temperature is given per processor (and not for each core). That's the way the sensor is specified for the Zen architecture, and only one piece of software can monitor it.

I hope it will help.

terencode commented 5 years ago

Ah thanks for the explanation, I understand better now. What I mean is I can't use my regular monitoring software.

cyring commented 5 years ago

> What I mean is I can't use my regular monitoring software.

I will improve my driver's compatibility with the kernel by using the function amd_smn_read, which serializes SMU access through a mutex. This should let CoreFreq and lm_sensors run simultaneously.

I'll send or post the code changes for you to apply to the sources, then build and test.
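
Why a shared mutex helps can be sketched in user space, assuming the same index/data register model as a stand-in for the SMU access. The names `SmuPort`, `smn_read` and `worker`, the second offset and the values are hypothetical; the real amd_smn_read is a kernel C function and is not reproduced here.

```python
import threading


class SmuPort:
    """Toy index/data register pair shared by all readers (assumed model)."""

    def __init__(self, registers):
        self.registers = registers
        self.index = 0

    def write_index(self, offset):
        self.index = offset

    def read_data(self):
        return self.registers[self.index]


# One lock shared by every reader, playing the role of the kernel mutex.
smn_lock = threading.Lock()


def smn_read(port, offset):
    """Select the offset and read it back as one indivisible step."""
    with smn_lock:
        port.write_index(offset)
        return port.read_data()


def worker(port, offset, expected, errors, rounds=2000):
    """Read one register repeatedly and record any value that is wrong."""
    for _ in range(rounds):
        if smn_read(port, offset) != expected:
            errors.append(offset)


# 0x00059800 is the offset quoted in this thread; the rest is made up.
registers = {0x00059800: 53, 0x0005A000: 999}
port = SmuPort(registers)
errors = []

threads = [
    threading.Thread(target=worker, args=(port, offset, value, errors))
    for offset, value in registers.items()
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

mismatches = len(errors)
print(f"mismatched reads: {mismatches}")
```

The design point is that every reader must go through the same lock. A driver that touches the registers directly bypasses it, so the protection only holds once all drivers use the common accessor.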

terencode commented 5 years ago

Sounds good, thanks.

cyring commented 5 years ago

Hello,

Attached is version 1.67 for your tests.

CoreFreq_1670.tar.gz

terencode commented 5 years ago

Thanks. However, it'd be easier for me if you just pushed it to a new branch, please.

terencode commented 5 years ago

I can't install it: insmod corefreqd.ko caused a hard hang, and make install says:

- SSL error:02001002:system library:fopen:No such file or directory: crypto/bio/bss_file.c:69
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: crypto/bio/bss_file.c:76
sign-file: certs/signing_key.pem: No such file or directory

cyring commented 5 years ago

Can you post the output of the command:

lspci -nn

Then build and start this version as follows:

  1. as user
    make clean all
  2. as root
    insmod corefreqk.ko
    ./corefreqd
  3. as user
    ./corefreq-cli

terencode commented 5 years ago

No problem ^^ Here you go:

00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex [1022:1480]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU [1022:1481]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
00:08.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
00:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0 [1022:1440]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1 [1022:1441]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2 [1022:1442]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3 [1022:1443]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4 [1022:1444]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5 [1022:1445]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6 [1022:1446]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7 [1022:1447]
01:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller [1022:43b9] (rev 02)
01:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller [1022:43b5] (rev 02)
01:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset PCIe Upstream Port [1022:43b0] (rev 02)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller [1b21:1343]
04:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
0a:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1470] (rev c1)
0b:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1471]
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
0e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
0e:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
0e:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
0e:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
0f:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
10:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
cyring commented 4 years ago

Hello,

In issue #54 we tested the use of the kernel SMU API. It appears to work.

To build with the kernel API:

make FEAT_DBG=2 clean all

During the build you will see messages confirming that the API is used.

Next, you should be able to read the temperature at the same time as lm_sensors.

terencode commented 4 years ago

Hey, I tried installing it but I get this when using make module-install:

make -C /lib/modules/5.4.1-3-tkg-bmq/build M=/mnt/WDC/Documents/Git/CoreFreq modules_install
make[1]: Entering directory '/usr/lib/modules/5.4.1-3-tkg-bmq/build'
  INSTALL /home/terence/Documents/Git/CoreFreq/corefreqk.ko
At main.c:160:
- SSL error:02001002:system library:fopen:No such file or directory: crypto/bio/bss_file.c:69
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: crypto/bio/bss_file.c:76
sign-file: certs/signing_key.pem: No such file or directory
  DEPMOD  5.4.1-3-tkg-bmq
make[1]: Leaving directory '/usr/lib/modules/5.4.1-3-tkg-bmq/build'
cyring commented 4 years ago

I think your kernel only loads signed modules, so you will have to sign the CoreFreq driver:

terencode commented 4 years ago

I didn't enforce signed modules only. I can manually insmod the driver fine.
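Whether a kernel refuses unsigned modules can be checked from sysfs. The sketch below is only an illustration: `sig_enforce_status` is a hypothetical helper name, and it assumes the standard `sig_enforce` parameter of the kernel's module subsystem, which may be absent on kernels built without module signing support. When enforcement is off, the `sign-file` error during `modules_install` only means the module was installed unsigned; it can still be loaded.

```shell
#!/bin/sh
# Report whether the running kernel enforces module signatures.
# sig_enforce_status is a hypothetical helper, not part of CoreFreq.
# $1 (optional): path to the parameter file, defaulting to the sysfs location.
sig_enforce_status() {
    param="${1:-/sys/module/module/parameters/sig_enforce}"
    if [ ! -r "$param" ]; then
        # Parameter not exposed: cannot tell from sysfs alone.
        echo "unknown"
        return 0
    fi
    case "$(cat "$param")" in
        Y|1) echo "enforced" ;;
        *)   echo "not enforced" ;;
    esac
}

sig_enforce_status
```

If this reports "not enforced", an unsigned `corefreqk.ko` should load with `insmod`, as described above.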

cyring commented 4 years ago

So you confirm it's now working?

Btw, did you try:

make FEAT_DBG=2 clean all

This will let you use CoreFreq in parallel with k10temp.

terencode commented 4 years ago

It seems to work correctly now. However, I had to make some changes to your AUR package. Here is the diff:

diff --git a/PKGBUILD b/PKGBUILD
index 0f2a5bd..ebda51c 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -1,9 +1,7 @@
 # Maintainer: CyrIng <labs[at]cyring[dot]fr>
 # Contributor: CyrIng <labs[at]cyring[dot]fr>
-_gitname=CoreFreq
 pkgname=corefreq-git
-realname=corefreq
-pkgver=1.69
+pkgver=r829.b62b3f0
 pkgrel=1
 pkgdesc="CoreFreq, Processor monitoring software with BIOS like functionalities"
 arch=('x86_64')
@@ -11,21 +9,38 @@ url='https://github.com/cyring/CoreFreq'
 license=('GPL2')
 depends=('dkms')
 makedepends=('git')
-source=(git+${url}.git)
-md5sums=('SKIP')
-install=${realname}.install
+source=($pkgname::git+${url}.git
+   'dkms.conf')
+md5sums=('SKIP'
+         '1be42c3d47c2efda9b49d8a7f3d12582')
+install=corefreq.install
+
+pkgver() {
+  cd "$pkgname"
+  printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
+}
+
+
+prepare() {
+  cd ${srcdir}/${pkgname}
+  make FEAT_DBG=2 clean all 
+}

 package() {
-   cd ${srcdir}/${_gitname}
-   BINDIR=${pkgdir}/bin
-   SRCTREE=${pkgdir}/usr/src
-   DRVTREE=${SRCTREE}/corefreqk-${pkgver}
-   # dkms setup
-   install -Dm 0644 Makefile ${DRVTREE}/Makefile
-   install -Dm 0644 dkms.conf ${DRVTREE}/dkms.conf
-   install -Dm 0755 scripter.sh ${DRVTREE}/scripter.sh
-   install -m 0644 *.c *.h ${DRVTREE}/
-   # systemd setup
-   install -Dm 0644 corefreqd.service \
-       ${pkgdir}/usr/lib/systemd/system/corefreqd.service
+  cd ${srcdir}/${pkgname}
+
+  BINDIR=${pkgdir}/bin
+  SRCTREE=${pkgdir}/usr/src
+  DRVTREE=${SRCTREE}/corefreqk-${pkgver}
+  # dkms setup
+  install -Dm 0644 ../dkms.conf ${DRVTREE}/dkms.conf
+  sed -e "s/@PKGVER@/${pkgver}/" \
+      -i "${DRVTREE}/dkms.conf"
+
+  install -Dm 0644 Makefile ${DRVTREE}/Makefile
+  install -Dm 0755 scripter.sh ${DRVTREE}/scripter.sh
+  install -m 0644 *.c *.h ${DRVTREE}/
+  # systemd setup
+  install -Dm 0644 corefreqd.service \
+   ${pkgdir}/usr/lib/systemd/system/corefreqd.service
 }
diff --git a/dkms.conf b/dkms.conf
new file mode 100644
index 0000000..8b21081
--- /dev/null
+++ b/dkms.conf
@@ -0,0 +1,27 @@
+# CoreFreq
+# Copyright (C) 2015-2019 CYRIL INGENIERIE
+# Licenses: GPL2
+#
+AUTOINSTALL="yes"
+REMAKE_INITRD="no"
+DRV_PATH=/kernel/drivers/misc
+DRV_VERSION=@PKGVER@
+PACKAGE_NAME="corefreqk"
+PACKAGE_VERSION="$DRV_VERSION"
+BUILT_MODULE_NAME[0]="corefreqk"
+DEST_MODULE_LOCATION[0]="$DRV_PATH"
+CLEAN="make -C $source_tree/$PACKAGE_NAME-$PACKAGE_VERSION clean"
+MAKE[0]="make -C $source_tree/$PACKAGE_NAME-$PACKAGE_VERSION"
+#
+DAEMON="\$source_tree/\$PACKAGE_NAME-\$PACKAGE_VERSION/corefreqd"
+CLIENT="\$source_tree/\$PACKAGE_NAME-\$PACKAGE_VERSION/corefreq-cli"
+SCRIPT="scripter.sh"
+COMMAND="install -Dm 0755 -s -t /bin"
+OBJECTS="\$source_tree/\$PACKAGE_NAME-\$PACKAGE_VERSION/*.o"
+BINARIES="/bin/corefreqd /bin/corefreq-cli"
+CLEANUP="rm -f"
+#
+POST_BUILD="$SCRIPT $COMMAND -- $DAEMON $CLIENT"
+POST_INSTALL="$SCRIPT $CLEANUP -- $OBJECTS"
+POST_REMOVE="$SCRIPT $CLEANUP -- $BINARIES"
+#
cyring commented 4 years ago

Very interesting. I need to go through these package changes, thank you.

Would you mind showing me the CoreFreq core temperatures beside those from lm_sensors? I want to check whether FEAT_DBG=2 makes things accurate.

terencode commented 4 years ago

Sure. What I did was auto-generate the version, since it's a git package, and substitute it dynamically inside dkms.conf. I also added a prepare() for FEAT_DBG=2.

As such? image
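The version substitution mentioned above can be tried in isolation. This is a minimal sketch, assuming GNU sed (for `-i` without a suffix argument); `render_dkms_conf` is a hypothetical helper name, and the temporary file stands in for the real dkms.conf. In the PKGBUILD the version string comes from `pkgver()`; here it is hard-coded to the value shown in the diff.

```shell
#!/bin/sh
# Replace the @PKGVER@ placeholder in a dkms.conf-style file, in place.
# render_dkms_conf is a hypothetical helper; requires GNU sed.
# $1: version string (must not contain '/'), $2: file to edit
render_dkms_conf() {
    sed -e "s/@PKGVER@/$1/" -i "$2"
}

# Demonstration on a throwaway file, not the real dkms.conf.
demo=$(mktemp)
printf 'DRV_VERSION=@PKGVER@\nPACKAGE_NAME="corefreqk"\n' > "$demo"
render_dkms_conf "r829.b62b3f0" "$demo"
cat "$demo"
rm -f "$demo"
```

Keeping a placeholder in the template means dkms.conf does not have to be edited by hand each time the git revision changes.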

cyring commented 4 years ago

There are some differences, but I can't tell whether they are due to each program's sampling time.

cyring commented 4 years ago

Hello,

The latest version, 1.69.7, now lets you run CoreFreq in parallel with k10temp (lm_sensors).

terencode commented 4 years ago

I'm not aware of the need for a temperature offset. Here is a hopefully more accurate and complete screenshot (note that I'm using https://github.com/electrified/asus-wmi-sensors): image

cyring commented 4 years ago

Thanks. The CoreFreq measurement is identical to the zenpower and Asus core temperatures. But I see an issue with the initialization of the minimum temperature, which did not happen with the previous Zen generation, where the offset is the minimum value. Here the screenshot shows zero, which is a wrong value.
I need to fix this...
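A minimum reading of zero is the kind of symptom a running minimum shows when it is seeded with 0 instead of a real sample. The sketch below only illustrates that general pitfall; it is not CoreFreq's code, and the function name and sample values are made up for the example.

```shell
#!/bin/sh
# Track the minimum of a series of integer temperature samples.
# The minimum starts empty and is seeded with the first sample, so it can
# never report a value (such as 0) that was not actually measured.
running_min() {
    min=""
    for sample in "$@"; do
        if [ -z "$min" ] || [ "$sample" -lt "$min" ]; then
            min=$sample
        fi
    done
    echo "$min"
}

running_min 41 38 45 39
```

Had `min` started at 0, no positive sample would ever replace it, and the displayed minimum would stay at zero.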

cyring commented 4 years ago

Version 1.69.8 provides a fix so that the minimum temperature is no longer zero. Feel free to test, thank you.

terencode commented 4 years ago

Looks like it's working, great job :) image

Time to close this now, right?

cyring commented 4 years ago

Thanks. Yes you can close the issue.