cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

module crashed with Xeon SandyBridge/EP #243

Closed uzername123 closed 3 years ago

uzername123 commented 3 years ago

CentOS 7.6, kernel 3.10.0-957.27: the module crashed on insmod.

The motherboard is a GA-X79-UD3 v1.0; the CPU is a Xeon E5-2689.

===================
[ 270.192343] CoreFreq(0:8): Processor [ 06_2D] Architecture [SandyBridge/eXtreme.EP] SMT [16/16]
[ 270.192446] general protection fault: 0000 [#1] SMP
[ 270.192498] Modules linked in: corefreqk(OE+) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod nvidia_drm(POE) nvidia_modeset(POE) vfat fat nvidia(POE) iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel joydev pcspkr sg snd_hda_codec drm_kms_helper snd_hda_core i2c_i801 lpc_ich snd_hwdep snd_seq syscopyarea snd_seq_device sysfillrect
[ 270.193334] sysimgblt fb_sys_fops snd_pcm drm snd_timer mei_me mei snd soundcore drm_panel_orientation_quirks ioatdma dca tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32c_intel ahci serio_raw e1000e libahci libata ptp pps_core wmi
[ 270.193393] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: P OE ------------ 3.10.0-957.27.2.el7.x86_64 #1
[ 270.193393] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UD3, BIOS F20 03/19/2014
[ 270.193393] task: ffffffffb2418480 ti: ffffffffb2400000 task.ti: ffffffffb2400000
[ 270.193393] RIP: 0010:[] [] Start_Uncore_SandyBridge_EP+0x5d/0x70 [corefreqk]
[ 270.193393] RSP: 0018:ffff9f83af203f58 EFLAGS: 00010046
[ 270.193393] RAX: 0000000020000000 RBX: ffff9f83a550d000 RCX: 0000000000000c00
[ 270.193393] RDX: 0000000000000000 RSI: ffff9f83a5509000 RDI: 0000000000000000
[ 270.193393] RBP: ffff9f83af203f58 R08: ffff9f83a5509000 R09: 000000000000003a
[ 270.193393] R10: 000000acb9ba665c R11: 0000000ad657b7b8 R12: 0000000ab38fe39e
[ 270.193393] R13: 000000009a3c7b50 R14: 00000000003e4692 R15: 0000009e5d79e054
[ 270.193393] FS: 0000000000000000(0000) GS:ffff9f83af200000(0000) knlGS:0000000000000000
[ 270.193393] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 270.193393] CR2: 00007f074b896000 CR3: 0000000023610000 CR4: 00000000000607f0
[ 270.193393] Call Trace:
[ 270.193393]
[ 270.193393] [] Start_SandyBridge_EP+0x19f/0x270 [corefreqk]
[ 270.193393] [] flush_smp_call_function_queue+0x63/0x130
[ 270.193393] [] generic_smp_call_function_single_interrupt+0x13/0x30
[ 270.193393] [] smp_call_function_single_interrupt+0x2d/0x40
[ 270.193393] [] call_function_single_interrupt+0x162/0x170
[ 270.193393]
[ 270.193393] [] ? hrtimer_start_range_ns+0x1ed/0x3c0
[ 270.193393] [] ? cpuidle_enter_state+0x54/0xd0
[ 270.193393] [] ? cpuidle_enter_state+0x4d/0xd0
[ 270.193393] [] cpuidle_idle_call+0xde/0x230
[ 270.193393] [] arch_cpu_idle+0xe/0xc0
[ 270.193393] [] cpu_startup_entry+0x14a/0x1e0
[ 270.193393] [] rest_init+0x77/0x80
[ 270.193393] [] start_kernel+0x44b/0x46c
[ 270.193393] [] ? repair_env_string+0x5c/0x5c
[ 270.193393] [] ? early_idt_handler_array+0x120/0x120
[ 270.193393] [] x86_64_start_reservations+0x24/0x26
[ 270.193393] [] x86_64_start_kernel+0x154/0x177
[ 270.193393] [] start_cpu+0x5/0x14
[ 270.193393] Code: c2 48 c1 ea 20 0f 30 30 c9 0f 32 48 c1 e2 20 89 c0 48 09 c2 48 89 d0 48 89 96 38 01 00 00 48 0d 00 00 00 20 48 89 c2 48 c1 ea 20 <0f> 30 5d c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66
[ 270.193393] RIP [] Start_Uncore_SandyBridge_EP+0x5d/0x70 [corefreqk]
[ 270.193393] RSP

cyring commented 3 years ago

According to register RCX, the faulting read or write access targets MSR_SNB_EP_PMON_GLOBAL_CTRL (0x00000c00).
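For readers who want to reproduce this reading of the trace, here is a minimal Python sketch, not part of CoreFreq. It assumes the register dump format shown in the log above; the function names `parse_registers`, `wrmsr_operands` and `faulting_bytes` are hypothetical. It relies on two x86 facts: RDMSR and WRMSR take the MSR index from ECX, and WRMSR (opcode `0f 30`) writes the 64-bit value held in EDX:EAX. The `<0f>` marker in the `Code:` line flags the first byte of the faulting instruction.

```python
import re

# Matches pairs such as "RAX: 0000000020000000"; it skips "RSP: 0018:..." and
# the control registers (CR0, CR2, ...), which use a different layout or prefix.
REGISTER_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s*([0-9a-fA-F]{16})\b")


def parse_registers(oops_text):
    """Return {register name: integer value} for the general-purpose registers."""
    return {name: int(value, 16) for name, value in REGISTER_RE.findall(oops_text)}


def wrmsr_operands(regs):
    """Return (msr_index, value) as WRMSR would see them: ECX and EDX:EAX."""
    msr_index = regs["RCX"] & 0xFFFFFFFF
    value = ((regs["RDX"] & 0xFFFFFFFF) << 32) | (regs["RAX"] & 0xFFFFFFFF)
    return msr_index, value


def faulting_bytes(code_line):
    """Return the byte marked <xx> in a 'Code:' line plus the byte after it."""
    tokens = code_line.split()
    for i, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            first = int(token.strip("<>"), 16)
            rest = [int(t, 16) for t in tokens[i + 1:i + 2]]
            return [first] + rest
    return []
```

Fed with the registers of this issue (RAX 0x20000000, RDX 0, RCX 0xc00), `wrmsr_operands` yields MSR 0xc00 and value 0x20000000, and `faulting_bytes` returns `0f 30`, which suggests the fault happened on the write rather than on a read.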

Sorry, the Uncore PMU of Xeon E5 (06_2D) is programmed with another set of MSR registers I have not implemented yet.

Can you please pull and test the develop branch? I have just commented out that Uncore part of the software.

uzername123 commented 3 years ago

Compiled dev branch, seems all worked ok on E5-2689. At least no crash/kerneldump :)

cyring commented 3 years ago

> Compiled dev branch, seems all worked ok on E5-2689. At least no crash/kerneldump :)

Great! But we are missing the Uncore frequency in the "Package cycles" view, and probably other things I could spot if you share corefreq-cli outputs and screenshots.

Your Xeon is specified with Uncore and IMC box registers which depend on its topology, especially on multi-socket systems. Are you OK to test new code? This may take time: about 5-15 working days, multiple test rounds and a few crashes.

uzername123 commented 3 years ago

OK. I have several types of Xeon CPUs that I can crash almost freely :) They are 2x L56xx, 2x E56xx, 2x X56xx, E5-26xx single-CPU systems, and single and dual E5-26xx v2. Which system should I use for testing/crashing?

cyring commented 3 years ago

> OK. I have several types of Xeon CPUs that I can crash almost freely :) They are 2x L56xx, 2x E56xx, 2x X56xx, E5-26xx single-CPU systems, and single and dual E5-26xx v2. Which system should I use for testing/crashing?

Hello,

Thanks for your help.

I would like to focus on Xeon SandyBridge EP processors (as I have already programmed for Westmere, a single-socket W3690).

Later, dual- or quad-socket systems will serve to improve the topology established by CoreFreq and the various per-cluster measurements.

To start with, any SNB or IVB with exactly the same CPUID signature as in this issue, CPUID 06_2D, will be just fine for implementing the Uncore MSR.

About the kernel choice: CentOS can do the job if it is not too old, but a 5.X kernel version is preferable. Unloading or blacklisting the Linux and/or vendor drivers will give CoreFreq full R/W access to the registers. In particular, blacklist the nmi_watchdog. Thus a bare-metal distribution will be perfect; my favorite is ArchLinux.
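Before loading the driver, a tester may want to confirm whether the NMI watchdog is still active. The following is a small Python sketch under stated assumptions, not a CoreFreq feature: it interprets the content of the Linux files `/proc/sys/kernel/nmi_watchdog` and `/proc/cmdline`, and the helper names are hypothetical.

```python
def nmi_watchdog_enabled(sysctl_text):
    """Interpret the content of /proc/sys/kernel/nmi_watchdog ("0" or "1")."""
    return sysctl_text.strip() == "1"


def cmdline_disables_watchdog(cmdline):
    """True if the boot command line carries the nmi_watchdog=0 parameter."""
    return "nmi_watchdog=0" in cmdline.split()


def read_text(path):
    """Read a small procfs file as text."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return handle.read()


if __name__ == "__main__":
    # Only meaningful on a Linux system that exposes these procfs entries.
    print("watchdog active:",
          nmi_watchdog_enabled(read_text("/proc/sys/kernel/nmi_watchdog")))
    print("disabled at boot:",
          cmdline_disables_watchdog(read_text("/proc/cmdline")))
```

The parsing is kept in pure functions so it can be checked without touching the running system.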

Of course, the compilation prerequisites are gcc, libc, kernel-headers, make and git, plus any code editor of your choice: nano, vi.

Regards, Cyril

cyring commented 3 years ago

Btw, in the past I programmed IvyBridge/EP, which shares the same Uncore code in the driver as your SNB/EP.

IVB/EP is CPUID 06_3E

Screenshot in Wiki https://gist.github.com/cyring/4c1a1f895e53ece642a52c368bdbaf3b

So your Xeon 06_2D is really the Processor I need to work on.

cyring commented 3 years ago

Hello,

For your tests, the latest develop branch has new code for your Xeon 06_2D.

Can you please give it a try?

  1. Make sure not to mix code among CoreFreq sources: pull the development branch only
  2. Each time, fully rebuild the software with: make clean all
  3. Be sure to stop/unload any previous instances, in this order: client, daemon, driver
  4. Save and close your personal files; sync your FS
  5. Now start the freshly built driver
  6. Check the kernel log messages for any error. Hopefully no crash this time.
  7. Start the daemon, preferably in debug mode: corefreqd -d
  8. Start the client and go to the "Package cycles" view to check whether the UNCORE counter is giving cycles
  9. Post the various outputs, such as: corefreq-cli -s
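Step 6 above can be partly automated. The sketch below is a hedged illustration in Python, not a CoreFreq tool: it filters kernel log text (for example the saved output of `dmesg`) for lines that usually accompany a crash. The marker list and the function name `suspicious_lines` are assumptions chosen for this example; extend them as needed.

```python
# Substrings that commonly appear in a kernel log when something went wrong.
# This list is an assumption for illustration, not an exhaustive catalogue.
FAULT_MARKERS = ("general protection fault", "Oops", "BUG:", "Call Trace")


def suspicious_lines(log_text, markers=FAULT_MARKERS):
    """Return the log lines that contain at least one of the markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for line in suspicious_lines(sys.stdin.read()):
        print(line)
```

An empty result does not prove the driver is healthy; it only means none of the listed markers was found.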

Thank you

uzername123 commented 3 years ago

OK, I have already compiled the dev branch. Yes, my main workstation is a dual 2680 v2, and I also have a uniprocessor 2650 v2. I will check everything mentioned ASAP tomorrow. My working OS is CentOS 7.6 with a 3.10.x kernel (a requirement), so I need to install/compile 7.9 with a recent 5.x kernel (I prefer not to use 8.2/8.3). Btw, is it possible to use MSR registers to unlock Turbo Boost modes on different CPUs (say, to modify the core counts/frequencies for boost mode)?

cyring commented 3 years ago

I believe you can program the MSR directly to alter the Turbo frequency. Refer to the Intel SDM specifications to locate the tables associated with your CPUID. The bit layout and its programming logic differ per architecture: for example, with or without a semaphore bit to finalize the read-modify-write operation.
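As a read-only starting point, here is a minimal Python sketch, assuming a Linux system with the `msr` kernel module loaded and root privileges. It reads MSR_TURBO_RATIO_LIMIT (address 0x1AD in the Intel SDM) through `/dev/cpu/N/msr` and splits it into bytes. The helper names are hypothetical, and the byte-per-active-core-count interpretation holds only for some architectures, so check the SDM table for your CPUID before trusting the result. The sketch deliberately writes nothing.

```python
import os
import struct

MSR_TURBO_RATIO_LIMIT = 0x1AD  # address from the Intel SDM; layout varies per CPU


def read_msr(cpu, address):
    """Read one 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register address.
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def split_ratio_bytes(value):
    """Split a 64-bit value into 8 bytes, least significant byte first."""
    return [(value >> (8 * i)) & 0xFF for i in range(8)]


if __name__ == "__main__":
    ratios = split_ratio_bytes(read_msr(0, MSR_TURBO_RATIO_LIMIT))
    for index, ratio in enumerate(ratios):
        # On architectures using the byte-per-group layout, each byte is a
        # maximum turbo ratio; multiply by the bus clock to get a frequency.
        print(f"byte {index}: ratio {ratio}")
```

Writing a modified value back would need the architecture-specific logic mentioned above (including any semaphore bit), which this sketch does not attempt.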

uzername123 commented 3 years ago

[image] Here it is on the 2689 single-CPU system, CentOS 7.6, kernel 3.10.0-967.27.2

cyring commented 3 years ago

> [image] Here it is on the 2689 single-CPU system, CentOS 7.6, kernel 3.10.0-967.27.2

Thanks a lot for your test. No more crashes, but apparently UNCORE: remains at zero, which means that its counter is not started. Thus there is more work to do. Please let me know if you wish to continue with code testing. Regards