Closed h1z1 closed 3 years ago
which is a SMT 64-bits register ( _thread[1:0]
)
modprobe msr
rdmsr -aX 0xC00000E9
Sorry for the late reply
Just confirming the same output after the rdmsr
And fwiw
[root@wacko CoreFreq]# ./corefreq-cli
[root@wacko CoreFreq]# rdmsr -aX 0xC00000E9 | xargs
1C6 1C6 1C6 F6BC 17B68 1C6 3234 1C6 E4507 1C6 150D 1C6 14237E D3048 1C6 1C6 401F23 1C7 1C6 1C6 1C6 1C6 1C6 1C6 1C6 1C6 1C6 1C6 1C6 1C6 1C6 CB5
The initial read was all zeros
Can you stress one Core per CCD. Use the CoreFreq Tools and Topology windows. In the same time, get the raw counters elapsed by one second interval:
rdmsr -aX 0xC00000E9
sleep 1
rdmsr -aX 0xC00000E9
B/c I want the highest counter value to guess its bits width but also to understand its sampling period. I suspect EPYC and TR v1 are counting differently than Zen v?
According to your CPUID, the INST counter is capable.
Please also post the output of corefreq-cli -s
Thank you
Hmm... turboing a core in corefreq causes that core to report accurately. Not entirely sure why the units changed from scientific form..
# while true; do c=$(rdmsr -ad 0xC00000E9 |tail -1); sleep 1; l=$(rdmsr -ad 0xC00000E9 |tail -1); echo $((l - c)); done
10776668920
10779641649
10776669139
10775183308
10775979373
10775464271
10776362897
10777289285
10777553372
^C
disabled turbo on CPU 31
# while true; do c=$(rdmsr -ad 0xC00000E9 |tail -1); sleep 1; l=$(rdmsr -ad 0xC00000E9 |tail -1); echo $((l - c)); done
-2896
0
-2885
0
0
0
0
0
0
0
0
0
#
Guessing this might help? You can see where it settled down after the core was turboed
# while true; do ./corefreq-cli -j | jq -c '.["Cpu"][25]["FlipFlop"][0]["State"]' ; sleep 1; done
{"IPS":0,"IPC":5e-06,"CPI":216882,"Turbo":0.000104,"C0":0.000121,"C3":0,"C6":0,"C7":0,"C1":0.999879,"MC6":0}
{"IPS":0,"IPC":5e-06,"CPI":216882,"Turbo":0.000104,"C0":0.000121,"C3":0,"C6":0,"C7":0,"C1":0.999879,"MC6":0}
{"IPS":0,"IPC":0,"CPI":0,"Turbo":0.000184,"C0":0.000209,"C3":0,"C6":0,"C7":0,"C1":0.999791,"MC6":0}
{"IPS":0,"IPC":0,"CPI":0,"Turbo":0.000184,"C0":0.000209,"C3":0,"C6":0,"C7":0,"C1":0.999791,"MC6":0}
{"IPS":0,"IPC":1e-06,"CPI":943884,"Turbo":0.000224,"C0":0.000263,"C3":0,"C6":0,"C7":0,"C1":0.999737,"MC6":0}
{"IPS":0,"IPC":1e-06,"CPI":943884,"Turbo":0.000224,"C0":0.000263,"C3":0,"C6":0,"C7":0,"C1":0.999737,"MC6":0}
{"IPS":5133771863.251443,"IPC":46430731932135.11,"CPI":0,"Turbo":9.6e-05,"C0":0.000111,"C3":0,"C6":0,"C7":0,"C1":0.99989,"MC6":0}
{"IPS":5133771863.251443,"IPC":46430731932135.11,"CPI":0,"Turbo":9.6e-05,"C0":0.000111,"C3":0,"C6":0,"C7":0,"C1":0.99989,"MC6":0}
{"IPS":0.2494,"IPC":2.990874,"CPI":0.33435,"Turbo":0.08337,"C0":0.083387,"C3":0,"C6":0,"C7":0,"C1":0.916613,"MC6":0}
{"IPS":0.2494,"IPC":2.990874,"CPI":0.33435,"Turbo":0.08337,"C0":0.083387,"C3":0,"C6":0,"C7":0,"C1":0.916613,"MC6":0}
{"IPS":2.994897,"IPC":2.994896,"CPI":0.333901,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.994897,"IPC":2.994896,"CPI":0.333901,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.995003,"IPC":2.995003,"CPI":0.33389,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.995003,"IPC":2.995003,"CPI":0.33389,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.994931,"IPC":2.994931,"CPI":0.333897,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.994931,"IPC":2.994931,"CPI":0.333897,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.994943,"IPC":2.994943,"CPI":0.333896,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.994943,"IPC":2.994943,"CPI":0.333896,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.993228,"IPC":2.993228,"CPI":0.334087,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.993228,"IPC":2.993228,"CPI":0.334087,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.993031,"IPC":2.993031,"CPI":0.334109,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
{"IPS":2.993031,"IPC":2.993031,"CPI":0.334109,"Turbo":1,"C0":1,"C3":0,"C6":0,"C7":0,"C1":0,"MC6":0}
I'm checking code and don't understand why counter is read as zero
Beside computing IPC
and CPI
where a test is made to protect a division by zero at:
https://github.com/cyring/CoreFreq/blob/1887c7b554288e3e9e1f359c657c94a68b832be7/corefreqd.c#L1011
It would mean URC
is dropping to zero too !
Can you please show the Core cycles
view
I have not found among previous issues a full report of CoreFreq with Zen gen 1 beside one Epyc/Napple . I don't know if the same happens with 1700X , 1800X and so on
Both turbo and non-turbo core-cycles appear OK
Maybe related... noticed the cli was really slow (much of it not rendering at all) after running overnight. corefreqd was taking 200% CPU. Still at 100% after a restart.
USER PR NI RES SHR S P %CPU %MEM TIME+ COMMAND
root 20 0 1.3m 1.1m S 16 100.1 0.0 12:52.11 corefreqd-cmgr
Hmm.. some fuckery somewhere.. would appear the weird values in corefreq coincides with negative values being reported even with rdmsr
# rd() { rdmsr -ad -p $1 0xC00000E9 |head -24 | head -1; } ;
# { for i in {1..30}; do c=$(rd 31); sleep 0.1; l=$(rd 31); printf "last:%s current:%s diff:%s\n" "$l" "$c" $((l - c)); done ; echo; }
last:369 current:369 diff:0
last:369 current:369 diff:0
last:3264 current:369 diff:2895
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:3264 diff:-2895
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:369 diff:0
last:368 current:369 diff:-1
last:369 current:3253 diff:-2884
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:3253 diff:-2884
last:13965 current:369 diff:13596
last:3253 current:369 diff:2884
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:3253 diff:-2884
last:369 current:369 diff:0
last:2069 current:369 diff:1700
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:3264 diff:-2895
last:369 current:369 diff:0
last:369 current:369 diff:0
last:369 current:369 diff:0
Not sure how that is possible... it's not supposed to reset!
I stand corrected, it IS volatile.
Volatile - Indicates that a register field value may be modified by hardware, firmware, or microcode
when fetching the first instruction and/or might have read or write side effects. No read may
depend on the results of a previous read and no write may be omitted based on the value of a
previous read or write.
What is the point of it then.. single user debugging? Can't infer anything from them.
Both turbo and non-turbo core-cycles appear OK
OK this is at least a good news
Maybe related... noticed the cli was really slow (much of it not rendering at all) after running overnight. corefreqd was taking 200% CPU. Still at 100% after a restart.
Please confirm, if source code is the latest. No mix with all version. Clean rebuild is:
make clean all
USER PR NI RES SHR S P %CPU %MEM TIME+ COMMAND root 20 0 1.3m 1.1m S 16 100.1 0.0 12:52.11 corefreqd-cmgr
CPU is high in the Child process which manages the Stress threads (aka Tools in the Cli) I'm reviewing the code again ...
Hmm.. some fuckery somewhere.. would appear the weird values in corefreq coincides with negative values being reported even with rdmsr
# rd() { rdmsr -ad -p $1 0xC00000E9 |head -24 | head -1; } ; # { for i in {1..30}; do c=$(rd 31); sleep 0.1; l=$(rd 31); printf "last:%s current:%s diff:%s\n" "$l" "$c" $((l - c)); done ; echo; } last:369 current:369 diff:0 last:369 current:369 diff:0 last:3264 current:369 diff:2895 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:3264 diff:-2895 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:369 diff:0 last:368 current:369 diff:-1 last:369 current:3253 diff:-2884 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:3253 diff:-2884 last:13965 current:369 diff:13596 last:3253 current:369 diff:2884 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:3253 diff:-2884 last:369 current:369 diff:0 last:2069 current:369 diff:1700 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:3264 diff:-2895 last:369 current:369 diff:0 last:369 current:369 diff:0 last:369 current:369 diff:0
Not sure how that is possible... it's not supposed to reset!
How can the Instructions fall down to zero. Can a CPU do that !
I stand corrected, it IS volatile.
Volatile - Indicates that a register field value may be modified by hardware, firmware, or microcode when fetching the first instruction and/or might have read or write side effects. No read may depend on the results of a previous read and no write may be omitted based on the value of a previous read or write.
What is the point of it then.. single user debugging? Can't infer anything from them.
You mean I should have declare some variables volatile
in the code ?
EDIT: Volatile
, true, when programming machine code.
https://github.com/cyring/CoreFreq/blob/1887c7b554288e3e9e1f359c657c94a68b832be7/corefreqk.h#L96
But in a script, using rdmsr, there are too much instructions involved to store a counter and have an out of order side effect.
Are you getting the same issue when booting with the CoreFreq Live Image
corefreq_online_update
next corefreq_build_run
Please confirm, if source code is the latest. No mix with all version.
# cat .git/refs/heads/develop 1887c7b554288e3e9e1f359c657c94a68b832be7
You mean I should have declare some variables
volatile
in the code ?
I would hope the value is always a number at least, but it's not monotonic. It's an unreliable value. Except while in turbo?
Re the CPU use of corefreqd it went away after reloading the kernel module.
Just in case: is the microcode the latest available ?
AFAIK.. Unless it's a BIOS//AGESA issue.
F11e BIOS
cat /sys/devices/system/cpu/cpu0/microcode/version
0x8001137
AFAIK.. Unless it's a BIOS//AGESA issue.
F11e BIOS cat /sys/devices/system/cpu/cpu0/microcode/version 0x8001137
One of these BIOS versions ?
Update AGESA 1.1.0.1a
Please try CoreFreq with a stable BIOS
Correct, the bios is running F11e which has AGESA 1.1.01a. There were reports of problems with F12
For shits'n giggles I updated that box to F12, no diff.
Correct, the bios is running F11e which has AGESA 1.1.01a. There were reports of problems with F12
For shits'n giggles I updated that box to F12, no diff.
I've asked other Zen1 users for the same test; so far no answer.
At least I would like to know if a simple Ryzen 1000 also has an issue with the instructions counter. Because I'm stuck now.
I would expect them to based on the documentation from AMD. Maybe they aren't as erratic though. There's no indication the value is even wrapping, it's just bogus / random.
I would expect them to based on the documentation from AMD. Maybe they aren't as erratic though. There's no indication the value is even wrapping, it's just bogus / random.
Among all my recent searches, I read s/o is rather programming one of the general PMC counters to count the retired instructions, rather than using the IRPerf register, because it is buggy. Unfortunately I did not bookmark that page.
Anyway this issue can't be unique to Threadripper but shared among the whole architecture, including EPYC. Do you ?
Agreed. Might be worth looking at how perf is counting cycles? My guess is they're discarding bogus values but how do you interpolate anything if you can't trust the value at all.
Confirmed the same issue here
Beside asking help from the Manufacturer for a new microcode, feel free to close or not this GitHub issue.
Regards, CyrIng
@cyring can you comfirm you were testing with and without corefreqk being loaded? Just noticed something odd in that rdmsr works fine without the corefreqk module.
That isn't to say corefreq is at fault but something during the setup may be a trigger? May even be something stupid like xfr itself.
@cyring can you comfirm you were testing with and without corefreqk being loaded? Just noticed something odd in that rdmsr works fine without the corefreqk module.
That isn't to say corefreq is at fault but something during the setup may be a trigger? May even be something stupid like xfr itself.
It's very good hint !
I bet I have messed up with some registers.
MSR_K7_HWCR
is the register where I'm activating the Instruction counter:
https://github.com/cyring/CoreFreq/blob/933df839bf494061a9488bb600a2fff18bdf6731/corefreqk.c#L8544
Please post the output of this register as soon as counters are OK
rdmsr -aX 0xc0010015
Thank you
EDIT: there's a slight variation between EPYC and Ryzen: EPYC is a per CCD register
# rdmsr -aX 0xc0010015 | uniq -c
32 49000011
# modprobe -r corefreqk
# rdmsr -aX 0xc0010015 | uniq -c
32 9000011
#
Plot thickens: disabling C6 works around it too (CC6_Enable=0 PC6_Enable=0)
Early ryzen had issues with C6 but afaik those surfaced as hard lockups, I don't see any reference to other instabilities though I've long suspected there was other errata.
Do you confirm when the HWCR has the value 0x9000011
you are able to read good counters ?
If true, it means the bit 30 for IRPerfEn does not have to be set !
Correct though I'm assuming that bit should have flipped back on exit?
This is such a cluster fuck because even more odd though is the read doesn't change at all after.
# # lsmod |grep corefreq
#
# ./wtftest.sh 1
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
last:18568480 current:18568480 diff:0
# ./c6 enable
# ./wtftest.sh 1
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
last:0 current:0 diff:0
#
Do you confirm no other Zen software is changing registers on behalf of CoreFreq ?
Can you provide me an access to debug online ?
Holyshit that is some serious erata. No fix from them?? Yikes. Weird thing is it doesn't appear load related at all given it surfaces on cpu's that are IDLE. If it is overflowing doesn't that throw turbo completely out the window too? What a clusterfuck.
Except if disabling CC6 makes things better (see previous comment), feel free to close the issue. Cyril
AMD Errata
News
Global C-States
in this EPYC BIOS is fixing the bogus Instructions counter.
tbh Not sure if this is a bug with corefreq or the platform? Seems pretty random bouncing around different cores.