Closed Chlorophytus closed 1 year ago
- using 4 DIMMs of Team Group T-CREATE Expert DDR4-3600 memory. These are 32GB DIMMs
Hello, do you mean CoreFreq should list 4 sticks with the following 32 GB DIMM specs, for a total of 128 GB?
Please also provide:
corefreq-cli -s -n -m -n -k -n -B
@Chlorophytus : Hello,
As a non-regression test, can you please post the CoreFreq Memory Controller output of your Ryzen Threadripper 3960X using the latest version?
Hmmm, a 1.96.0 build still gives that bad row count. The text output is very large so I put it in a Gist.
./corefreq-cli -M
Zen UMC [1490]
Controller #0 Quad Channel
Bus Rate 1800 MHz Bus Speed 1799 MHz DDR4 Speed 3599 MT/s
Cha CL RCDr RCDw RP RAS RC RRDs RRDl FAW WTRs WTRl WR clRR clWW
#0 18 22 22 22 42 82 6 9 38 5 14 26 5 5
#1 18 22 22 22 42 82 6 9 38 5 14 26 5 5
#2 18 22 22 22 42 82 6 9 38 5 14 26 5 5
#3 18 22 22 22 42 82 6 9 38 5 14 26 5 5
CWL RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR drRR drWW drWR drRRD
#0 18 14 7 3 1 7 6 1 5 4 0 0 0 0
#1 18 14 8 3 1 7 6 1 5 4 0 0 0 0
#2 18 14 7 3 1 7 6 1 5 4 0 0 0 0
#3 18 14 8 3 1 7 6 1 5 4 0 0 0 0
REFI RFC1 RFC2 RFC4 RCPB RPPB BGS:Alt Ban Page CKE CMD GDM ECC
#0 14029 312 192 132 0 0 OFF ON R1W1 0 0 1T ON 0
#1 14029 312 192 132 0 0 OFF ON R1W1 0 0 1T ON 0
#2 14029 312 192 132 0 0 OFF ON R1W1 0 0 1T ON 0
#3 14029 312 192 132 0 0 OFF ON R1W1 0 0 1T ON 0
MRD:PDA MOD:PDA WRMPR STAG PDM RDDATA WRD WRL RDL XS XP CPDED
#0 8 18 27 27 24 255 0:F:0 13 2 13 26 1008 11 4
#1 8 18 27 27 24 255 0:F:0 13 2 13 26 1008 11 4
#2 8 18 27 27 24 255 0:F:0 13 2 13 26 1008 11 4
#3 8 18 27 27 24 255 0:F:0 13 2 13 26 1008 11 4
DIMM Geometry for channel #0
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 1 32768 1024 4096 TEAMGROUP-UD4-3600
DIMM Geometry for channel #1
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 1 32768 1024 4096 TEAMGROUP-UD4-3600
DIMM Geometry for channel #2
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 2 131072 1024 32768 TEAMGROUP-UD4-3600
DIMM Geometry for channel #3
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 2 131072 1024 32768 TEAMGROUP-UD4-3600
Forgot about the -M flag output in 1.96.0
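The geometry columns in the output above can be cross-checked against the reported size with simple arithmetic. This is a minimal sketch, not CoreFreq's own code: the function name is hypothetical, and the 8-byte width assumes a 64-bit DDR4 data bus.

```python
# Sketch: recompute DIMM capacity from the geometry columns printed by
# `corefreq-cli -M`. The function name is hypothetical and the bus width
# is an assumption (64-bit DDR4 data bus, i.e. 8 bytes per column access).
BUS_WIDTH_BYTES = 8


def dimm_size_mb(banks: int, ranks: int, rows: int, columns: int) -> int:
    """Return the capacity in MB implied by a bank/rank/row/column geometry."""
    total_bytes = banks * ranks * rows * columns * BUS_WIDTH_BYTES
    return total_bytes // (1024 * 1024)


# Channels #2 and #3 in the output: matches a 32 GB DIMM.
good = dimm_size_mb(banks=16, ranks=2, rows=131072, columns=1024)
# Channels #0 and #1 in the output: the misreported geometry.
bad = dimm_size_mb(banks=16, ranks=1, rows=32768, columns=1024)

print(good, bad)
```

Under these assumptions the rows for channels #2 and #3 work out to 32768 MB, while the rows for channels #0 and #1 work out to 4096 MB. The printed size is therefore consistent with the printed geometry in both cases, which suggests the rank and row counts themselves were decoded wrongly for those two channels.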
- using 4 DIMMs of Team Group T-CREATE Expert DDR4-3600 memory. These are 32GB DIMMs
Hello, do you mean CoreFreq should list 4 sticks with the following 32 GB DIMM specs, for a total of 128 GB?
EDIT: Those are the DIMMs I am using.
It was OK in wiki/AMD Ryzen Threadripper 3960X
Some changes have introduced a regression; I have not found the commit yet.
Going back through the implementation history, commit 33f10208cf0c8383a7e3e6ff9c01023339202815 gave a correct topology with your Threadripper: can you please check again?
EDIT: this versions diff
Also, please check whether the issue persists across several corefreqk.ko startups.
I suspect the registers may return different values between the two Threadripper dies.
You can force the UMC query to bind to a core, for example CPU number 0, with the following parameter:
insmod corefreqk.ko ServiceProcessor=0
Setting the ServiceProcessor number doesn't seem to fix the row count.
EDIT: Let me see if that commit you mentioned fixes it.
Using that diff gives me an error.
make
make -j1 -C /lib/modules/6.2.8-1-default/build M=/home/accelshark/Documents/CoreFreq modules
CC [M] /home/accelshark/Documents/CoreFreq/corefreqk.o
/home/accelshark/Documents/CoreFreq/corefreqk.c: In function ‘CoreFreqK_Create_Device_Level_Up’:
/home/accelshark/Documents/CoreFreq/corefreqk.c:21181:35: error: assignment to ‘char * (*)(const struct device *, umode_t *)’ {aka ‘char * (*)(const struct device *, short unsigned int *)’} from incompatible pointer type ‘char * (*)(struct device *, umode_t *)’ {aka ‘char * (*)(struct device *, short unsigned int *)’} [-Werror=incompatible-pointer-types]
21181 | CoreFreqK.clsdev->devnode = CoreFreqK_DevNode;
| ^
compilation terminated due to -Wfatal-errors.
cc1: some warnings being treated as errors
make[2]: *** [/usr/src/linux-6.2.8-1/scripts/Makefile.build:253: /home/accelshark/Documents/CoreFreq/corefreqk.o] Error 1
make[1]: *** [../../../linux-6.2.8-1/Makefile:2036: /home/accelshark/Documents/CoreFreq] Error 2
make: *** [Makefile:85: all] Error 2
Using that diff gives me an error.
Perhaps use tags: it should be around 1.91.5
With kernel 6, it is now difficult to go backward. Thus I would suggest trying in reverse order, like 1.94.# ... 1.93.# ... 1.92.# ... 1.91.#
I'm aware it's a lot of testing but so far, despite a code review, I have no clue where a change has impacted the decoder. EPYC Rome, with the same cpuid as Castle Peak, should also be affected: I no longer have access to that hardware.
I've bisected and done testing on a relatively big Git repository to find a bug before, I will do it in a day or so.
Used 1.94.1 and the bug is there. The earlier version tags (1.93.# and earlier) do not compile on openSUSE Linux's kernel 6.2.
EDIT: I might be able to use git blame to determine what changed in checking memory information, but I don't know where to look in the code.
The change happened while I was fixing code to support DDR5, introduced with Zen4.
Can you post the UMC decoding from zencli?
zencli umc
Data Fabric: scanning UMC @ BAR[0x00050000] : 0 1 2 3 4 5 6 7 for 4 Channels
CHA[0] CHIP[0:0] @ 0x00250000[0x00000000] Disable, Rank=0
CHA[0] MASK[0:0] @ 0x00250020[0x00000000]
CHA[0] CHIP[0:1] @ 0x00250010[0x00000000] Disable, Rank=0
CHA[0] MASK[0:1] @ 0x00250028[0x00000000]
CHA[0] CHIP[1:0] @ 0x00250004[0x00000000] Disable, Rank=0
CHA[0] MASK[1:0] @ 0x00250020[0x00000000]
CHA[0] CHIP[1:1] @ 0x00250014[0x00000000] Disable, Rank=0
CHA[0] MASK[1:1] @ 0x00250028[0x00000000]
CHA[0] CHIP[2:0] @ 0x00250008[0x00000001] Enable, Rank=2
CHA[0] MASK[2:0] @ 0x00250024[0x07fffdfe] ChipSize[16777216]
CHA[0] CHIP[2:1] @ 0x00250018[0x00000000] Disable, Rank=0
CHA[0] MASK[2:1] @ 0x0025002c[0x00000000]
CHA[0] CHIP[3:0] @ 0x0025000c[0x00000201] Enable, Rank=2
CHA[0] MASK[3:0] @ 0x00250024[0x07fffdfe] ChipSize[16777216]
CHA[0] CHIP[3:1] @ 0x00250018[0x00000000] Disable, Rank=0
CHA[0] MASK[3:1] @ 0x0025002c[0x00000000]
DIMM Size[33554432 KB] [32768 MB]
CHA[1] CHIP[0:0] @ 0x00350000[0x00000000] Disable, Rank=0
CHA[1] MASK[0:0] @ 0x00350020[0x00000000]
CHA[1] CHIP[0:1] @ 0x00350010[0x00000000] Disable, Rank=0
CHA[1] MASK[0:1] @ 0x00350028[0x00000000]
CHA[1] CHIP[1:0] @ 0x00350004[0x00000000] Disable, Rank=0
CHA[1] MASK[1:0] @ 0x00350020[0x00000000]
CHA[1] CHIP[1:1] @ 0x00350014[0x00000000] Disable, Rank=0
CHA[1] MASK[1:1] @ 0x00350028[0x00000000]
CHA[1] CHIP[2:0] @ 0x00350008[0x00000001] Enable, Rank=2
CHA[1] MASK[2:0] @ 0x00350024[0x07fffdfe] ChipSize[16777216]
CHA[1] CHIP[2:1] @ 0x00350018[0x00000000] Disable, Rank=0
CHA[1] MASK[2:1] @ 0x0035002c[0x00000000]
CHA[1] CHIP[3:0] @ 0x0035000c[0x00000201] Enable, Rank=2
CHA[1] MASK[3:0] @ 0x00350024[0x07fffdfe] ChipSize[16777216]
CHA[1] CHIP[3:1] @ 0x00350018[0x00000000] Disable, Rank=0
CHA[1] MASK[3:1] @ 0x0035002c[0x00000000]
DIMM Size[33554432 KB] [32768 MB]
CHA[2] CHIP[0:0] @ 0x00450000[0x00000000] Disable, Rank=0
CHA[2] MASK[0:0] @ 0x00450020[0x00000000]
CHA[2] CHIP[0:1] @ 0x00450010[0x00000000] Disable, Rank=0
CHA[2] MASK[0:1] @ 0x00450028[0x00000000]
CHA[2] CHIP[1:0] @ 0x00450004[0x00000000] Disable, Rank=0
CHA[2] MASK[1:0] @ 0x00450020[0x00000000]
CHA[2] CHIP[1:1] @ 0x00450014[0x00000000] Disable, Rank=0
CHA[2] MASK[1:1] @ 0x00450028[0x00000000]
CHA[2] CHIP[2:0] @ 0x00450008[0x00000001] Enable, Rank=2
CHA[2] MASK[2:0] @ 0x00450024[0x07fffdfe] ChipSize[16777216]
CHA[2] CHIP[2:1] @ 0x00450018[0x00000000] Disable, Rank=0
CHA[2] MASK[2:1] @ 0x0045002c[0x00000000]
CHA[2] CHIP[3:0] @ 0x0045000c[0x00000201] Enable, Rank=2
CHA[2] MASK[3:0] @ 0x00450024[0x07fffdfe] ChipSize[16777216]
CHA[2] CHIP[3:1] @ 0x00450018[0x00000000] Disable, Rank=0
CHA[2] MASK[3:1] @ 0x0045002c[0x00000000]
DIMM Size[33554432 KB] [32768 MB]
CHA[3] CHIP[0:0] @ 0x00550000[0x00000000] Disable, Rank=0
CHA[3] MASK[0:0] @ 0x00550020[0x00000000]
CHA[3] CHIP[0:1] @ 0x00550010[0x00000000] Disable, Rank=0
CHA[3] MASK[0:1] @ 0x00550028[0x00000000]
CHA[3] CHIP[1:0] @ 0x00550004[0x00000000] Disable, Rank=0
CHA[3] MASK[1:0] @ 0x00550020[0x00000000]
CHA[3] CHIP[1:1] @ 0x00550014[0x00000000] Disable, Rank=0
CHA[3] MASK[1:1] @ 0x00550028[0x00000000]
CHA[3] CHIP[2:0] @ 0x00550008[0x00000001] Enable, Rank=2
CHA[3] MASK[2:0] @ 0x00550024[0x07fffdfe] ChipSize[16777216]
CHA[3] CHIP[2:1] @ 0x00550018[0x00000000] Disable, Rank=0
CHA[3] MASK[2:1] @ 0x0055002c[0x00000000]
CHA[3] CHIP[3:0] @ 0x0055000c[0x00000201] Enable, Rank=2
CHA[3] MASK[3:0] @ 0x00550024[0x07fffdfe] ChipSize[16777216]
CHA[3] CHIP[3:1] @ 0x00550018[0x00000000] Disable, Rank=0
CHA[3] MASK[3:1] @ 0x0055002c[0x00000000]
DIMM Size[33554432 KB] [32768 MB]
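The ChipSize values in the dump above can be reproduced from the MASK registers. The sketch below is modeled on how the Linux amd64_edac driver handles Zen address masks, as far as I understand it. It is an assumption that zencli or CoreFreq computes the size the same way, and the function name is hypothetical.

```python
# Sketch: derive a chip-select size in KB from a Zen UMC address mask.
# Assumption: mask register bits [31:1] map to address bits [39:9], and any
# zero bit below the top set bit is an interleave bit that must be squeezed
# out before computing the size. Not taken from the zencli source.
def chip_select_size_kb(addr_mask: int) -> int:
    if addr_mask == 0:
        return 0  # chip select disabled
    msb = addr_mask.bit_length() - 1       # highest set bit
    weight = bin(addr_mask).count("1")     # number of set bits
    num_zero_bits = msb - weight           # holes within bits [msb:1]
    top = msb - num_zero_bits
    deinterleaved = ((1 << (top + 1)) - 1) & ~1   # contiguous bits [top:1]
    return (deinterleaved >> 2) + 1


mask = 0x07FFFDFE                   # MASK[2:0] and MASK[3:0] in the dump
chip_kb = chip_select_size_kb(mask)
dimm_kb = 2 * chip_kb               # two enabled chip selects per channel
print(chip_kb, dimm_kb, dimm_kb // 1024)
```

With the mask value from the dump this gives 16777216 KB per chip select and 33554432 KB (32768 MB) per DIMM, matching the ChipSize and DIMM Size lines printed by zencli.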
Hello,
A fix is available. Can you please pull master and post the UMC output?
The first two channels don't show up anymore.
$ ./corefreq-cli -M
Zen UMC [1490]
Controller #0 Quad Channel
Bus Rate 1800 MHz Bus Speed 1799 MHz DDR4 Speed 3599 MT/s
Cha CL RCDr RCDw RP RAS RC RRDs RRDl FAW WTRs WTRl WR clRR clWW
#0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#2 18 22 22 22 42 82 6 9 38 5 14 26 5 5
#3 18 22 22 22 42 82 6 9 38 5 14 26 5 5
CWL RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR drRR drWW drWR drRRD
#0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#2 18 14 7 3 1 7 6 1 5 4 0 0 0 0
#3 18 14 8 3 1 7 6 1 5 4 0 0 0 0
REFI RFC1 RFC2 RFC4 RCPB RPPB BGS:Alt Ban Page CKE CMD GDM ECC
#0 0 0 0 0 0 0 ON OFF R0W0 0 0 1T OFF 0
#1 0 0 0 0 0 0 ON OFF R0W0 0 0 1T OFF 0
#2 14029 312 192 132 0 0 OFF ON R1W1 0 0 1T ON 0
#3 14029 312 192 132 0 0 OFF ON R1W1 0 0 1T ON 0
MRD:PDA MOD:PDA WRMPR STAG PDM RDDATA WRD WRL RDL XS XP CPDED
#0 0 0 0 0 0 0 0:F:0 0 0 0 0 0 0 0
#1 0 0 0 0 0 0 0:F:0 0 0 0 0 0 0 0
#2 8 18 27 27 24 255 0:F:0 13 2 13 26 1008 11 4
#3 8 18 27 27 24 255 0:F:0 13 2 13 26 1008 11 4
DIMM Geometry for channel #0
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1
DIMM Geometry for channel #1
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1
DIMM Geometry for channel #2
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 2 131072 1024 32768 TEAMGROUP-UD4-3600
DIMM Geometry for channel #3
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 2 131072 1024 32768 TEAMGROUP-UD4-3600
The first two channels don't show up anymore.
I see what's going on: on your TR the first enabled UMC is not at channel 0 but at channel 2. It is still quad channel. Thus I have to shift the whole topology so that it is based at channel zero, without introducing regressions on the other Zen and EPYC architectures tested so far.
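The channel shift can be read off the register addresses in the zencli dump earlier in the thread. The sketch below is an illustration, not the CoreFreq fix: the 0x100000 stride between UMC instances is inferred from the consecutive CHA addresses in that dump, and the helper name is hypothetical.

```python
# Sketch: recover the hardware UMC index behind each logical channel from
# the register addresses zencli printed. Assumptions: the base is the
# BAR[0x00050000] value from the dump, and UMC instances are spaced
# 0x100000 apart (inferred from 0x00250000, 0x00350000, ...).
UMC_SMN_BASE = 0x00050000
UMC_STRIDE = 0x00100000


def umc_index(register_addr: int) -> int:
    """Return the UMC instance number that owns a register address."""
    return (register_addr - UMC_SMN_BASE) // UMC_STRIDE


# Address of CHIP[0:0] for CHA[0]..CHA[3] in the dump.
first_chip_regs = [0x00250000, 0x00350000, 0x00450000, 0x00550000]
channel_to_umc = {cha: umc_index(addr) for cha, addr in enumerate(first_chip_regs)}
first_enabled = min(channel_to_umc.values())

print(channel_to_umc, first_enabled)
```

Under these assumptions the four populated channels sit on UMC instances 2 through 5, so a decoder that walks instances 0 through 3 would see two disabled controllers followed by two valid ones, which matches the output with channels #0 and #1 empty.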
I wonder if that has to do with WRX80 Threadripper PRO CPUs having 8 memory channels. I can't test that though.
I found that my queries were not based on the right UMC base address register.
I have force-pushed a new change to master. Can you give it a try?
You fixed it.
Think I'll close this. I'm not sure what else needs to be done and this has a "solved" tag.
Reopen this if anything comes up.
Hello,
Could you please try the latest version and check, in particular, the Memory Controller output for any regression?
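A check like this could be partly automated by parsing the DIMM Geometry section of the `corefreq-cli -M` text. The following is a rough sketch under stated assumptions: the function names are hypothetical, the sample text is copied from outputs posted earlier in this thread, and the expected size of 32768 MB applies to this particular machine only.

```python
import re

# Sample taken from the outputs posted earlier in this thread:
# channel #1 shows the bad geometry, channel #2 the correct one.
SAMPLE = """\
DIMM Geometry for channel #1
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 1 32768 1024 4096 TEAMGROUP-UD4-3600
DIMM Geometry for channel #2
Slot Bank Rank Rows Columns Memory Size (MB)
#0
#1 16 2 131072 1024 32768 TEAMGROUP-UD4-3600
"""

HEADER = re.compile(r"^DIMM Geometry for channel #(\d+)$")
# Slot row: slot, bank, rank, rows, columns, size, then the part name.
SLOT = re.compile(r"^#(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)$")


def parse_geometry(text):
    """Map (channel, slot) -> (banks, ranks, rows, columns, size_mb)."""
    dimms = {}
    channel = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            channel = int(header.group(1))
            continue
        slot_row = SLOT.match(line)
        if slot_row and channel is not None:
            slot, *geometry = (int(g) for g in slot_row.groups()[:6])
            dimms[(channel, slot)] = tuple(geometry)
    return dimms


def find_mismatches(dimms, expected_mb=32768):
    """Return the (channel, slot) pairs whose size differs from expected_mb."""
    return sorted(key for key, geo in dimms.items() if geo[4] != expected_mb)


print(find_mismatches(parse_geometry(SAMPLE)))
```

Empty slot rows and the timing tables do not match the slot pattern, so only populated DIMM rows are collected.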
Sure, which commit?
Just the latest, please.
@cyring The memory controller seems to be outputting the correct results.
Great, thanks a lot for your answer.
I have restarted the computer but the bad row count is still present.