cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

[Solved] TRX40/Zen2: bad row count on DIMM channels 0 and 1 #430

Closed Chlorophytus closed 1 year ago

Chlorophytus commented 1 year ago

$ corefreq-cli -M
                              Zen UMC  [1490]                              
Controller #0                                                Quad Channel  
 Bus Rate  1800 MHz       Bus Speed 1799 MHz           DDR4 Speed 3599 MT/s

 Cha   CL  RCDr RCDw  RP  RAS   RC  RRDs RRDl FAW  WTRs WTRl  WR  clRR clWW
  #0   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #1   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #2   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #3   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
      CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR drRR drWW drWR drRRD
  #0   18   14    7    3    1    7    6    1    5    4    0    0    0    0 
  #1   18   14    8    3    1    7    6    1    5    4    0    0    0    0 
  #2   18   14    7    3    1    7    6    1    5    4    0    0    0    0 
  #3   18   14    8    3    1    7    6    1    5    4    0    0    0    0 
      REFI RFC1 RFC2 RFC4 RCPB RPPB  BGS:Alt  Ban  Page  CKE  CMD  GDM  ECC
  #0 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #1 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #2 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #3 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
      MRD:PDA   MOD:PDA  WRMPR STAG PDM RDDATA WRD  WRL  RDL  XS   XP CPDED
  #0    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #1    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #2    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #3    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 

 DIMM Geometry for channel #0                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    1     32768      1024           4096  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #1                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    1     32768      1024           4096  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #2                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    2    131072      1024          32768  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #3                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    2    131072      1024          32768  TEAMGROUP-UD4-3600

I have restarted the computer but the bad row count is still present.
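
As a sanity check on the table above, each reported size follows from the reported geometry, which is why a wrong row or rank count also yields a wrong size. The sketch below assumes a 64-bit data bus (8 bytes per column address); `dimm_size_mb` is an illustrative helper, not a CoreFreq function.

```python
def dimm_size_mb(banks: int, ranks: int, rows: int, columns: int,
                 bus_bytes: int = 8) -> int:
    """Return the DIMM capacity in MB implied by its addressing geometry.

    bus_bytes=8 is an assumption: a 64-bit data bus per channel.
    """
    total_bytes = banks * ranks * rows * columns * bus_bytes
    return total_bytes // (1024 * 1024)


# Geometry values copied from the `corefreq-cli -M` output above.
reported = {
    0: dict(banks=16, ranks=1, rows=32768, columns=1024),
    1: dict(banks=16, ranks=1, rows=32768, columns=1024),
    2: dict(banks=16, ranks=2, rows=131072, columns=1024),
    3: dict(banks=16, ranks=2, rows=131072, columns=1024),
}

for channel, geometry in reported.items():
    print(f"channel #{channel}: {dimm_size_mb(**geometry)} MB")
```

Under that assumption, channels #0 and #1 work out to 4096 MB and channels #2 and #3 to 32768 MB, matching the sizes printed by CoreFreq.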

cyring commented 1 year ago
  • using 4 DIMMs of Team Group T-CREATE Expert DDR4-3600 memory. These are 32GB DIMMs

Hello, do you mean CoreFreq should list 4 sticks with the following 32 GB DIMM specs, for a total of 128 GB?

TTCED464G3600HC18JDC01

Please also provide the output of:

corefreq-cli -s -n -m -n -k -n -B

cyring commented 1 year ago

@Chlorophytus : Hello,

As a non-regression test, could you please post the CoreFreq Memory Controller output of your Ryzen Threadripper 3960X using the latest version?

Chlorophytus commented 1 year ago

Hmm, a 1.96.0 build still gives that bad row count. The text output is very large, so I put it in a Gist.

https://gist.githubusercontent.com/Chlorophytus/98a9c9c311b3e6e14a08b787b5dcf44e/raw/6b5a95bb3f2021c4afbe45dbc12ab660cd15e5d9/corefreq-1.96.0-output.txt

Chlorophytus commented 1 year ago

$ ./corefreq-cli -M
                              Zen UMC  [1490]                              
Controller #0                                                Quad Channel  
 Bus Rate  1800 MHz       Bus Speed 1799 MHz           DDR4 Speed 3599 MT/s

 Cha   CL  RCDr RCDw  RP  RAS   RC  RRDs RRDl FAW  WTRs WTRl  WR  clRR clWW
  #0   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #1   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #2   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #3   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
      CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR drRR drWW drWR drRRD
  #0   18   14    7    3    1    7    6    1    5    4    0    0    0    0 
  #1   18   14    8    3    1    7    6    1    5    4    0    0    0    0 
  #2   18   14    7    3    1    7    6    1    5    4    0    0    0    0 
  #3   18   14    8    3    1    7    6    1    5    4    0    0    0    0 
      REFI RFC1 RFC2 RFC4 RCPB RPPB  BGS:Alt  Ban  Page  CKE  CMD  GDM  ECC
  #0 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #1 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #2 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #3 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
      MRD:PDA   MOD:PDA  WRMPR STAG PDM RDDATA WRD  WRL  RDL  XS   XP CPDED
  #0    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #1    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #2    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #3    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 

 DIMM Geometry for channel #0                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    1     32768      1024           4096  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #1                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    1     32768      1024           4096  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #2                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    2    131072      1024          32768  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #3                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    2    131072      1024          32768  TEAMGROUP-UD4-3600

I forgot to include the -M flag output from 1.96.0, so it is posted above.

  • using 4 DIMMs of Team Group T-CREATE Expert DDR4-3600 memory. These are 32GB DIMMs

Hello, do you mean CoreFreq should list 4 sticks with the following 32 GB DIMM specs, for a total of 128 GB?

TTCED464G3600HC18JDC01

Please also provide the output of:

corefreq-cli -s -n -m -n -k -n -B

EDIT: Those are the DIMMs I am using.

cyring commented 1 year ago

It was OK in wiki/AMD Ryzen Threadripper 3960X. Some changes have introduced a regression; I have not found the commit yet.

cyring commented 1 year ago

Going back through the implementation history, commit 33f10208cf0c8383a7e3e6ff9c01023339202815 gave a correct topology with your Threadripper. Could you please check it again?

EDIT: this versions diff

Also, please check whether the issue persists across several corefreqk.ko startups, because I suspect the registers may return different values on the two Threadripper dies. You can force the UMC query to be bound to a core, for example CPU number 0, with the following parameter:

insmod corefreqk.ko ServiceProcessor=0

Chlorophytus commented 1 year ago

Setting the ServiceProcessor number doesn't seem to fix the row count.

EDIT: Let me see if that commit you mentioned fixes it.

Chlorophytus commented 1 year ago

Using that diff gives me an error.

make 
make -j1 -C /lib/modules/6.2.8-1-default/build M=/home/accelshark/Documents/CoreFreq modules
  CC [M]  /home/accelshark/Documents/CoreFreq/corefreqk.o
/home/accelshark/Documents/CoreFreq/corefreqk.c: In function ‘CoreFreqK_Create_Device_Level_Up’:
/home/accelshark/Documents/CoreFreq/corefreqk.c:21181:35: error: assignment to ‘char * (*)(const struct device *, umode_t *)’ {aka ‘char * (*)(const struct device *, short unsigned int *)’} from incompatible pointer type ‘char * (*)(struct device *, umode_t *)’ {aka ‘char * (*)(struct device *, short unsigned int *)’} [-Werror=incompatible-pointer-types]
21181 |         CoreFreqK.clsdev->devnode = CoreFreqK_DevNode;
      |                                   ^
compilation terminated due to -Wfatal-errors.
cc1: some warnings being treated as errors
make[2]: *** [/usr/src/linux-6.2.8-1/scripts/Makefile.build:253: /home/accelshark/Documents/CoreFreq/corefreqk.o] Error 1
make[1]: *** [../../../linux-6.2.8-1/Makefile:2036: /home/accelshark/Documents/CoreFreq] Error 2
make: *** [Makefile:85: all] Error 2
cyring commented 1 year ago

Perhaps try using tags instead: the working version should be around 1.91.5.

With kernel 6, it is now difficult to go backward, so I would suggest testing in reverse order: 1.94.#, then 1.93.#, 1.92.#, and 1.91.#.

I'm aware it's a lot of testing, but so far, despite a code review, I have no clue which change affected the decoder. EPYC Rome, which has the same CPUID as Castle Peak, should also be affected, but I no longer have access to that hardware.
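
The newest-to-oldest walk over release tags suggested above can be sketched as follows. `version_key`, `newest_good_tag` and the `is_buggy` callback are hypothetical names; in practice the callback would build the tag, load the module and inspect the `corefreq-cli -M` output. The tags listed are only those mentioned in this thread, and the example predicate is a placeholder, not a test result.

```python
from typing import Callable, Iterable, Optional, Tuple


def version_key(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as '1.94.1' into (1, 94, 1) for numeric ordering."""
    return tuple(int(part) for part in tag.split("."))


def newest_good_tag(tags: Iterable[str],
                    is_buggy: Callable[[str], bool]) -> Optional[str]:
    """Walk tags from newest to oldest; return the first one without the bug."""
    for tag in sorted(tags, key=version_key, reverse=True):
        if not is_buggy(tag):
            return tag
    return None  # every tested tag shows the bug


# Tags mentioned in this thread, in arbitrary order.
tags = ["1.91.5", "1.96.0", "1.94.1"]

# Placeholder predicate (assumption): pretend everything after 1.91.5 is buggy.
print(newest_good_tag(tags, lambda t: version_key(t) > (1, 91, 5)))
```

Sorting on the numeric tuple rather than the raw string keeps a tag like 1.100.0 from sorting before 1.94.1.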

Chlorophytus commented 1 year ago

I've bisected a relatively big Git repository to find a bug before; I will do it in a day or so.

Chlorophytus commented 1 year ago

Used 1.94.1 and the bug is there. The earlier version tags (1.93.# and earlier) do not compile on openSUSE Linux's kernel 6.2.

EDIT: I might be able to use git blame to determine what changed in checking memory information, but I don't know where to look in the code.

cyring commented 1 year ago

The change happened while I was reworking the code to support DDR5, introduced with Zen4.

Can you post the UMC decoding from zencli?

zencli umc

Chlorophytus commented 1 year ago

zencli umc
Data Fabric: scanning UMC @ BAR[0x00050000] : 0 1 2 3 4 5 6 7 for 4 Channels

CHA[0] CHIP[0:0] @ 0x00250000[0x00000000] Disable, Rank=0
CHA[0] MASK[0:0] @ 0x00250020[0x00000000]
CHA[0] CHIP[0:1] @ 0x00250010[0x00000000] Disable, Rank=0
CHA[0] MASK[0:1] @ 0x00250028[0x00000000]
CHA[0] CHIP[1:0] @ 0x00250004[0x00000000] Disable, Rank=0
CHA[0] MASK[1:0] @ 0x00250020[0x00000000]
CHA[0] CHIP[1:1] @ 0x00250014[0x00000000] Disable, Rank=0
CHA[0] MASK[1:1] @ 0x00250028[0x00000000]
CHA[0] CHIP[2:0] @ 0x00250008[0x00000001] Enable, Rank=2
CHA[0] MASK[2:0] @ 0x00250024[0x07fffdfe] ChipSize[16777216]
CHA[0] CHIP[2:1] @ 0x00250018[0x00000000] Disable, Rank=0
CHA[0] MASK[2:1] @ 0x0025002c[0x00000000]
CHA[0] CHIP[3:0] @ 0x0025000c[0x00000201] Enable, Rank=2
CHA[0] MASK[3:0] @ 0x00250024[0x07fffdfe] ChipSize[16777216]
CHA[0] CHIP[3:1] @ 0x00250018[0x00000000] Disable, Rank=0
CHA[0] MASK[3:1] @ 0x0025002c[0x00000000]

DIMM Size[33554432 KB] [32768 MB]

CHA[1] CHIP[0:0] @ 0x00350000[0x00000000] Disable, Rank=0
CHA[1] MASK[0:0] @ 0x00350020[0x00000000]
CHA[1] CHIP[0:1] @ 0x00350010[0x00000000] Disable, Rank=0
CHA[1] MASK[0:1] @ 0x00350028[0x00000000]
CHA[1] CHIP[1:0] @ 0x00350004[0x00000000] Disable, Rank=0
CHA[1] MASK[1:0] @ 0x00350020[0x00000000]
CHA[1] CHIP[1:1] @ 0x00350014[0x00000000] Disable, Rank=0
CHA[1] MASK[1:1] @ 0x00350028[0x00000000]
CHA[1] CHIP[2:0] @ 0x00350008[0x00000001] Enable, Rank=2
CHA[1] MASK[2:0] @ 0x00350024[0x07fffdfe] ChipSize[16777216]
CHA[1] CHIP[2:1] @ 0x00350018[0x00000000] Disable, Rank=0
CHA[1] MASK[2:1] @ 0x0035002c[0x00000000]
CHA[1] CHIP[3:0] @ 0x0035000c[0x00000201] Enable, Rank=2
CHA[1] MASK[3:0] @ 0x00350024[0x07fffdfe] ChipSize[16777216]
CHA[1] CHIP[3:1] @ 0x00350018[0x00000000] Disable, Rank=0
CHA[1] MASK[3:1] @ 0x0035002c[0x00000000]

DIMM Size[33554432 KB] [32768 MB]

CHA[2] CHIP[0:0] @ 0x00450000[0x00000000] Disable, Rank=0
CHA[2] MASK[0:0] @ 0x00450020[0x00000000]
CHA[2] CHIP[0:1] @ 0x00450010[0x00000000] Disable, Rank=0
CHA[2] MASK[0:1] @ 0x00450028[0x00000000]
CHA[2] CHIP[1:0] @ 0x00450004[0x00000000] Disable, Rank=0
CHA[2] MASK[1:0] @ 0x00450020[0x00000000]
CHA[2] CHIP[1:1] @ 0x00450014[0x00000000] Disable, Rank=0
CHA[2] MASK[1:1] @ 0x00450028[0x00000000]
CHA[2] CHIP[2:0] @ 0x00450008[0x00000001] Enable, Rank=2
CHA[2] MASK[2:0] @ 0x00450024[0x07fffdfe] ChipSize[16777216]
CHA[2] CHIP[2:1] @ 0x00450018[0x00000000] Disable, Rank=0
CHA[2] MASK[2:1] @ 0x0045002c[0x00000000]
CHA[2] CHIP[3:0] @ 0x0045000c[0x00000201] Enable, Rank=2
CHA[2] MASK[3:0] @ 0x00450024[0x07fffdfe] ChipSize[16777216]
CHA[2] CHIP[3:1] @ 0x00450018[0x00000000] Disable, Rank=0
CHA[2] MASK[3:1] @ 0x0045002c[0x00000000]

DIMM Size[33554432 KB] [32768 MB]

CHA[3] CHIP[0:0] @ 0x00550000[0x00000000] Disable, Rank=0
CHA[3] MASK[0:0] @ 0x00550020[0x00000000]
CHA[3] CHIP[0:1] @ 0x00550010[0x00000000] Disable, Rank=0
CHA[3] MASK[0:1] @ 0x00550028[0x00000000]
CHA[3] CHIP[1:0] @ 0x00550004[0x00000000] Disable, Rank=0
CHA[3] MASK[1:0] @ 0x00550020[0x00000000]
CHA[3] CHIP[1:1] @ 0x00550014[0x00000000] Disable, Rank=0
CHA[3] MASK[1:1] @ 0x00550028[0x00000000]
CHA[3] CHIP[2:0] @ 0x00550008[0x00000001] Enable, Rank=2
CHA[3] MASK[2:0] @ 0x00550024[0x07fffdfe] ChipSize[16777216]
CHA[3] CHIP[2:1] @ 0x00550018[0x00000000] Disable, Rank=0
CHA[3] MASK[2:1] @ 0x0055002c[0x00000000]
CHA[3] CHIP[3:0] @ 0x0055000c[0x00000201] Enable, Rank=2
CHA[3] MASK[3:0] @ 0x00550024[0x07fffdfe] ChipSize[16777216]
CHA[3] CHIP[3:1] @ 0x00550018[0x00000000] Disable, Rank=0
CHA[3] MASK[3:1] @ 0x0055002c[0x00000000]

DIMM Size[33554432 KB] [32768 MB]
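
The `ChipSize` and `DIMM Size` figures in the output above can be reproduced from the `MASK` values. The sketch below rests on two assumptions that are not confirmed by this thread: that mask register bits [31:1] correspond to address bits [39:9], and that a cleared bit inside the mask (here bit 9 of `0x07fffdfe`) is an interleave bit that does not reduce the size. `chip_select_size_kb` is an illustrative name and not necessarily how zencli or CoreFreq compute it.

```python
def chip_select_size_kb(addr_mask: int) -> int:
    """Estimate a chip-select size in KB from a UMC address mask value."""
    if addr_mask == 0:
        return 0  # chip select not populated
    msb = addr_mask.bit_length() - 1      # position of the highest set bit
    weight = bin(addr_mask).count("1")    # number of set bits
    zero_bits = msb - weight              # cleared bits within [msb:1]
    # Rebuild a contiguous mask with the zero bits taken off the top.
    deinterleaved = ((1 << (msb - zero_bits + 1)) - 1) & ~1
    # Register bit 1 is address bit 9, so (mask >> 2) + 1 is the size in KB.
    return (deinterleaved >> 2) + 1


# Masks of the two enabled chip selects per channel, CHIP[2:0] and CHIP[3:0].
enabled_masks = [0x07FFFDFE, 0x07FFFDFE]

dimm_kb = sum(chip_select_size_kb(mask) for mask in enabled_masks)
print(f"ChipSize[{chip_select_size_kb(enabled_masks[0])}]")
print(f"DIMM Size[{dimm_kb} KB] [{dimm_kb // 1024} MB]")
```

With these inputs, each chip select comes to 16777216 KB and the two together to 33554432 KB (32768 MB), which agrees with the zencli lines for all four channels.
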
cyring commented 1 year ago

Hello,

A fix is available. Could you please pull master and post the UMC output?

Chlorophytus commented 1 year ago

The first two channels don't show up anymore.

$ ./corefreq-cli -M
                              Zen UMC  [1490]                              
Controller #0                                                Quad Channel  
 Bus Rate  1800 MHz       Bus Speed 1799 MHz           DDR4 Speed 3599 MT/s

 Cha   CL  RCDr RCDw  RP  RAS   RC  RRDs RRDl FAW  WTRs WTRl  WR  clRR clWW
  #0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  #1    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  #2   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
  #3   18   22   22   22   42   82    6    9   38    5   14   26    5    5 
      CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR drRR drWW drWR drRRD
  #0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  #1    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  #2   18   14    7    3    1    7    6    1    5    4    0    0    0    0 
  #3   18   14    8    3    1    7    6    1    5    4    0    0    0    0 
      REFI RFC1 RFC2 RFC4 RCPB RPPB  BGS:Alt  Ban  Page  CKE  CMD  GDM  ECC
  #0     0    0    0    0   0    0    ON OFF  R0W0   0    0   1T   OFF   0 
  #1     0    0    0    0   0    0    ON OFF  R0W0   0    0   1T   OFF   0 
  #2 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
  #3 14029  312  192  132   0    0   OFF  ON  R1W1   0    0   1T    ON   0 
      MRD:PDA   MOD:PDA  WRMPR STAG PDM RDDATA WRD  WRL  RDL  XS   XP CPDED
  #0    0  0      0  0      0    0 0:F:0    0   0    0    0    0    0    0 
  #1    0  0      0  0      0    0 0:F:0    0   0    0    0    0    0    0 
  #2    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 
  #3    8  18    27  27    24  255 0:F:0   13   2   13   26 1008   11    4 

 DIMM Geometry for channel #0                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1                                                                  
 DIMM Geometry for channel #1                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1                                                                  
 DIMM Geometry for channel #2                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    2    131072      1024          32768  TEAMGROUP-UD4-3600
 DIMM Geometry for channel #3                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1    16    2    131072      1024          32768  TEAMGROUP-UD4-3600
cyring commented 1 year ago

I see what's going on: on your Threadripper, the first enabled UMC is not at channel 0 but at channel 2. It is still quad-channel. I therefore have to shift the whole topology so that it is based on channel zero, without introducing regressions on the other Zen and EPYC architectures tested so far.
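
The shift described here can be illustrated with the addresses from the zencli output earlier in the thread: the scan starts at BAR `0x00050000`, and the four populated controllers answer at `0x00250000` through `0x00550000`. The sketch below assumes a stride of `0x00100000` per UMC (inferred from those addresses) and that the enabled UMC indexes are therefore 2 to 5. `umc_base` and `map_channels` are hypothetical helpers, not the actual CoreFreq change.

```python
from typing import Dict, Iterable

UMC_BAR = 0x00050000     # from the "scanning UMC @ BAR[0x00050000]" line
UMC_STRIDE = 0x00100000  # assumption: inferred from 0x00250000, 0x00350000, ...


def umc_base(umc_index: int) -> int:
    """Register base address of the UMC at the given physical index."""
    return UMC_BAR + umc_index * UMC_STRIDE


def map_channels(enabled_umcs: Iterable[int]) -> Dict[int, int]:
    """Map logical channels 0..N-1 onto the enabled UMC base addresses."""
    return {channel: umc_base(index)
            for channel, index in enumerate(sorted(enabled_umcs))}


# Assumption: UMC indexes 2..5 are the enabled ones on this Threadripper.
for channel, base in map_channels([2, 3, 4, 5]).items():
    print(f"channel #{channel} -> UMC base 0x{base:08x}")
```

Numbering the logical channels from zero over only the enabled controllers keeps the display at channels #0 to #3 while each query still targets the right physical UMC.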

Chlorophytus commented 1 year ago

I wonder if that has to do with WRX80 Threadripper PRO CPUs having 8 memory channels. I can't test that though.

cyring commented 1 year ago

I found that my queries were not based on the right UMC base address register.

I have force-pushed a new change to master. Can you give it a try?

Chlorophytus commented 1 year ago

You fixed it.

Chlorophytus commented 1 year ago

Think I'll close this. I'm not sure what else needs to be done and this has a "solved" tag.

Reopen this if anything comes up.

cyring commented 1 year ago

Hello,

Could you please try the latest version and check the Memory Controller output in particular for any regression?

Chlorophytus commented 1 year ago

Sure, which commit?

cyring commented 1 year ago

Just the latest, please.

Chlorophytus commented 1 year ago

@cyring The memory controller seems to be outputting the correct results.

cyring commented 1 year ago

Great, thanks a lot for your answer.