arcress0 / ipmiutil

ipmiutil is an easy to use set of IPMI server management utilities. It can get/set sensor readings & thresholds, automate SEL management, do SOL console, etc. Supports Linux, Windows, BSD, Solaris, MacOSX. The only IPMI project tool that runs natively on Windows. See http://ipmiutil.sf.net for rpms, etc. (formerly called panicsel). It can run driverless in Linux for use on boot media or embedded environments.
BSD 3-Clause "New" or "Revised" License

[supermicro] Disable DIMM location decoding from SMBIOS for local SEL… #3

Closed albertlav closed 3 years ago

albertlav commented 4 years ago

This change disables DIMM location decoding from SMBIOS for local (imbdrv/ipmidrv) SEL queries. The DIMM location string decoded from the SEL is wrong on Supermicro hardware (at least for vendor = 10876).

Problem summary:

If the SEL query is performed locally (is_remote() == FALSE), decode_mem_supermicro tries to get the Bank Locator/Device Locator strings from the SMBIOS Type 17 records using the get_MemDesc routine:

[Memory Device (Type 17) - Length 34 - Handle 002dh]
  Memory Error Info Handle      [Not Provided]
  Total Width                   72 bits
  Data Width                    64 bits
  Size                          16384MB
  Form Factor                   09h - DIMM
  Device Set                    [None]
  Device Locator                P1-DIMMA2             <------ get_MemDesc tries to retrieve these strings from SMBIOS
  Bank Locator                  P0_Node0_Channel0_Dimm1     <------ get_MemDesc tries to retrieve these strings from SMBIOS
  Memory Type                   18h - Specification Reserved
  Type Detail                   2000h -
  Speed                         1600MHz
  Manufacturer                  Samsung            
  Serial Number                              
  Asset Tag Number                             
  Part Number                   M393B2G70QH0-YK0 

However, get_MemDesc returns invalid string values (at least for vendor = 10876 (Supermicro)). This can lead to confusion: the SEL reports the wrong DIMM location, and a user may replace the wrong DIMM based on the SEL output.

This change disables SMBIOS-based decoding in decode_mem_supermicro and returns the string constructed by the logic already available in decode_mem_supermicro itself, e.g. P2_DIMME2.

To illustrate the problem:

Consider the raw SEL records below from Supermicro hardware; the annotation on each line is the location the record should decode to:

0a 00 02 74 e3 a3 5c 01 00 04 0c 00 6f a1 5b 80 <-- actually decoded to P1_DIMME2
0b 00 02 74 e3 a3 5c 01 00 04 0c 00 6f a1 6a 80 <-- actually decoded to P1_DIMMF1
0c 00 02 74 e3 a3 5c 01 00 04 0c 00 6f a1 6b 80 <-- actually decoded to P1_DIMMF2
0d 00 02 75 e3 a3 5c 01 00 04 0c 00 6f a1 1a 81 <-- actually decoded to P2_DIMMA1
0e 00 02 75 e3 a3 5c 01 00 04 0c 00 6f a1 1b 81 <-- actually decoded to P2_DIMMA2
0f 00 02 75 e3 a3 5c 01 00 04 0c 00 6f a1 2a 81 <-- actually decoded to P2_DIMMB1
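
For orientation, each 16-byte line above follows the standard IPMI system event record layout: record ID (2 bytes, little-endian), record type, timestamp (4 bytes, little-endian), generator ID (2 bytes), event message revision, sensor type, sensor number, event direction/type, and three event data bytes. The last three bytes are what the DIMM decoding works from. A minimal sketch of splitting a record into those fields follows; the names `sel_event`, `parse_sel_event` and `check_first_record` are made up for illustration and are not ipmiutil's own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one IPMI system event record (type 0x02). */
struct sel_event {
    uint16_t record_id;
    uint8_t  record_type;
    uint32_t timestamp;      /* seconds since the Unix epoch */
    uint16_t generator_id;
    uint8_t  evm_rev;
    uint8_t  sensor_type;    /* 0x0c = Memory */
    uint8_t  sensor_num;
    uint8_t  event_dir_type; /* 0x6f = sensor-specific, assertion */
    uint8_t  data[3];        /* event data 1..3 */
};

/* Split a raw 16-byte record into its fields (multi-byte fields are LSB first). */
struct sel_event parse_sel_event(const uint8_t raw[16])
{
    struct sel_event ev;
    ev.record_id      = (uint16_t)(raw[0] | (raw[1] << 8));
    ev.record_type    = raw[2];
    ev.timestamp      = (uint32_t)raw[3]
                      | ((uint32_t)raw[4] << 8)
                      | ((uint32_t)raw[5] << 16)
                      | ((uint32_t)raw[6] << 24);
    ev.generator_id   = (uint16_t)(raw[7] | (raw[8] << 8));
    ev.evm_rev        = raw[9];
    ev.sensor_type    = raw[10];
    ev.sensor_num     = raw[11];
    ev.event_dir_type = raw[12];
    ev.data[0]        = raw[13];
    ev.data[1]        = raw[14];
    ev.data[2]        = raw[15];
    return ev;
}

/* Self-check against the first record quoted above. */
void check_first_record(void)
{
    const uint8_t raw[16] = { 0x0a, 0x00, 0x02, 0x74, 0xe3, 0xa3, 0x5c, 0x01,
                              0x00, 0x04, 0x0c, 0x00, 0x6f, 0xa1, 0x5b, 0x80 };
    struct sel_event ev = parse_sel_event(raw);
    assert(ev.record_id == 0x000a);
    assert(ev.sensor_type == 0x0c);   /* Memory sensor */
    assert(ev.data[1] == 0x5b && ev.data[2] == 0x80);
}
```

Calling `check_first_record()` from a test harness confirms the field offsets against the first sample record.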

The DIMM location is decoded incorrectly via get_MemDesc for this data. In this trace the debug lines for a record are printed before the record's own line:

000a 04/03/19 00:34:28 MAJ EFI  Memory #00  Uncorrectable ECC, P0_Node0_Channel1_Dimm0/P1-DIMMB1 6f [a1 5b 80]
DIMM(0) vend=2a7c prod=706
decode_mem_supermicro: v2 bdata=6a(106) cpu=1 dimm=1 pair=5
P0_Node0_Channel0_Dimm1/P1-DIMMA2
000b 04/03/19 00:34:28 MAJ EFI  Memory #00  Uncorrectable ECC, P0_Node0_Channel0_Dimm1/P1-DIMMA2 6f [a1 6a 80]
DIMM(0) vend=2a7c prod=706
decode_mem_supermicro: v2 bdata=6b(107) cpu=1 dimm=2 pair=5
P0_Node0_Channel1_Dimm0/P1-DIMMB1
000c 04/03/19 00:34:28 MAJ EFI  Memory #00  Uncorrectable ECC, P0_Node0_Channel1_Dimm0/P1-DIMMB1 6f [a1 6b 80]
DIMM(1) vend=2a7c prod=706
decode_mem_supermicro: v2 bdata=1a(26) cpu=2 dimm=1 pair=0
P0_Node0_Channel0_Dimm1/P1-DIMMA2
000d 04/03/19 00:34:29 MAJ EFI  Memory #00  Uncorrectable ECC, P0_Node0_Channel0_Dimm1/P1-DIMMA2 6f [a1 1a 81]
DIMM(1) vend=2a7c prod=706
decode_mem_supermicro: v2 bdata=1b(27) cpu=2 dimm=2 pair=0
P0_Node0_Channel1_Dimm0/P1-DIMMB1
000e 04/03/19 00:34:29 MAJ EFI  Memory #00  Uncorrectable ECC, P0_Node0_Channel1_Dimm0/P1-DIMMB1 6f [a1 1b 81]
DIMM(1) vend=2a7c prod=706
decode_mem_supermicro: v2 bdata=2a(42) cpu=2 dimm=1 pair=1
P0_Node0_Channel0_Dimm1/P1-DIMMA2
000f 04/03/19 00:34:29 MAJ EFI  Memory #00  Uncorrectable ECC, P0_Node0_Channel0_Dimm1/P1-DIMMA2 6f [a1 2a 81]
DIMM(1) vend=2a7c prod=706
decode_mem_supermicro: v2 bdata=2b(43) cpu=2 dimm=2 pair=1
P0_Node0_Channel1_Dimm0/P1-DIMMB1
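
The expected locations can be reproduced from the event data bytes alone. The sketch below is fitted to the six annotated records and the "v2" debug lines in the trace above; it is not copied from the ipmiutil source. It assumes that pair is the high nibble of event data 2 minus 1, dimm is the low nibble minus 9, and cpu is the low nibble of event data 3 plus 1. The names `guess_dimm_location` and `check_samples` are hypothetical, and other Supermicro board generations may encode these bytes differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a "P<cpu>_DIMM<letter><n>" string from event data bytes 2 and 3.
 * Rule inferred from the sample records only (an assumption, see above).
 * Returns 0 on success, -1 if the bytes do not fit the observed pattern
 * or the output buffer is too small.
 */
int guess_dimm_location(uint8_t data2, uint8_t data3, char *out, size_t outlen)
{
    int hi = (data2 >> 4) & 0x0f;
    int lo = data2 & 0x0f;

    if (hi < 1 || lo < 0x0a)
        return -1;                    /* outside the pattern seen in the samples */

    int pair = hi - 1;                /* 0 -> 'A', 1 -> 'B', ... */
    int dimm = lo - 9;                /* 0x0a -> 1, 0x0b -> 2 */
    int cpu  = (data3 & 0x0f) + 1;    /* 0x80 -> 1, 0x81 -> 2 */

    int n = snprintf(out, outlen, "P%d_DIMM%c%d", cpu, 'A' + pair, dimm);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}

/* Self-check against the six annotated raw SEL records. */
void check_samples(void)
{
    static const struct { uint8_t d2, d3; const char *want; } samples[] = {
        { 0x5b, 0x80, "P1_DIMME2" },
        { 0x6a, 0x80, "P1_DIMMF1" },
        { 0x6b, 0x80, "P1_DIMMF2" },
        { 0x1a, 0x81, "P2_DIMMA1" },
        { 0x1b, 0x81, "P2_DIMMA2" },
        { 0x2a, 0x81, "P2_DIMMB1" },
    };
    char buf[16];

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(guess_dimm_location(samples[i].d2, samples[i].d3, buf, sizeof buf) == 0);
        assert(strcmp(buf, samples[i].want) == 0);
    }
}
```

None of this depends on SMBIOS, so the result is the same for local and remote queries.
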
arcress0 commented 3 years ago

It figures that SuperMicro would mismatch their BIOS and Firmware indexes for the DIMMs. Regardless, we need to remove this for SuperMicro, at least until SuperMicro can get their act together.