larsks opened this issue 1 month ago
Thanks @larsks. I checked today after the boot, and I can see the kernel reporting errors that were corrected:
[27275.959400] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[27275.959409] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[27275.959411] {1}[Hardware Error]: event severity: corrected
[27275.959413] {1}[Hardware Error]: Error 0, type: corrected
[27275.959415] {1}[Hardware Error]: fru_text: B1
[27275.959417] {1}[Hardware Error]: section_type: memory error
[27275.959418] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[27275.959421] {1}[Hardware Error]: physical_address: 0x0000001190839f80
[27275.959426] {1}[Hardware Error]: node:1 card:1 module:0 rank:1 bank:0 row:4617 column:632
[27275.959428] {1}[Hardware Error]: error_type: 2, single-bit ECC
[27275.959451] mce: [Hardware Error]: Machine check events logged
[27275.959454] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[27275.959456] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 9c0000000000009f
[27275.959459] EDAC sbridge MC0: TSC 0
[27275.959461] EDAC sbridge MC0: ADDR 1190839f80
[27275.959463] EDAC sbridge MC0: MISC 8c
[27275.959464] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1727303886 SOCKET 0 APIC 0
[27275.959475] EDAC MC0: 0 CE Invalid channel 0xf on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
As you commented already it is worth replacing it.
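Until the DIMM is swapped, here is a minimal sketch for keeping an eye on the error counters, assuming the standard EDAC sysfs layout under /sys/devices/system/edac/mc/ (the same EDAC driver that produced the messages above); run it on the node itself to see whether corrected errors keep accumulating:

```python
#!/usr/bin/env python3
"""Dump per-DIMM corrected/uncorrected ECC counts from the EDAC sysfs tree.

Assumes the usual layout under /sys/devices/system/edac/mc/; paths can vary
by kernel version and platform.
"""
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"

for mc in sorted(EDAC_ROOT.glob("mc*")):
    # Controller-wide corrected (ce) and uncorrected (ue) error counts.
    print(f"{mc.name}: ce_count={read(mc / 'ce_count')} ue_count={read(mc / 'ue_count')}")
    for dimm in sorted(mc.glob("dimm*")):
        # Per-DIMM counters plus the BIOS-provided label (e.g. a B1 slot name).
        label = read(dimm / "dimm_label")
        print(f"  {dimm.name} ({label}): "
              f"ce={read(dimm / 'dimm_ce_count')} ue={read(dimm / 'dimm_ue_count')}")
```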
@hakasapl do we have a schedule for when we or Techsquare will replace this?
@aabaris did this get done already, or was that for a different node?
This still needs to be addressed. I need additional information to proceed.
It would be preferable to fix the existing node. @hakasapl do you know if we have standby parts for this HW and if Techsquare has access to them? I would like to make sure Techsquare has the ability to replace the memory DIMM before I send the request.
Also, do we need to schedule downtime for this node, or is it already down in the OBS OpenShift cluster? Whether we repair it or allocate a new node, I need a bit of guidance on coordinating from the OBS cluster perspective.
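To answer the "is it already down" part, a quick sketch, assuming `oc` is already logged in to the obs cluster and that the node object is named wrk-0 (the exact name should be confirmed with `oc get nodes`):

```python
#!/usr/bin/env python3
"""Check whether the node is Ready and schedulable before planning downtime."""
import json
import subprocess

NODE = "wrk-0"  # assumption: confirm the exact node name with `oc get nodes`

raw = subprocess.run(
    ["oc", "get", "node", NODE, "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout
node = json.loads(raw)

# unschedulable=True means the node is already cordoned off.
unschedulable = node["spec"].get("unschedulable", False)
ready = next(
    (c["status"] for c in node["status"]["conditions"] if c["type"] == "Ready"),
    "Unknown",
)
print(f"{NODE}: Ready={ready} unschedulable={unschedulable}")
```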
@aabaris do you know the node type and where it is located? I can answer based on that
It's a PowerEdge FC430.
server:
  obm: 10.30.0.86
  serial: CZMS0Q2
rack: R1-PC-C20
chassis:
  name: cmc-8-obm.nerc-ocp-prod.nerc.mghpcc.org
  serial: CZNW0Q2
  u: 23
  slot: Server-1d
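Given the OBM address above, a sketch for cross-checking the ECC events in the BMC's System Event Log, assuming IPMI-over-LAN is enabled on that OBM and that credentials are supplied via the (hypothetical) IPMI_USER / IPMI_PASS environment variables:

```python
#!/usr/bin/env python3
"""Pull the System Event Log from the node's OBM over IPMI."""
import os
import subprocess

OBM_HOST = "10.30.0.86"  # OBM address from the node record above

# `ipmitool sel elist` prints the SEL with decoded sensor names,
# which should show the correctable memory events if the BMC logged them.
subprocess.run(
    [
        "ipmitool", "-I", "lanplus",
        "-H", OBM_HOST,
        "-U", os.environ["IPMI_USER"],
        "-P", os.environ["IPMI_PASS"],
        "sel", "elist",
    ],
    check=True,
)
```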
@aabaris Yes, Techsquare has the parts for this. They can pull spare parts from the spare parts rack R4-PA-C22 (we're using that for storage right now; everything is powered off).
Thank you @hakasapl, with this info I'll be able to engage Techsquare for the repair work.
@schwesig and @larsks, wrk-0 is an active node; how do you want to handle the downtime for this? I can make sure Techsquare performs the repair within a time window we all agree on, but I could use your help making sure the node is taken down properly, as well as deciding on the best time to do this.
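For the mechanical part of taking wrk-0 out of service, a sketch of the usual cordon/drain/uncordon flow around the `oc adm` CLI; the node name and the drain flags are assumptions to confirm against the obs cluster before running anything:

```python
#!/usr/bin/env python3
"""Cordon/drain wrk-0 before the DIMM swap, and uncordon it afterwards."""
import subprocess
import sys

NODE = "wrk-0"  # assumption: confirm with `oc get nodes`

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

def pre_maintenance() -> None:
    # Keep new pods off the node, then evict the ones already running.
    run("oc", "adm", "cordon", NODE)
    run("oc", "adm", "drain", NODE,
        "--ignore-daemonsets", "--delete-emptydir-data", "--timeout=15m")

def post_maintenance() -> None:
    # After the repair and a clean boot, allow scheduling again.
    run("oc", "adm", "uncordon", NODE)

if __name__ == "__main__":
    {"pre": pre_maintenance, "post": post_maintenance}[sys.argv[1]]()
```

Usage would be `python3 drain.py pre` before the maintenance window and `python3 drain.py post` once the node is back up and healthy.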
@hakasapl do you know what keys Techsquare will need in order to access R4-PA-C22 (parts) and R1-PC-C20 (server)? I asked Wayne, who's authorized to view BU organization rack access, and he did not find a key set associated with those racks. If you don't have this information, I'll open a ticket with MGHPCC to find out. Thank you.
@aabaris Parts are unlocked (they've used this rack before, so it should be familiar). 1-C-20 takes UMass keys or MOC keys, I believe; they should have access to both.
I drafted the request to Techsquare in an issue: https://github.com/nerc-project/operations/issues/810 I will pass it on to them once we've agreed on a plan for the node shutdown.
This is similar to #1390 (but for a different system).
Node wrk-0 of the observability cluster (obs.nerc.mghpcc.org), service tag CZMS0Q2, is experiencing memory errors on DIMM_B1. The errors did not repeat after a cold boot, but we should probably have this memory replaced.