larsks opened this issue 1 month ago
Thanks @larsks. I checked today after the boot, and I can see the kernel reporting errors that were corrected:
[27275.959400] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[27275.959409] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[27275.959411] {1}[Hardware Error]: event severity: corrected
[27275.959413] {1}[Hardware Error]: Error 0, type: corrected
[27275.959415] {1}[Hardware Error]: fru_text: B1
[27275.959417] {1}[Hardware Error]: section_type: memory error
[27275.959418] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[27275.959421] {1}[Hardware Error]: physical_address: 0x0000001190839f80
[27275.959426] {1}[Hardware Error]: node:1 card:1 module:0 rank:1 bank:0 row:4617 column:632
[27275.959428] {1}[Hardware Error]: error_type: 2, single-bit ECC
[27275.959451] mce: [Hardware Error]: Machine check events logged
[27275.959454] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[27275.959456] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 9c0000000000009f
[27275.959459] EDAC sbridge MC0: TSC 0
[27275.959461] EDAC sbridge MC0: ADDR 1190839f80
[27275.959463] EDAC sbridge MC0: MISC 8c
[27275.959464] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1727303886 SOCKET 0 APIC 0
[27275.959475] EDAC MC0: 0 CE Invalid channel 0xf on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
As you commented already it is worth replacing it.
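Until the DIMM is swapped, here is a minimal sketch for keeping an eye on the error counters, assuming the standard EDAC sysfs layout under /sys/devices/system/edac/mc/ (the same EDAC driver that produced the messages above); run it on the node itself to see whether corrected errors keep accumulating:

```python
#!/usr/bin/env python3
"""Dump per-DIMM corrected/uncorrected ECC counts from the EDAC sysfs tree.

Assumes the usual layout under /sys/devices/system/edac/mc/; paths can vary
by kernel version and platform.
"""
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"

for mc in sorted(EDAC_ROOT.glob("mc*")):
    # Controller-wide corrected (ce) and uncorrected (ue) error counts.
    print(f"{mc.name}: ce_count={read(mc / 'ce_count')} ue_count={read(mc / 'ue_count')}")
    for dimm in sorted(mc.glob("dimm*")):
        # Per-DIMM counters plus the BIOS-provided label (e.g. a B1 slot name).
        label = read(dimm / "dimm_label")
        print(f"  {dimm.name} ({label}): "
              f"ce={read(dimm / 'dimm_ce_count')} ue={read(dimm / 'dimm_ue_count')}")
```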
@hakasapl do we have a schedule for when we or Techsquare will replace this?
@aabaris did this get done already, or was that for a different node?
This still needs to be addressed. I need additional information to proceed.
It would be preferable to fix the existing node. @hakasapl do you know if we have standby parts for this HW and if Techsquare has access to them? I would like to make sure Techsquare has the ability to replace the memory DIMM before I send the request.
Also, do we need to schedule downtime for this node, or is it already down in the OBS OpenShift cluster? Whether we repair it or allocate a new node, I need a bit of guidance on coordinating from the OBS cluster perspective.
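To answer the "is it already down" part, a quick sketch, assuming `oc` is already logged in to the obs cluster and that the node object is named wrk-0 (the exact name should be confirmed with `oc get nodes`):

```python
#!/usr/bin/env python3
"""Check whether the node is Ready and schedulable before planning downtime."""
import json
import subprocess

NODE = "wrk-0"  # assumption: confirm the exact node name with `oc get nodes`

raw = subprocess.run(
    ["oc", "get", "node", NODE, "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout
node = json.loads(raw)

# unschedulable=True means the node is already cordoned off.
unschedulable = node["spec"].get("unschedulable", False)
ready = next(
    (c["status"] for c in node["status"]["conditions"] if c["type"] == "Ready"),
    "Unknown",
)
print(f"{NODE}: Ready={ready} unschedulable={unschedulable}")
```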
@aabaris do you know the node type and where it is located? I can answer based on that
It's a PowerEdge FC430.
server:
  obm: 10.30.0.86
  serial: CZMS0Q2
rack: R1-PC-C20
chassis:
  name: cmc-8-obm.nerc-ocp-prod.nerc.mghpcc.org
  serial: CZNW0Q2
  u: 23
  slot: Server-1d
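Given the OBM address above, a sketch for cross-checking the ECC events in the BMC's System Event Log, assuming IPMI-over-LAN is enabled on that OBM and that credentials are supplied via the (hypothetical) IPMI_USER / IPMI_PASS environment variables:

```python
#!/usr/bin/env python3
"""Pull the System Event Log from the node's OBM over IPMI."""
import os
import subprocess

OBM_HOST = "10.30.0.86"  # OBM address from the node record above

# `ipmitool sel elist` prints the SEL with decoded sensor names,
# which should show the correctable memory events if the BMC logged them.
subprocess.run(
    [
        "ipmitool", "-I", "lanplus",
        "-H", OBM_HOST,
        "-U", os.environ["IPMI_USER"],
        "-P", os.environ["IPMI_PASS"],
        "sel", "elist",
    ],
    check=True,
)
```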
@aabaris Yes, Techsquare has the parts for this. They can pull spare parts from the spare parts rack R4-PA-C22 (we're using that for storage right now; everything is powered off).
Thank you @hakasapl, with this info I'll be able to engage Techsquare for the repair work.
@schwesig and @larsks, wrk-0 is an active node; how do you want to handle the downtime for this? I can make sure Techsquare performs the repair within a time window we all agree on, but I could use your help making sure the node is taken down properly, as well as deciding on the best time to do this.
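For the mechanical part of taking wrk-0 out of service, a sketch of the usual cordon/drain/uncordon flow around the `oc adm` CLI; the node name and the drain flags are assumptions to confirm against the obs cluster before running anything:

```python
#!/usr/bin/env python3
"""Cordon/drain wrk-0 before the DIMM swap, and uncordon it afterwards."""
import subprocess
import sys

NODE = "wrk-0"  # assumption: confirm with `oc get nodes`

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

def pre_maintenance() -> None:
    # Keep new pods off the node, then evict the ones already running.
    run("oc", "adm", "cordon", NODE)
    run("oc", "adm", "drain", NODE,
        "--ignore-daemonsets", "--delete-emptydir-data", "--timeout=15m")

def post_maintenance() -> None:
    # After the repair and a clean boot, allow scheduling again.
    run("oc", "adm", "uncordon", NODE)

if __name__ == "__main__":
    {"pre": pre_maintenance, "post": post_maintenance}[sys.argv[1]]()
```

Usage would be `python3 drain.py pre` before the maintenance window and `python3 drain.py post` once the node is back up and healthy.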
@hakasapl do you know what keys Techsquare will need in order to access R4-PA-C22 (parts) and R1-PC-C20 (server)? I asked Wayne, who's authorized to view BU organization rack access, and he did not find a key set associated with those racks. If you don't have this information, I'll open a ticket with MGHPCC to find out. Thank you.
@aabaris Parts are unlocked (they've used this rack before, so it should be familiar). 1-C-20 takes UMass keys or MOC keys, I believe; they should have access to both.
I drafted the request to Techsquare in an issue: https://github.com/nerc-project/operations/issues/810 I will pass it on to them once we've agreed on a plan for the node shutdown.
This is similar to #1390 (but for a different system).
Node wrk-0 of the observability cluster (obs.nerc.mghpcc.org), service tag CZMS0Q2, is experiencing memory errors on DIMM_B1. The errors did not repeat after a cold boot, but we should probably have this memory replaced.