roehrich-hpe closed 3 weeks ago
The NnfNode resource:
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfNode
metadata:
  creationTimestamp: "2024-06-05T18:28:28Z"
  generation: 1
  name: nnf-nlc
  namespace: elcapX
  resourceVersion: "129973439"
  uid: b14de82a-0884-43f9-b92c-f3237091d873
spec:
  name: elcapX
  pod: nnf-node-manager-lr5qn
  state: Enable
status:
  capacity: 17283450691584
  drives:
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "0"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11N0U61
    slot: "8"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "1"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A09R0U61
    slot: "7"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "2"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A1970U61
    slot: "15"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "3"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18R0U61
    slot: "16"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "4"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A0D00U61
    slot: "17"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "5"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18S0U61
    slot: "18"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "6"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A03X0U61
    slot: "14"
    status: Ready
  - health: Critical
    id: "7"
    slot: "13"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "8"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A0GD0U61
    slot: "12"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "9"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18H0U61
    slot: "4"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: "\0\0\0\0\0\0\0\0"
    health: Critical
    id: "10"
    model: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    serialNumber: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    slot: "5"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "11"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11H0U61
    slot: "6"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "12"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A03H0U61
    slot: "2"
    status: Offline
  - health: Critical
    id: "13"
    slot: "1"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "14"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11J0U61
    slot: "9"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: "\0\0\0\0\0\0\0\0"
    health: OK
    id: "15"
    model: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    serialNumber: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    slot: "10"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "16"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A04U0U61
    slot: "11"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "17"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18G0U61
    slot: "3"
    status: Ready
  health: OK
  lnetNid: 183802@kfi4
  servers:
  - health: OK
    hostname: elcapX
    id: "0"
    name: Rabbit
    status: Ready
  - [...]
  status: Ready
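For triage, the odd entries in a drive list like the one above (entries with no identity fields at all, and entries whose identity fields are all NUL bytes) can be picked out mechanically. The following is only a minimal sketch: `suspicious_drives` is a hypothetical helper, not part of nnf-sos or nnf-ec, and it assumes the `status.drives` list has already been loaded into plain Python dictionaries.

```python
# Hypothetical triage helper; not part of nnf-sos or nnf-ec.
def suspicious_drives(drives):
    """Return (id, reason) pairs for drive entries that look inconsistent."""
    findings = []
    for d in drives:
        drive_id = d.get("id", "?")
        # Fields that are absent altogether (as for ids 7 and 13 above).
        missing = [k for k in ("capacity", "model", "serialNumber") if k not in d]
        if missing:
            findings.append((drive_id, "missing " + ", ".join(missing)))
        # Fields that are non-empty strings made only of NUL bytes
        # (as for ids 10 and 15 above).
        nul = [k for k in ("firmwareVersion", "model", "serialNumber")
               if isinstance(d.get(k), str) and d[k] and set(d[k]) == {"\0"}]
        if nul:
            findings.append((drive_id, "NUL-filled " + ", ".join(nul)))
    return findings


# Three entries taken from the status above, written as Python dicts.
sample = [
    {"id": "7", "health": "Critical", "slot": "13", "status": "Offline"},
    {"id": "15", "capacity": 1920383410176, "firmwareVersion": "\0" * 8,
     "health": "OK", "model": "\0" * 40, "serialNumber": "\0" * 20,
     "slot": "10", "status": "Ready"},
    {"id": "17", "capacity": 1920383410176, "firmwareVersion": "1TCRS104",
     "health": "OK", "model": "KIOXIA KCM7DRJE1T92",
     "serialNumber": "3D50A18G0U61", "slot": "3", "status": "Ready"},
]

for drive_id, reason in suspicious_drives(sample):
    print(drive_id, reason)
```

Note that drive 15 is reported as `health: OK` and `status: Ready` even though its identity fields are NUL-filled, which is why the sketch checks the fields themselves rather than relying on `health`.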
Earlier in the log:
2024-06-07T07:12:22.088-0700 INFO ec.nvme.16 Initialize storage device {"storageId": "16", "slot": 11}
2024-06-07T07:12:22.092-0700 ERROR ec.nvme Failed to initialize storage device {"slot": 11, "switchId": "1", "portId": "17", "error": "Initialize Storage 16: Failed to indentify common controller: Error: Device 0x1500@/dev/switchtec0: Failed NVMe Command: OpCode: Identify (0x06): Error: NVMe Status: UNKNOWN (0x001) CRD: 0 More: false DNR: true"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Storage).LinkEstablishedEventHandler
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:908
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).EventHandler
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:884
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).refreshPortStatus
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:508
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Start
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1109
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Start
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:57
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:171
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
/workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
The entire log: nnf-node-manager-lr5qn.log
Using nnf-deploy-v0.1.2