NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

divide by zero in nnf-ec/pkg/manager-nvme/manager.go #162

Closed (roehrich-hpe closed 3 weeks ago)

roehrich-hpe commented 3 weeks ago

Using nnf-deploy-v0.1.2

$ kubectl logs -n nnf-system nnf-node-manager-lr5qn
[...]
2024-06-07T07:07:13.896-0700    INFO    Observed a panic in reconciler: runtime error: integer divide by zero   {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "NnfNode": {"name":"nnf-nlc","namespace":"elcap886"}, "namespace": "elcap886", "name": "nnf-nlc", "reconcileID": "3421a141-edb3-48b2-987d-d2b9caaef995"}
panic: runtime error: integer divide by zero [recovered]
    panic: runtime error: integer divide by zero

goroutine 606 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
panic({0x197a480, 0x2c78ec0})
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).StorageIdStoragePoolsStoragePoolIdGet(0xc0049f7650?, {0x1c8407b, 0x2}, {0x1bdc14e, 0x1}, 0xc0049d6ba8)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:1140 +0x278
github.com/NearNodeFlash/nnf-sos/internal/controller.updateDrives(0xc00429d880, {{0x1ece1a8?, 0xc00498bf50?}, 0xc0004bcb40?})
    /workspace/internal/controller/nnf_node_controller.go:482 +0x925
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeReconciler).Reconcile(0xc0001e5a40, {0x1ecafc8, 0xc00498bf20}, {{{0xc00317e410, 0x8}, {0xc00317e406, 0x7}}})
    /workspace/internal/controller/nnf_node_controller.go:294 +0x837
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1ece1a8?, {0x1ecafc8?, 0xc00498bf20?}, {{{0xc00317e410?, 0xb?}, {0xc00317e406?, 0x0?}}})
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000397860, {0x1ecaf20, 0xc0001df840}, {0x1a147a0?, 0xc0006d2520?})
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3f9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000397860, {0x1ecaf20, 0xc0001df840})
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x333
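
The trace places the panic inside StorageIdStoragePoolsStoragePoolIdGet at manager.go:1140, but it does not show which expression divided by zero. A minimal sketch of the usual guard follows, assuming the divisor is some count that can be zero when devices are missing or failed to initialize. The helper perDriveBytes and the error errNoUsableDrives are hypothetical names for illustration, not nnf-ec code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoUsableDrives is a hypothetical sentinel error used only in this sketch.
var errNoUsableDrives = errors.New("no usable drives")

// perDriveBytes splits a pool's capacity evenly across driveCount drives.
// Hypothetical helper: it checks the divisor first, because integer
// division by zero panics at run time in Go.
func perDriveBytes(poolBytes uint64, driveCount int) (uint64, error) {
	if driveCount <= 0 {
		return 0, errNoUsableDrives
	}
	return poolBytes / uint64(driveCount), nil
}

func main() {
	// A zero count is reported as an error instead of crashing the reconciler.
	if _, err := perDriveBytes(1920383410176, 0); err != nil {
		fmt.Println("skipped:", err)
	}

	// A non-zero count divides normally.
	share, err := perDriveBytes(17283450691584, 9)
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(share)
}
```

Returning an error lets the caller (here, the reconcile loop) log and requeue rather than rely on controller-runtime's panic recovery.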

roehrich-hpe commented 3 weeks ago

The NnfNode resource:

apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfNode
metadata:
  creationTimestamp: "2024-06-05T18:28:28Z"
  generation: 1
  name: nnf-nlc
  namespace: elcapX
  resourceVersion: "129973439"
  uid: b14de82a-0884-43f9-b92c-f3237091d873
spec:
  name: elcapX
  pod: nnf-node-manager-lr5qn
  state: Enable
status:
  capacity: 17283450691584
  drives:
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "0"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11N0U61
    slot: "8"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "1"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A09R0U61
    slot: "7"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "2"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A1970U61
    slot: "15"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "3"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18R0U61
    slot: "16"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "4"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A0D00U61
    slot: "17"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "5"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18S0U61
    slot: "18"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "6"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A03X0U61
    slot: "14"
    status: Ready
  - health: Critical
    id: "7"
    slot: "13"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "8"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A0GD0U61
    slot: "12"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "9"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18H0U61
    slot: "4"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: "\0\0\0\0\0\0\0\0"
    health: Critical
    id: "10"
    model: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    serialNumber: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    slot: "5"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "11"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11H0U61
    slot: "6"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "12"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A03H0U61
    slot: "2"
    status: Offline
  - health: Critical
    id: "13"
    slot: "1"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "14"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11J0U61
    slot: "9"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: "\0\0\0\0\0\0\0\0"
    health: OK
    id: "15"
    model: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    serialNumber: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    slot: "10"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "16"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A04U0U61
    slot: "11"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "17"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18G0U61
    slot: "3"
    status: Ready
  health: OK
  lnetNid: 183802@kfi4
  servers:
  - health: OK
    hostname: elcapX
    id: "0"
    name: Rabbit
    status: Ready
  - [...]
  status: Ready
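
Several drives above have empty or NUL-filled identity fields (drives 7 and 13 have no model or serial number at all, and drives 10 and 15 report strings of "\0"). As a rough sketch of how such entries could be flagged before their values are used in any arithmetic, the following assumes a hypothetical drive struct that mirrors a few of the status.drives fields; it is not the NnfNode API type.

```go
package main

import (
	"fmt"
	"strings"
)

// drive mirrors a few fields of the status.drives entries above.
// Hypothetical type for illustration only.
type drive struct {
	ID           string
	Model        string
	SerialNumber string
	Status       string
}

// blankIdentity reports whether s is empty or consists only of NUL bytes.
func blankIdentity(s string) bool {
	return strings.Trim(s, "\x00") == ""
}

func main() {
	drives := []drive{
		{ID: "0", Model: "KIOXIA KCM7DRJE1T92", SerialNumber: "3D50A11N0U61", Status: "Ready"},
		{ID: "7", Status: "Offline"}, // no model or serial number reported
		{ID: "15", Model: strings.Repeat("\x00", 40), SerialNumber: strings.Repeat("\x00", 20), Status: "Ready"}, // NUL-filled
	}

	for _, d := range drives {
		if blankIdentity(d.Model) || blankIdentity(d.SerialNumber) {
			fmt.Printf("drive %s (%s): identity data missing\n", d.ID, d.Status)
		}
	}
}
```

With the three sample entries, the loop flags drives 7 and 15 and leaves drive 0 alone. Drive 15 is the notable case: it reports Ready even though its identity strings were never populated.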

roehrich-hpe commented 3 weeks ago

Earlier in the log:

2024-06-07T07:12:22.088-0700    INFO    ec.nvme.16      Initialize storage device       {"storageId": "16", "slot": 11}
2024-06-07T07:12:22.092-0700    ERROR   ec.nvme Failed to initialize storage device     {"slot": 11, "switchId": "1", "portId": "17", "error": "Initialize Storage 16: Failed to indentify common controller: Error: Device 0x1500@/dev/switchtec0: Failed NVMe Command: OpCode: Identify (0x06): Error: NVMe Status: UNKNOWN (0x001) CRD: 0 More: false DNR: true"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Storage).LinkEstablishedEventHandler
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:908
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).EventHandler
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:884
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).refreshPortStatus
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:508
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Start
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1109
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Start
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:57
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:171
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
        /workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223

roehrich-hpe commented 3 weeks ago

The entire log: nnf-node-manager-lr5qn.log

roehrich-hpe commented 3 weeks ago

PRs: https://github.com/NearNodeFlash/nnf-ec/pull/101 and https://github.com/NearNodeFlash/nnf-sos/pull/311