Closed roehrich-hpe closed 1 week ago
The full nnf-node-manager backtrace:
2024-07-03T13:25:08.507-0700 INFO ec.fabric.1.10 Endpoint already bound {"switchId": "1", "portId": "10", "slot": 4, "endpointId": "180", "initiatorPort": 24, "logicalPortId": 0, "pdfid": 6401, "paxId": 1, "boundPaxId": 1, "phyPortId": 24, "boundPhyPortId": 24, "logPortId": 0, "boundLogPortId": 0}
2024-07-03T13:25:08.507-0700 INFO ec.fabric.1.10 Binding Port {"switchId": "1", "portId": "10", "slot": 4, "endpointId": "181", "initiatorPort": 32, "logicalPortId": 9, "pdfid": 6402}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x15c46e6]
goroutine 215 [running]:
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Port).bind(0xc003a166e0)
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:791 +0xac6
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Fabric).EventHandler(0x2c88460, {{0x1c1131b, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c3a6a2, ...}, ...})
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1134 +0x285
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish(0x2ca3b80, {{0x1c1131b, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c3a6a2, ...}, ...})
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176 +0x1c3
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Storage).LinkEstablishedEventHandler(0xc003a96af8, {0x2c16208, 0x1}, {0x1c84c1c, 0x2})
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:1054 +0x9bf
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).EventHandler(0x2c87140, {{0x1c1131a, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c42d91, ...}, ...})
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:884 +0xc29
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish(0x2ca3b80, {{0x1c1131a, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c42d91, ...}, ...})
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176 +0x1c3
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).refreshPortStatus(0xc002d62268)
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:508 +0x7eb
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Start()
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1109 +0x418
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Start(0xc000233de8?)
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:57 +0x17
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize(0xc002d5a000, 0xc002b742a0?)
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:171 +0x15e
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init(0xc00027f980?, 0x1bdd105?)
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320 +0x5b
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start(0xc0005c0ee0, {0x1ecbca0, 0xc000327ec0})
/workspace/internal/controller/nnf_node_ec_data_controller.go:112 +0x4d3
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc00068f680)
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223 +0xdb
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:207 +0x1ad
Earlier in the log there are errors about reading /dev/switchtec1:
2024-07-03T13:25:08.352-0700 ERROR ec.fabric.0 Error opening path {"switchId": "0", "error": "Switchtec Command: Get PAX ID (0x47657420504158204944) Error: read /dev/switchtec1: input/output error"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).identify
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:384
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Initialize
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:913
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Init
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:52
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:165
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
/workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
2024-07-03T13:25:08.352-0700 ERROR ec.fabric.0 Failed to identify switch {"switchId": "0", "error": "Switchtec Command: Get PAX ID (0x47657420504158204944) Error: read /dev/switchtec1: input/output error"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Initialize
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:914
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Init
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:52
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:165
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
/workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
The log shows /dev/switchtec0 being identified, but there is never a corresponding message for /dev/switchtec1:
2024-07-03T13:25:08.457-0700 INFO ec.fabric.1 Identified switch {"switchId": "1", "path": "/dev/switchtec0", "model": "16896", "manufacturer": "Microsemi", "serialNumber": "1677542907", "firmwareVersion": "4.90 BC265"}
I have a proposed fix in https://github.com/NearNodeFlash/nnf-sos/commit/b4aa94a65db995dd0429e867aa20d4ec70729022 which detects the case where the port has no switch interface and panics with a descriptive message, rather than just dereferencing the nil pointer:
panic: Port has no switch interface: Initiator Port 32, Logical Port 9, PDFID: 0x1902
goroutine 95 [running]:
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Port).bind(0xc000762f60)
/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:791 +0x10c5
...
Panic on nil pointer dereference when a PAX device is failing.
This is nnf release v0.1.4, which includes nnf-sos v0.1.9.
The switch status: