NearNodeFlash / NearNodeFlash.github.io


PAX device failing, segmentation violation #178

Closed: roehrich-hpe closed this issue 1 week ago

roehrich-hpe commented 1 week ago

Panic on nil pointer dereference when a PAX device is failing.

This is NNF release v0.1.4, which includes nnf-sos v0.1.9.

2024-07-03T13:25:08.507-0700    INFO    ec.fabric.1.10  Binding Port    {"switchId": "1", "portId": "10", "slot": 4, "endpointId": "181", "initiatorPort": 32, "logicalPortId": 9, "pdfid": 6402}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x15c46e6]

The switch status:

[root@elcap837:~]# /admin/scripts/nnf/switch.sh status
Execute switch status on /dev/switchtec0
DEVICE: /dev/switchtec0 PAX_ID: 1

Switch Connection           Status
=========================== ======
Interswitch Link            UP
Drive Slot 4                UP
Drive Slot 5                UP
Drive Slot 6                UP
Drive Slot 2                UP
Drive Slot 1                DOWN
Drive Slot 9                UP
Drive Slot 10               UP
Drive Slot 11               UP
Drive Slot 3                UP
Rabbit,       x9000c?j7b0   UP
Compute 8,    x9000c?s4b0n0 UP
Compute 9,    x9000c?s4b1n0 UP
Compute 10,   x9000c?s5b0n0 UP
Compute 11,   x9000c?s5b1n0 UP
Compute 12,   x9000c?s6b0n0 UP
Compute 13,   x9000c?s6b1n0 UP
Compute 14,   x9000c?s7b0n0 UP
Compute 15,   x9000c?s7b1n0 UP

Execute switch status on /dev/switchtec1
/dev/switchtec1: Input/output error
Unable to retrieve PAX ID

roehrich-hpe commented 1 week ago

The full nnf-node-manager backtrace:

2024-07-03T13:25:08.507-0700    INFO    ec.fabric.1.10  Endpoint already bound  {"switchId": "1", "portId": "10", "slot": 4, "endpointId": "180", "initiatorPort": 24, "logicalPortId": 0, "pdfid": 6401, "paxId": 1, "boundPaxId": 1, "phyPortId": 24, "boundPhyPortId": 24, "logPortId": 0, "boundLogPortId": 0}
2024-07-03T13:25:08.507-0700    INFO    ec.fabric.1.10  Binding Port    {"switchId": "1", "portId": "10", "slot": 4, "endpointId": "181", "initiatorPort": 32, "logicalPortId": 9, "pdfid": 6402}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x15c46e6]

goroutine 215 [running]:
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Port).bind(0xc003a166e0)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:791 +0xac6
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Fabric).EventHandler(0x2c88460, {{0x1c1131b, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c3a6a2, ...}, ...})
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1134 +0x285
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish(0x2ca3b80, {{0x1c1131b, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c3a6a2, ...}, ...})
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176 +0x1c3
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Storage).LinkEstablishedEventHandler(0xc003a96af8, {0x2c16208, 0x1}, {0x1c84c1c, 0x2})
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:1054 +0x9bf
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).EventHandler(0x2c87140, {{0x1c1131a, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c42d91, ...}, ...})
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:884 +0xc29
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish(0x2ca3b80, {{0x1c1131a, 0x1}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x1c42d91, ...}, ...})
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176 +0x1c3
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).refreshPortStatus(0xc002d62268)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:508 +0x7eb
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Start()
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1109 +0x418
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Start(0xc000233de8?)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:57 +0x17
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize(0xc002d5a000, 0xc002b742a0?)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:171 +0x15e
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init(0xc00027f980?, 0x1bdd105?)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320 +0x5b
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start(0xc0005c0ee0, {0x1ecbca0, 0xc000327ec0})
    /workspace/internal/controller/nnf_node_ec_data_controller.go:112 +0x4d3
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc00068f680)
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223 +0xdb
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:207 +0x1ad

roehrich-hpe commented 1 week ago

Earlier in the log there are errors about reading /dev/switchtec1:

2024-07-03T13:25:08.352-0700    ERROR   ec.fabric.0 Error opening path  {"switchId": "0", "error": "Switchtec Command: Get PAX ID (0x47657420504158204944) Error: read /dev/switchtec1: input/output error"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).identify
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:384
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Initialize
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:913
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Init
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:52
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:165
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
    /workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
2024-07-03T13:25:08.352-0700    ERROR   ec.fabric.0 Failed to identify switch   {"switchId": "0", "error": "Switchtec Command: Get PAX ID (0x47657420504158204944) Error: read /dev/switchtec1: input/output error"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Initialize
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:914
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Init
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:52
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:165
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
    /workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
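
A note on reading these errors: the hex value in parentheses after the command name is the ASCII encoding of the command name itself, not a device status code. A minimal Go sketch that decodes it (`decodeHexField` is a hypothetical helper name, not part of nnf-ec):

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeHexField is a hypothetical helper: it strips an optional "0x"
// prefix and returns the bytes of the hex string interpreted as text.
func decodeHexField(field string) (string, error) {
	raw, err := hex.DecodeString(strings.TrimPrefix(field, "0x"))
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	// Value copied from the "Error opening path" log line above.
	text, err := decodeHexField("0x47657420504158204944")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%q\n", text) // prints "Get PAX ID"
}
```

So the only diagnostic information in these two messages is the trailing `read /dev/switchtec1: input/output error`.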

The log shows /dev/switchtec0 being identified, but there is never a corresponding message for /dev/switchtec1:

2024-07-03T13:25:08.457-0700    INFO    ec.fabric.1 Identified switch   {"switchId": "1", "path": "/dev/switchtec0", "model": "16896", "manufacturer": "Microsemi", "serialNumber": "1677542907", "firmwareVersion": "4.90 BC265"}

roehrich-hpe commented 1 week ago

I have a proposed fix in https://github.com/NearNodeFlash/nnf-sos/commit/b4aa94a65db995dd0429e867aa20d4ec70729022. It recognizes the situation and panics with a more informative message instead of dereferencing the nil pointer:

panic: Port has no switch interface: Initiator Port 32, Logical Port 9, PDFID: 0x1902

goroutine 95 [running]:
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Port).bind(0xc000762f60)
    /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:791 +0x10c5
...