NetApp / trident

Storage orchestrator for containers
Apache License 2.0

upgrade to 22.10.0 - trident crashes when a volume has state = upgrading #787

Closed by nitnatsnocER 1 year ago

nitnatsnocER commented 1 year ago

Describe the bug

During the upgrade from 22.07.0 to 22.10.0, Trident panics with a segmentation fault (nil pointer dereference), as shown in the log below. We use Trident to manage volumes on SolidFire storage via iSCSI from Kubernetes. We do not use the operator; we have our own Helm chart.

Here is the log:

trident-6cff996fdb-khggp trident time="2022-12-05T14:31:21+01:00" level=info msg="Running Trident storage orchestrator." binary=/bin/trident build_time="Mon Oct 31 16:03:20 EDT 2022" version=22.10.0
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Created Kubernetes clients." namespace=kube-system version=v1.24.4
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Initializing metrics frontend." address=":8001"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added frontend." name=metrics
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Initializing K8S helper frontend." requestID=63ca2515-8ac1-4f6b-9b88-3b8036e46cc9 requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="K8S helper determined the container orchestrator version." gitVersion=v1.24.4 requestID=63ca2515-8ac1-4f6b-9b88-3b8036e46cc9 requestSource=Internal version=1.24
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added frontend." name=k8s_csi_helper
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Initializing CSI frontend." name=XXXpc1XXX version=22.10.0
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=CREATE_DELETE_VOLUME
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=PUBLISH_UNPUBLISH_VOLUME
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=LIST_VOLUMES
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=CREATE_DELETE_SNAPSHOT
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=LIST_SNAPSHOTS
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=EXPAND_VOLUME
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=CLONE_VOLUME
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling controller service capability." capability=LIST_VOLUMES_PUBLISHED_NODES
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_WRITER
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling volume access mode." mode=SINGLE_NODE_READER_ONLY
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling volume access mode." mode=MULTI_NODE_READER_ONLY
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling volume access mode." mode=MULTI_NODE_SINGLE_WRITER
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Enabling volume access mode." mode=MULTI_NODE_MULTI_WRITER
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added frontend." name=csi
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Initializing Trident CRD controller frontend." namespace=kube-system
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Creating event broadcaster."
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Setting up CRD controller event handlers."
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added frontend." name=crd
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Initializing HTTP REST frontend." address="127.0.0.1:8000"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added frontend." name="HTTP REST"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Initializing HTTPS REST frontend." address=":9443"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added frontend." name="HTTPS REST"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Activating metrics frontend." address=":8001"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Activating HTTP REST frontend." address="127.0.0.1:8000"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Activating HTTPS REST frontend." address=":9443"
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Storage driver initialized." driver=solidfire-san requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Created new storage backend." backend="&{0xc000707680 solidfire_XXX.XXX.XXX.XXX true online map[default:0xc000952de0 fast:0xc000952e40 slow:0xc000952d80] map[] false}" requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Newly added backend satisfies no storage classes." backend=solidfire_XXX.XXX.XXX.XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing backend." backend=solidfire_XXX.XXX.XXX.XXX backendUUID=5ce563b7-6f62-4d8b-9c22-65d816c74938 configRef= handler=Bootstrap online=true persistentBackends.BackendUUID=5ce563b7-6f62-4d8b-9c22-65d816c74938 requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal state=online
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing storage class." handler=Bootstrap requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal storageClass=solidfire-default
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing storage class." handler=Bootstrap requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal storageClass=solidfire-fast
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing storage class." handler=Bootstrap requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal storageClass=solidfire-slow
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added 93 existing volume(s)" requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc1XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc2XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc3XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc4XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc5XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc6XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc7XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing node." handler=Bootstrap node=XXXpc8XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing volume publication." handler=Bootstrap node=XXXpc6XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal volume=XXX
. . .
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing volume publication." handler=Bootstrap node=XXXpc4XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal volume=XXX
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing volume publication." handler=Bootstrap node=XXXpc8XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal volume=XXX
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing volume publication." handler=Bootstrap node=XXXpc7XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal volume=XXX
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Added an existing volume publication." handler=Bootstrap node=XXXpc8XXX requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal volume=XXX
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=error msg="Transaction monitor blocked by bootstrap error." error="Trident is initializing, please try again later" requestID=dc24f3fa-1193-415a-981d-1f3f2f16251c requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Trident bootstrapped successfully."
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=info msg="Activating K8S helper frontend." requestID=f58b0253-a1a2-405b-9b0b-0809e5fad859 requestSource=Internal
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=warning msg="K8S helper could not add a storage class: storage class solidfire-default already exists" name=solidfire-default parameters="map[IOPS:5000 backendType:solidfire-san csi.storage.k8s.io/fstype:xfs provisioningType:thin snapshots:false]" provisioner=csi.trident.netapp.io requestID=14f9356d-42d2-40b4-96a8-b8f6441dc1db requestSource=Kubernetes
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=warning msg="K8S helper could not add a storage class: storage class solidfire-fast already exists" name=solidfire-fast parameters="map[IOPS:7000 backendType:solidfire-san csi.storage.k8s.io/fstype:xfs provisioningType:thin snapshots:false]" provisioner=csi.trident.netapp.io requestID=f135f125-747f-40c4-8702-8a5c955b2f55 requestSource=Kubernetes
trident-6cff996fdb-khggp trident time="2022-12-05T14:31:22+01:00" level=warning msg="K8S helper could not add a storage class: storage class solidfire-slow already exists" name=solidfire-slow parameters="map[IOPS:1500 backendType:solidfire-san csi.storage.k8s.io/fstype:xfs provisioningType:thin snapshots:false]" provisioner=csi.trident.netapp.io requestID=e586a9f1-0210-4b4f-9278-0091a7f5e3ed requestSource=Kubernetes
trident-6cff996fdb-khggp trident panic: runtime error: invalid memory address or nil pointer dereference
trident-6cff996fdb-khggp trident [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x2c5d9c9]
trident-6cff996fdb-khggp trident
trident-6cff996fdb-khggp trident goroutine 1 [running]:
trident-6cff996fdb-khggp trident github.com/netapp/trident/frontend/csi/helpers/kubernetes.(*Plugin).handleFailedPVUpgrades(0xc000600800, {0x3bc4218, 0xc000398750})
trident-6cff996fdb-khggp trident /go/src/github.com/netapp/trident/frontend/csi/helpers/kubernetes/upgrade_pv.go:949 +0x129
trident-6cff996fdb-khggp trident github.com/netapp/trident/frontend/csi/helpers/kubernetes.(*Plugin).Activate(0xc000600800)
trident-6cff996fdb-khggp trident /go/src/github.com/netapp/trident/frontend/csi/helpers/kubernetes/plugin.go:398 +0x4da
trident-6cff996fdb-khggp trident main.main()
trident-6cff996fdb-khggp trident /go/src/github.com/netapp/trident/main.go:433 +0x231d

Then the pod crashes. A downgrade to previous version 22.07.0 is possible and trident works fine.

Environment

To Reproduce

Upgrade from 22.07.0 to 22.10.0 while at least one TridentVolume object has state = upgrading. An example of such a volume:

kubectl get tridentvolume -ojson

{
  "apiVersion": "trident.netapp.io/v1",
  "backendUUID": "5ce563b7-6f62-4d8b-9c22-65d816c74938",
  "config": {
    "accessInformation": {
      "iscsiInterface": "default",
      "iscsiTargetIqn": "iqn.2010-01.com.solidfire:py74.<volume_name>.xxxx",
      "iscsiTargetPortal": "XXX.XXX.XXX.XXX:xxxx",
      "iscsiVags": [
        7
      ]
    },
    "accessMode": "ReadWriteOnce",
    "blockSize": "4096",
    "cloneSourceSnapshot": "",
    "cloneSourceVolume": "",
    "cloneSourceVolumeInternal": "",
    "encryption": "",
    "fileSystem": "ext4",
    "internalName": "<volume_name>",
    "name": "<volume_name>",
    "protocol": "block",
    "securityStyle": "",
    "size": "21474836480",
    "spaceReserve": "",
    "splitOnClone": "",
    "storageClass": "solidfire-slow",
    "version": "1"
  },
  "kind": "TridentVolume",
  "metadata": {
    "creationTimestamp": "2020-06-04T09:00:21Z",
    "finalizers": [
      "trident.netapp.io"
    ],
    "generation": 2,
    "name": "<volume_name>",
    "namespace": "kube-system",
    "resourceVersion": "508606680",
    "uid": "c9a3640e-5582-40a2-972f-fcc8978a4df1"
  },
  "orphaned": false,
  "pool": "slow",
  "state": "upgrading"
}

Expected behavior

Trident should log an error message or otherwise handle this case gracefully instead of panicking.

Additional context

My guess is that this is related to volumes that show "state": "upgrading" when inspected with kubectl get tridentvolume <volume_name> -o json (or -o yaml). The upgrade works in our dev cluster, where no volume is in the upgrading state; Trident 22.10.0 runs there without issues.
I would also like to know what state = upgrading means. Does it mean the volume was not migrated properly to CSI? Can I somehow stop the process for a volume in state = upgrading and restart it?
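To find the affected objects, the JSON printed by kubectl get tridentvolume -n kube-system -o json can be filtered on the top-level state field. The following is a minimal Python sketch, not part of Trident; the function name upgrading_volumes is made up for illustration, and it assumes the objects look like the example above (state at the top level, name under metadata).

```python
import json


def upgrading_volumes(kubectl_json: str) -> list[str]:
    """Return the names of TridentVolume objects whose state is 'upgrading'."""
    doc = json.loads(kubectl_json)
    # Without a resource name, kubectl returns a List object with an `items`
    # array; with a name it returns the single object. Handle both shapes.
    items = doc.get("items", [doc])
    return [
        item["metadata"]["name"]
        for item in items
        if item.get("state") == "upgrading"
    ]
```

The kubectl output can be saved to a file or captured with subprocess and passed to the function as a string; an empty result would suggest that no volume is stuck in the upgrading state.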

balaramesh commented 1 year ago

Hello @nitnatsnocER

Thank you for raising this issue.

From the log trace, it does look like Trident is experiencing issues when attempting to upgrade PV(s).

To add more detail, the panic occurs at https://github.com/NetApp/trident/blob/e0353f06b639ece753ee2f77204941daeaaf7933/frontend/csi/helpers/kubernetes/upgrade_pv.go#L949. This is inside handleFailedPVUpgrades, so it might be that a PV was not completely upgraded.

I would recommend checking how many volumes are in the upgrading state (tridentctl get volumes). If the number is reasonably small, a simple workaround would be to create new PVs, copy the data, and delete the ones that are stuck in upgrading.

If you have not yet opened a support case, I recommend doing so here

nitnatsnocER commented 1 year ago

Hello @balaramesh, thanks for your quick response. Yes, there are PVs in state upgrading, but tridentctl get volumes does not show this; it is only visible in the tridentvolume custom resources (kubectl get tridentvolume <volume_name> -o json). Thanks for pointing out the workaround, but I think it would be better if the software could handle these PVs (as Trident 22.07.0 does; a rollback to that version works). If I understand correctly, the upgrading state is set during the migration from etcd to CSI, which we did over two years ago, so why was this never an issue until now? Will you fix the panic, or what is your plan here? And why should I open a support case instead of handling this here in this thread?

nitnatsnocER commented 1 year ago

Hello again, I used kubectl edit tridentvolumes <volume_name> and changed the state from upgrading to online, which seems to fix the panic. Once all tridentvolume objects had the state online, I could install Trident 22.10.0.
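For more than a handful of volumes, the same change could be scripted with kubectl patch instead of the interactive kubectl edit used above. This is only a sketch under assumptions: the helper name patch_command is hypothetical, the namespace defaults to kube-system as in the example object, and a JSON merge patch on the top-level state field is assumed to be equivalent to the manual edit. Backing up the objects first (kubectl get tridentvolume -o yaml) seems prudent.

```python
import json


def patch_command(volume_name: str, namespace: str = "kube-system") -> list[str]:
    """Build a kubectl command that sets a TridentVolume's state to 'online'."""
    # Custom resources do not support strategic merge patches, so use a
    # plain JSON merge patch that touches only the top-level `state` field.
    patch = json.dumps({"state": "online"})
    return [
        "kubectl", "patch", "tridentvolume", volume_name,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) for each volume name that is still in the upgrading state.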