Open kvaps opened 1 year ago
@ghernadi seems like a similar issue to https://github.com/LINBIT/linstor-server/issues/100: LINSTOR can't upgrade the DB due to the fact that we're using LUKS encryption
Not sure what happened here. I tried to create a LUKS resource in a k8s setup with 1.20.3 and upgraded afterwards to 1.21.0 without an issue. Care to send me a database dump so I can see if I can at least fix that for you?
Also, any idea what triggered this issue? I will try to create LUKS resources on an older version and migrate up to 1.21.0, but any additional information could help.
I managed to roll back to the latest backup, tried the upgrade again, and it succeeded. I sent the broken DB dump to @ghernadi for further analysis.
Thank you!
Thanks for the DB dump. The resource pvc-5bb88aa7-a018-4428-87c3-afb05e2cb5d5 on b-hv-2 fails to load (it is the same resource as in #350, but the issue in #350 is on the node b-hv-3, not b-hv-2 as here). The problem is that the resource on b-hv-2 was in the middle of a toggle-disk (to diskless) process when the controller crashed / was shut down. (This could be part of the cleanup process of auto-diskful.)
LINSTOR apparently has some issues with loading data from the database in such a toggle-disk state, especially with a non-default layer setup (the default being drbd,storage) - in your case that is drbd,luks,storage. I think I have a fix for that, but I would like to test it some more.
@ghernadi thanks for the guidance, here is my journey of fixing this issue.
A short disclaimer in case anyone is thinking of repeating any of my steps: PLEASE DO IT AT YOUR OWN RISK
First, I made a backup of all linstor resources:
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs kubectl get crds -ojson > crds.json
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -xc "kubectl get {} -ojson > {}.json"
Next I found all the resources in weird states:
A flag value of 0 means diskful; diskless resources have either 260 (diskless + drbd_diskless) or 388 (diskless + drbd_diskless + tiebreaker). If you check the flags in binary, 388 would be 1 1000 0100. Flags matching the pattern 1 1xxx x100 could also be set, which are some of the toggle-disk states.
cat resources.internal.linstor.linbit.com.json | jq '.items[] | select(.spec.resource_flags!=0 and .spec.resource_flags!=260 and .spec.resource_flags!=388) | "\(.spec.resource_name) \(.spec.node_name) \(.spec.resource_flags)"' -r > list.txt
Finally I got a list.txt of resources and their flags:
PVC-35C23274-3F53-47AD-ACC8-B01AD2E7E2A0 A-HV-1 527844
PVC-F9AB4C0F-0B8C-43EB-89DC-E5DBD84FE5E6 C-HV-4 2048
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE A-HV-1 265700
PVC-5BB88AA7-A018-4428-87C3-AFB05E2CB5D5 B-HV-3 263524
PVC-1DD492B4-9A89-4581-B20F-178DCE17200B B-HV-3 265572
PVC-7EE791AA-A155-4744-ABB8-DF1486A1F879 A-HV-1 789988
PVC-7CF9436F-497F-4BEC-8F6D-94708FFD5090 C-HV-5 787456
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE C-HV-4 263168
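To make those flag values easier to eyeball, here is a small helper of my own (not part of the original procedure) that prints a decimal resource_flags value in binary, so patterns like 1 1xxx x100 stand out:

```shell
# to_bin: convert a decimal flags value to its binary representation.
to_bin() {
  n=$1
  out=""
  while [ "$n" -gt 0 ]; do
    out=$((n % 2))$out
    n=$((n / 2))
  done
  echo "${out:-0}"
}

to_bin 388     # diskless + drbd_diskless + tiebreaker -> 110000100
to_bin 263524  # one of the flag values from the list above
```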
Then I decided to check whether they are really diskful. Obviously, they must have some storage layers:
while read res node flags; do
  cat layerresourceids.internal.linstor.linbit.com.json \
    | jq -c '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node)' --arg res "$res" --arg node "$node" \
    | grep -q '{' && echo "$res $node diskful" || echo "$res $node diskless"
done < list.txt
Luckily, all of them turned out to be diskful:
PVC-35C23274-3F53-47AD-ACC8-B01AD2E7E2A0 A-HV-1 diskful
PVC-F9AB4C0F-0B8C-43EB-89DC-E5DBD84FE5E6 C-HV-4 diskful
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE A-HV-1 diskful
PVC-5BB88AA7-A018-4428-87C3-AFB05E2CB5D5 B-HV-3 diskful
PVC-1DD492B4-9A89-4581-B20F-178DCE17200B B-HV-3 diskful
PVC-7EE791AA-A155-4744-ABB8-DF1486A1F879 A-HV-1 diskful
PVC-7CF9436F-497F-4BEC-8F6D-94708FFD5090 C-HV-5 diskful
PVC-4E76FCFF-3293-4A5A-945F-7B4179DF74BE C-HV-4 diskful
Okay, then I prepared a fix to change their flags to 0:
while read res node flags; do
  cat resources.internal.linstor.linbit.com.json \
    | jq -r '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node) | .spec.resource_flags=0' --arg res "$res" --arg node "$node"
done < list.txt > fix.json
Applied it:
kubectl delete -f fix.json
kubectl apply -f fix.json
I started the controller and, hooray, it loaded successfully. However, some of these resources were still stuck in the Unknown state:
linstor --controllers 127.0.0.1 r l | grep Unknown
| pvc-1dd492b4-9a89-4581-b20f-178dce17200b | b-hv-3 | 7021 | | | Unknown | 2023-03-09 08:35:22 |
| pvc-4e76fcff-3293-4a5a-945f-7b4179df74be | a-hv-1 | 7022 | | | Unknown | 2023-03-09 08:36:08 |
| pvc-5bb88aa7-a018-4428-87c3-afb05e2cb5d5 | b-hv-3 | 7023 | | | Unknown | 2023-04-11 15:47:30 |
| pvc-7ee791aa-a155-4744-abb8-df1486a1f879 | a-hv-1 | 7017 | Unused | | Unknown | 2023-03-09 08:19:03 |
| pvc-35c23274-3f53-47ad-acc8-b01ad2e7e2a0 | a-hv-1 | 7024 | Unused | | Unknown | 2023-03-09 08:36:08 |
There was no way to delete them, even after a linstor-satellite restart:
# linstor r d b-hv-3 pvc-1dd492b4-9a89-4581-b20f-178dce17200b
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'suspend-io' by auto-quorum
SUCCESS:
Description:
Node: b-hv-3, Resource: pvc-1dd492b4-9a89-4581-b20f-178dce17200b preparing for deletion.
Details:
Node: b-hv-3, Resource: pvc-1dd492b4-9a89-4581-b20f-178dce17200b UUID is: 55c99c65-2c7e-435b-ba08-f1a6b68c345f
ERROR:
(Node: 'b-hv-3') An unknown exception occurred while processing the resource pvc-1dd492b4-9a89-4581-b20f-178dce17200b
Show reports:
linstor error-reports show 64380C9B-9841A-000074
SUCCESS:
Preparing deletion of resource on 'madison-db-1'
SUCCESS:
Preparing deletion of resource on 'a-hv-1'
ERROR:
Description:
Deletion of resource 'pvc-1dd492b4-9a89-4581-b20f-178dce17200b' on node 'b-hv-3' failed due to an unknown exception.
Details:
Node: b-hv-3, Resource: pvc-1dd492b4-9a89-4581-b20f-178dce17200b
Show reports:
linstor error-reports show 64386E36-00000-000000
Thus I decided to find and remove all of them from the database.
This short script generates the names of all objects related to a given resource in the Kubernetes CRs:
#!/bin/bash
res=pvc-1dd492b4-9a89-4581-b20f-178dce17200b
node=b-hv-3
cat *.internal.linstor.linbit.com.json \
  | jq -c '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node)' --arg res "${res^^}" --arg node "${node^^}" \
  | jq -r '"\(.kind).\(.apiVersion|split("/")[0])/\(.metadata.name)"'
Example output:
LayerResourceIds.internal.linstor.linbit.com/284de502c9847342318c17d474733ef468fbdbe252cddf6e4b4be0676706d9d0
LayerResourceIds.internal.linstor.linbit.com/68519a9eca55c68c72658a2a1716aac3788c289859d46d6f5c3f14760fa37c9e
LayerResourceIds.internal.linstor.linbit.com/734d0759cdb4e0d0a35e4fd73749aee287e4fdcc8648b71a8d6ed591b7d4cb3f
Resources.internal.linstor.linbit.com/a05c4b07d52a83fb69482d51df83399adc7eceb3824f4c79e9d097c14e63ef36
Volumes.internal.linstor.linbit.com/06eef302015d08b2095174dc1d701a6d6131caac77be96a8c5fd99505214d96f
so they can simply be removed using the kubectl delete command.
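For example, the generated names can be piped straight into kubectl delete. The object names below are just the sample output from above (in a real cleanup you would feed in the script's actual output), and the echo makes this a dry run that only prints the commands for review:

```shell
# Sample object names taken from the example output above; in practice,
# pipe the real output of the script instead.
objects="Resources.internal.linstor.linbit.com/a05c4b07d52a83fb69482d51df83399adc7eceb3824f4c79e9d097c14e63ef36
Volumes.internal.linstor.linbit.com/06eef302015d08b2095174dc1d701a6d6131caac77be96a8c5fd99505214d96f"

# Dry run: prints one `kubectl delete <name>` command per line.
# Drop the word `echo` to actually perform the deletions.
printf '%s\n' "$objects" | xargs -n1 echo kubectl delete
```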
When I started the controller after that, all unwanted resources were gone :tada: