Closed: yanchicago closed this issue 4 years ago.
Can you please try with the latest ceph-csi version and see if the issue still exists?
@Madhu-1 Thanks for your quick response. This is in the field. Could you help us find a workaround so a manual fsck can be run to recover the pod?
We do remove the mapping if mounting fails. One thing you can try is to map this PVC's image on another node manually and run the fsck command.
@Madhu-1 Could you point us to the source code so we can find the exact command for mapping the PVC?
Could you provide an example mapping command?
rbd map <pool>/<image> --id <username> --keyfile <keyfile>
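Roughly, the full manual recovery would be the following (a sketch; every `<placeholder>` must be substituted with values from your cluster):

```sh
# Map the image on a healthy node; rbd prints the device it creates, e.g. /dev/rbd0.
rbd map <pool>/<image> --id <user> -m <mon1:6789>,<mon2:6789> --keyfile /path/to/keyfile

# Repair the filesystem while the device is unmounted.
fsck -y /dev/rbd0

# Unmap once done, so the original node can map and mount it again.
rbd unmap /dev/rbd0
```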
The code reads `rbd --id cr.id -m mon:port --keyfile cr.Keyfile map pool/rbd_image`. How do we retrieve cr.id and cr.Keyfile?
Is the key file the `ca.crt`? And what is the `--user` or `--id`?
# knc get secret rook-csi-rbd-plugin-sa-token-bqr6l -o yaml
```yaml
apiVersion: v1
data:
  ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURjakNDQWxxZ0F3SUJBZ0lJUmhGTlg5T01LMVV3RFFZSktvWklodmNOQVFFTEJRQXdVREVMTUFrR0ExVUUKQmhNQ1FVRXhDekFKQmdOVkJBZ01Ba0ZCTVFzd0NRWURWUVFIREFKQlFURUxNQWtHQTFVRUNnd0NRVUV4Q3pBSgpCZ05WQkFzTUFrRkJNUTB3Q3dZRFZRUUREQVJDUTAxVU1CNFhEVEl3TURReU16QXlNVGt4TVZvWERUSXlNRGN5Ck5UQXlNVGt4TVZvd1VERUxNQWtHQTFVRUJoTUNRVUV4Q3pBSkJnTlZCQWdNQWtGQk1Rc3dDUVlEVlFRSERBSkIKUVRFTE1Ba0dBMVVFQ2d3Q1FVRXhDekFKQmdOVkJBc01Ba0ZCTVEwd0N3WURWUVFEREFSQ1EwMVVNSUlCSWpBTgpCZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUFyRFg1cDFhNS9RV0cwSlBxNHdUWnBHbURab1EzCjBrQzlCcDNheW9YcTkzRjdzY0ttK2dqZXNvUlRHU1lOZHo3THliVDlUM0FHdEI2eGdoN3NNVWppcGRCU3JaQVgKOXdwL1NiK2lSL0RLcjlSbWw3Y2Rua0ZqNFhyNzJMMTBPbmR1U05zSWFWckYwVE5MSlM0VHV5b3Vma0tkLzhxQwpqK3JDSExJUEwwWDBmcWZCeHE3WnUxZkJIQjRKOXB6V3J4RUsvQnJ3bGY3bGdERE93S3kxZlE4cWk2OVpDSHRmCmRNWFVsUDZnb2oxOVg3VlZLWGMzWk9LSVJ1NFN2ZkxURmZTVE5Uamt6UWxBc0ZHcjl0RVIzVnpLdjNxQlRoN2wKK1p2ZEpvbHlFdlI1WXMzUEludzRqVnZuSk1BTUpDS3JZOTM1YnZTY2hEVDVDUkNzYWRPREZZZmcxUUlEQVFBQgpvMUF3VGpBTUJnTlZIUk1FQlRBREFRSC9NQjBHQTFVZERnUVdCQlFzUmFRTzRUa243NThzMTlET01BbTlwaWo0CmpqQWZCZ05WSFNNRUdEQVdnQlFzUmFRTzRUa243NThzMTlET01BbTlwaWo0ampBTkJna3Foa2lHOXcwQkFRc0YKQUFPQ0FRRUFjbTBPVWtLNWZIaEczNzFrdkxSM1U3RlNnUmd5OHV5SkxqL0JYZ3Uyc3RvandwU0hDWXdwYUVHTgo2TGc5VTUveXRCVk9pS1IxeXc4d2xJNFIvN0xJVGQ1cUQzNk96TVZWZmlUbzZSdmZORWpyUVpNR3J5dnZCNjk2Cm9TVmwybmttUlgzbnhlOStlWDl5MjVXb2UweUpmUXIySGROc3ZvTG10UHEyYkw1Mlg1OFpnb0ZGU2Q4Y0o1U2EKUG03TmlSbUpmMG9tMFdqdmlPRVNoMUJzalBZNEhNQ2tiRUpvZElMalczWWp5Y3phU3VHUXdwRFFMQUJQc1VCUgpJWXFBUUNjRzE5TzlucFVyUDZPbjR5MlRERTlIb25hQ3I3T1VsOEhSQlZxRXFLYUhjdmpraStxQWpsalVWSnJ3ClJYUVBBaGxWSlZlVjE4U21xT0hwRTcwME1DdXZlQT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  namespace: cm9vay1jZXBo
  token: ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklqYzBiMVV4TTFsdE1XNUdiRnBHWVRKMWVEWmxTM1pNVkRScmFtVnBWMGczZGpWWE9GUlRWVEV0TXpBaWZRLmV5SnBjM01pT2lKcmRXSmxjbTVsZEdWekwzTmxjblpwWTJWaFkyTnZkVzUwSWl3aWEzVmlaWEp1WlhSbGN5NXBieTl6WlhKMmFXTmxZV05qYjNWdWRDOXVZVzFsYzNCaFkyVWlPaUp5YjI5ckxXTmxjR2dpTENKcmRXSmxjbTVsZEdWekxtbHZMM05sY25acFkyVmhZMk52ZFc1MEwzTmxZM0psZEM1dVlXMWxJam9pY205dmF5MWpjMmt0Y21Ka0xYQnNkV2RwYmkxellTMTBiMnRsYmkxaWNYSTJiQ0lzSW10MVltVnlibVYwWlhNdWFXOHZjMlZ5ZG1salpXRmpZMjkxYm5RdmMyVnlkbWxqWlMxaFkyTnZkVzUwTG01aGJXVWlPaUp5YjI5ckxXTnphUzF5WW1RdGNHeDFaMmx1TFhOaElpd2lhM1ZpWlhKdVpYUmxjeTVwYnk5elpYSjJhV05sWVdOamIzVnVkQzl6WlhKMmFXTmxMV0ZqWTI5MWJuUXVkV2xrSWpvaU9USTFZakk1WmpjdFlUaG1NeTAwTUdJNUxXSXlOR0l0Tm1KalpUWTVZV001TWpNNElpd2ljM1ZpSWpvaWMzbHpkR1Z0T25ObGNuWnBZMlZoWTJOdmRXNTBPbkp2YjJzdFkyVndhRHB5YjI5ckxXTnphUzF5WW1RdGNHeDFaMmx1TFhOaEluMC5LMDc1VjVhbmhyVTYwb2VtaHlCakRybDZjbzNYV3RCUlFVd0I4YjN3bUEyVExPVE4xdWMzVllsWUxPa05OLWhqSUN0RkRyd0pMaFo0NkttdjNJYjNybHJIdWNBR25VekFBZDNTTlJMLUNZRTIySnRRLXpHREljTTltQUVMS1FyR29GeFNORDJUUHF5UWFtbEpxaUd0M2lCRm5XckpWdGE5dXdvNGx5MXBTZE5hQUcxcHZzM3NFd1FCRm55ME9rSUdTMlNhWFp6MGdSWWFXUkdfQ2Z2WDdWTTZ1WFFCcVNydUJVVGJSV0ZiWHBJU1hqcW94dEotNnM3dXlYREFJRF9PaXVxbUFlOXFtbVgwS2hGVl91NkV2RDRuX2FmcXRRN2x1SmpYdlBBVGxPejY3ck1CUlBMYXQwcnJxQS1wLXBCa1pCcXRWMW1MQmY0dFBfaUdLdTYzY0E=
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: rook-csi-rbd-plugin-sa
    kubernetes.io/service-account.uid: 925b29f7-a8f3-40b9-b24b-6bce69ac9238
  creationTimestamp: "2020-07-27T22:26:13Z"
  name: rook-csi-rbd-plugin-sa-token-bqr6l
  namespace: rook-ceph
  resourceVersion: "56411039"
  selfLink: /api/v1/namespaces/rook-ceph/secrets/rook-csi-rbd-plugin-sa-token-bqr6l
  uid: 1845ada1-226c-43c0-a322-8e8b1db32dac
type: kubernetes.io/service-account-token
```
Tried with both the un-decoded and the decoded ca.crt; neither works.
# rbd map csireplpool/0001-0009-rook-ceph-0000000000000001-724ee4bc-d06b-11ea-860b-3654d631dc71 -m 0.254.209.61:6789,10.254.243.208:6789,10.254.63.3:6789 --user rook-csi-rbd-plugin-sa --keyfile ~/ca.crt
rbd: failed to get secret
2020-07-28 03:28:11.241 7fb6b7daab00 -1 auth: failed to decode key 'LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURjakNDQWxxZ0F3SUJBZ0lJUmhGTlg5T01LMVV3RFFZSktvWklodmNOQVFFTEJRQXdVREVMTUFrR0ExVUUKQmhNQ1FVRXhDekFKQmdOVkJBZ01Ba0ZCTVFzd0NRWURWUVFIREFKQlFURUxNQWtHQTFVRUNnd0NRVUV4Q3pBSgpCZ05WQkFzTUFrRkJNUTB3Q3dZRFZRUUREQVJDUTAxVU1CNFhEVEl3TURReU16QXlNVGt4TVZvWERUSXlNRGN5Ck5UQXlNVGt4TVZvd1VERUxNQWtHQTFVRUJoTUNRVUV4Q3pBSkJnTlZCQWdNQWtGQk1Rc3dDUVlEVlFRSERBSkIKUVRFTE1Ba0dBMVVFQ2d3Q1FVRXhDekFKQmdOVkJBc01Ba0ZCTVEwd0N3WURWUVFEREFSQ1EwMVVNSUlCSWpBTgpCZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUFyRFg1cDFhNS9RV0cwSlBxNHdUWnBHbURab1EzCjBrQzlCcDNheW9YcTkzRjdzY0ttK2dqZXNvUlRHU1lOZHo3THliVDlUM0FHdEI2eGdoN3NNVWppcGRCU3JaQVgKOXdwL1NiK2lSL0RLcjlSbWw3Y2Rua0ZqNFhyNzJMMTBPbmR1U05zSWFWckYwVE5MSlM0VHV5b3Vma0tkLzhxQwpqK3JDSExJUEwwWDBmcWZCeHE3WnUxZkJIQjRKOXB6V3J4RUsvQnJ3bGY3bGdERE93S3kxZlE4cWk2OVpDSHRmCmRNWFVsUDZnb2oxOVg3VlZLWGMzWk9LSVJ1NFN2ZkxURmZTVE5Uamt6UWxBc0ZHcjl0RVIzVnpLdjNxQlRoN2wKK1p2ZEpvbHlFdlI1WXMzUEludzRqVnZuSk1BTUpDS3JZOTM1YnZTY2hEVDVDUkNzYWRPREZZZmcxUUlEQVFBQgpvMUF3VGpBTUJnTlZIUk1FQlRBREFRSC9NQjBHQTFVZERnUVdCQlFzUmFRTzRUa243NThzMTlET01BbTlwaWo0CmpqQWZCZ05WSFNNRUdEQVdnQlFzUmFRTzRUa243NThzMTlET01BbTlwaWo0ampBTkJna3Foa2lHOXcwQkFRc0YKQUFPQ0FRRUFjbTBPVWtLNWZIaEczNzFrdkxSM1U3RlNnUmd5OHV5SkxqL0JYZ3Uyc3RvandwU0hDWXdwYUVHTgo2TGc5VTUveXRCVk9pS1IxeXc4d2xJNFIvN0xJVGQ1cUQzNk96TVZWZmlUbzZSdmZORWpyUVpNR3J5dnZCNjk2Cm9TVmwybmttUlgzbnhlOStlWDl5MjVXb2UweUpmUXIySGROc3ZvTG10UHEyYkw1Mlg1OFpnb0ZGU2Q4Y0o1U2EKUG03TmlSbUpmMG9tMFdqdmlPRVNoMUJzalBZNEhNQ2tiRUpvZElMalczWWp5Y3phU3VHUXdwRFFMQUJQc1VCUgpJWXFBUUNjRzE5TzlucFVyUDZPbjR5MlRERTlIb25hQ3I3T1VsOEhSQlZxRXFLYUhjdmpraStxQWpsalVWSnJ3ClJYUVBBaGxWSlZlVjE4U21xT0hwRTcwME1DdXZlQT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
Try using the id "admin" and take the key from the keyring file in the manager pod, /etc/ceph/ceph.client.admin.keyring.
I guess the volume is already attached to the node; could you go to the node and repair the device?
@danielzhanghl The volume is not mapped; fsck failed and the volume was then unmapped, so no device exists for a manual repair. @Madhu-1 We would really appreciate a bit more detail on how to get the id and key file and use them in the command: 1) Is it `--id` or `--user`? 2) What is the keyfile format? Should the value be decoded, or kept as it is in the secret?
Already provided that info on the slack channel https://rook-io.slack.com/archives/CG3HUV94J/p1595907147301200?thread_ts=1595878052.299900&cid=CG3HUV94J, or run `ceph auth ls` in the toolbox pod and use the admin creds from the ceph cluster.
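For example (a sketch; `client.admin` is the usual admin identity, and `ceph auth get-key` prints the bare key that `--keyfile` expects):

```sh
# Inside the rook-ceph toolbox pod: inspect users, then export the admin key.
ceph auth ls
ceph auth get-key client.admin > /tmp/admin.keyfile

# On the node doing the recovery: map with the admin identity.
rbd map <pool>/<image> --id admin -m <mon:6789> --keyfile /tmp/admin.keyfile
```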
kubectl get secrets rook-csi-rbd-node -oyaml -nrook-ceph
```yaml
apiVersion: v1
data:
  userID: Y3NpLXJiZC1ub2Rl
  userKey: QVFDdWxCOWZjUzJFRVJBQUs1UHNYcDN3M1JFbUhrcnNGbDYyMXc9PQ==
kind: Secret
metadata:
  creationTimestamp: "2020-07-28T02:59:58Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:userID: {}
        f:userKey: {}
      f:metadata:
        f:ownerReferences:
          .: {}
          k:{"uid":"4079b742-de47-4ce2-b091-9300c8e997ff"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:type: {}
    manager: rook
    operation: Update
    time: "2020-07-28T02:59:58Z"
  name: rook-csi-rbd-node
  namespace: rook-ceph
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: CephCluster
    name: rook-ceph
    uid: 4079b742-de47-4ce2-b091-9300c8e997ff
  resourceVersion: "281811"
  selfLink: /api/v1/namespaces/rook-ceph/secrets/rook-csi-rbd-node
  uid: 3380f875-bb42-41a7-b51e-ccff7e59b686
type: kubernetes.io/rook
```
[🎩︎]mrajanna@localhost rbd $]echo Y3NpLXJiZC1ub2Rl|base64 -d
csi-rbd-node
[🎩︎]mrajanna@localhost rbd $]echo QVFDdWxCOWZjUzJFRVJBQUs1UHNYcDN3M1JFbUhrcnNGbDYyMXc9PQ==|base64 -d
AQCulB9fcS2EERAAK5PsXp3w3REmHkrsFl621w==
Use the decoded values: `csi-rbd-node` for `--id` (or `--user`), and the key written to a file passed via `--keyfile`.
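Putting the decoded values together with the image and monitor addresses from the earlier attempt, the map command would look roughly like this (a sketch; substitute your own image name and monitor list):

```sh
# The keyfile must contain only the bare key, not the base64-encoded secret value.
echo -n 'AQCulB9fcS2EERAAK5PsXp3w3REmHkrsFl621w==' > /tmp/csi-rbd-node.keyfile

rbd map csireplpool/0001-0009-rook-ceph-0000000000000001-724ee4bc-d06b-11ea-860b-3654d631dc71 \
  --id csi-rbd-node -m 10.254.243.208:6789 --keyfile /tmp/csi-rbd-node.keyfile
```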
Many thanks for your support. 👍 We were able to recover the pod. Could you shed some light on how the IP address is selected for the "watcher"? We have a k8s cluster using Calico CNI in IPIP encapsulation mode. There are two subnets across all hosts, and the watcher IP seems to be randomly allocated between the two subnets. Is the watcher IP used in any way? Do you see any issues with this type of IP config?
This is not something cephcsi can handle; it's better to check with the ceph or rook team. Closing this issue as it's fixed.
@Madhu-1 Could you please shed some light on how this can happen so frequently at our site?
@yanchicago have you tried with the latest ceph-csi?
@nixpanic @humblec any idea?
@Madhu-1 Unfortunately, this is in the field; we can't upgrade at will.
@Madhu-1 Our version is 3.2.1 and this problem still exists; it has to be fixed manually each time.
@Madhu-1 Is this problem fixed?
@cl51287 I haven't come across this problem. @nixpanic @humblec any idea? Do you have a set of steps to reproduce this one?
@cl51287 can you give more details about the issue, please? Are you also using calico in your setup? What is the issue faced at your end? Do you come across the fsck errors? If yes, when exactly does it happen, only when the rbdplugin restarts in between? Ceph client connections are established on an IP, and that IP is tracked for client requests, completions, etc. If the client IP changes frequently, or changes while in use, I could expect some problems. But I would like to confirm whether host networking is enabled for the CSI pods in this setup, and whether any changes to host IPs happen in this calico setup.
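One way to check that (a sketch; the `app=csi-rbdplugin` label is the default in rook/ceph-csi deployments and may differ in yours):

```sh
# Prints each csi-rbdplugin pod name with its hostNetwork setting (true/false).
kubectl -n rook-ceph get pods -l app=csi-rbdplugin \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\n"}{end}'
```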
Our error is the same as yanchicago's, as follows:

```
May 14 10:27:00 k8s-test-0-114 kubelet: E0514 10:27:00.723771 3657 csi_attacher.go:320] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd3 but could not correct them: fsck from util-linux 2.32.1
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3 contains a file system with errors, check forced.
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3: Inode 23 has an invalid extent node (blk 63775, lblk 5509)
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
```
The above error occurred after the physical machine went down. At the same time, a hard disk on that physical machine ended up in the same situation and needed fsck to repair. We use kernel mode to mount; an rbdplugin restart does not produce this situation (at least we have not encountered that so far). We have only encountered this once and have not reproduced it since. We do not use calico, and the CSI pods have hostNetwork: true.
> The above error occurred after the physical machine went down. At the same time, a hard disk on that physical machine ended up in the same situation and needed fsck to repair.
This is expected! When any journalling filesystem goes through incomplete transactions in a situation like a node shutdown, this is bound to happen.
> We use kernel mode to mount; an rbdplugin restart does not produce this situation (at least we have not encountered that so far). We have only encountered this once and have not reproduced it since. We do not use calico, and the CSI pods have hostNetwork: true.
This case is different, or purely an illustration of a node-down scenario. I don't think there is anything we can do from the CSI side to cover this. More or less, the filesystem is working as expected/designed here.
Yes, if the system goes down, it is expected that an fsck repair is needed. But can this repair be done automatically by the CSI? I can see in the log that the CSI attempted a repair, but the repair failed. If this happens in a production environment, business programs will not recover automatically, and every affected program on that machine has to be repaired manually.
At the time of mounting, the kube libraries (the CSI triggers the mount, though) attempt the fsck operation; why it didn't go through or failed to repair here, I am not sure. It is attempted in a generic way by the kube mounters, and I haven't seen other instances where the fsck also fails. It seems the situation here is severe filesystem corruption, as it was not just a node-down scenario; the hard disk was also in trouble. The mount libraries won't perform operations beyond the defaults, as in this case; it may require some force options to be supplied for the repair, which programs like the mounters stay away from by default, to avoid causing more damage.
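For reference, such a forced manual repair would look roughly like this (a sketch; it assumes an ext4 filesystem on an already mapped device, and /dev/rbd3 is just an example name):

```sh
# -f forces a full check even if the filesystem seems clean; -y answers yes to all fixes.
# Run this only while the device is unmounted.
fsck.ext4 -f -y /dev/rbd3
```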
Describe the bug
Mounting an RBD-backed PVC fails because fsck finds errors on the device but cannot correct them automatically, so the pod cannot start until the filesystem is repaired by hand.
Environment details

Mounter used for mounting PVC (for cephfs its `fuse` or `kernel`; for rbd its `krbd` or `rbd-nbd`): krbd

Steps to reproduce

Steps to reproduce the behavior:
```
E0727 17:06:53.565721 9037 utils.go:123] ID: 501049 GRPC error: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd9 but could not correct them: fsck from util-linux 2.23.2
/dev/rbd9: Superblock needs_recovery flag is clear, but journal has data.
/dev/rbd9: Run journal anyway
/dev/rbd9: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
```