Closed andersosthus closed 3 years ago
@andersosthus Is this cluster in a healthy state? Also, the logs above contain only error entries, which makes the issue difficult to analyze. Can you please increase the log level from 0 to 5 in [rook](https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/operator.yaml#L35) and provide the logs again? Can you also run the `umount` command manually and see what error you get?
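For reference, a sketch of that change, assuming the operator deployment reads a `CSI_LOG_LEVEL` environment variable as in the linked operator.yaml (the deployment name and variable name may differ across Rook versions, so treat them as assumptions):

```shell
# Dry run: print the command that would raise the CSI log level to 5.
# The deployment name and CSI_LOG_LEVEL are assumptions taken from the
# linked operator.yaml; verify against your Rook version before applying.
NS=rook-ceph
CMD="kubectl -n $NS set env deploy/rook-ceph-operator CSI_LOG_LEVEL=5"
echo "$CMD"   # to apply, run the printed command
```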
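A minimal sketch of that manual check (`check_mount` is a hypothetical helper; the path is taken from the GRPC error in the report below):

```shell
# check_mount: report whether a path is currently a mount point,
# according to /proc/mounts.
check_mount() {
  if grep -q " $1 " /proc/mounts; then
    echo "mounted"
  else
    echo "not mounted"
  fi
}

# Path taken from the failing umount in the reported GRPC error.
MNT="/var/lib/kubelet/pods/9fd8d5d0-13bf-11eb-975b-0a362473693d/volumes/kubernetes.io~csi/pvc-1f462319-1397-11eb-97a0-0258a0a0acd1/mount"
check_mount "$MNT"
# If it prints "mounted", run `umount "$MNT"` by hand; exit status 32
# is the same code surfaced in the ceph-csi log.
```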
The cluster is healthy now, but we do get spikes of `MDS_SLOW_REQUESTS` that cause everything to grind to a halt.

I did an `umount` in my second example above, and it unmounted successfully without any errors. In the first example, the volume wasn't mounted at all.

I'll increase the log level and get you some better logs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Describe the bug
In our ceph-csi logs, we see errors that look like this:
2020-10-22 12:40:26 | E1022 10:40:26.173335 1 utils.go:163] ID: 2262 Req-ID: 0001-0009-rook-ceph-0000000000000001-1ff293cc-1397-11eb-9662-0a580ac80704 GRPC error: rpc error: code = Internal desc = an error (exit status 32) occurred while running umount args: [/var/lib/kubelet/pods/9fd8d5d0-13bf-11eb-975b-0a362473693d/volumes/kubernetes.io~csi/pvc-1f462319-1397-11eb-97a0-0258a0a0acd1/mount]
It occurs in several ceph-csi containers (but not all), and over the last 15 minutes we have 300 entries of this in the logs. I've investigated a few of them. In the first one I looked at, the PVC in question was not mounted on the node, but the directory still existed in `/var/lib/kubelet/plugins/kubernetes.io/csi/pv/`. The errors lasted for about 1 hour. The full log of that event is below, marked as [1].

I then investigated another instance of this error. This time, the volume was still mounted on the node, and I did a manual `umount` of it without error. Looking at the `ceph-csi` logs, this specific issue has been going on for over 12 hours. Logs from ceph-csi are below, marked as [2] (I've cut out a lot of duplicate logs in the middle).

Not really sure whether this has any direct effect on Ceph, but we've had issues with CephFS lately, so we're looking into everything that seems abnormal, and that's how we found this in the logs.

If any more logs are needed, let me know.
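The first case (directory left behind without a mount) can be scanned for with a short sketch. `find_stale` is a hypothetical helper; the `pv` directory is the one from the report, and the `globalmount` staging-path convention is an assumption about how kubelet lays out CSI state:

```shell
# find_stale: list per-PV directories under $1 whose staging mount
# (conventionally <dir>/globalmount) is absent from /proc/mounts.
find_stale() {
  for d in "$1"/*/; do
    d=${d%/}
    grep -q " $d/globalmount " /proc/mounts || echo "$d"
  done
}

# Directory where kubelet keeps per-PV CSI state, from the report above.
PV_DIR=/var/lib/kubelet/plugins/kubernetes.io/csi/pv
if [ -d "$PV_DIR" ]; then
  find_stale "$PV_DIR"
fi
```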
Environment details
Mounter used for mounting the PVC (for CephFS it's `fuse` or `kernel`; for RBD it's `krbd` or `rbd-nbd`): kernel
Logs
[1] csi-cephfsplugin:
[2] csi-cephfsplugin:
[3] driver-registrar for pod used in [2] above: