bbenlazreg opened this issue 2 years ago
Facing the same issue
Rolling out (restarting) the daemonsets and the dataset-operator in the dlf namespace together fixed this issue for me
Actually, restarting the operator did not fix the issue for me. The only thing that fixes it is restarting the pod that uses the PVC created by the dataset operator. It would be better if the mount were reconciled when the daemonset or operator restarts; otherwise, each time we update the CSI provider to a new version, connectivity is lost on all pods.
PS: the issue happens with both the goofys and s3fs mounters
Scenario to reproduce (a sketch of these steps is below):
1. Create an S3 dataset
2. Create a Pod mounting the PVC created by the dataset
3. Restart the csi-s3 daemonset ==> Transport endpoint is not connected
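A minimal sketch of these steps, assuming the Dataset example from the Datashim README and a csi-s3 daemonset in the dlf namespace; the object names, endpoint, bucket and credentials are placeholders:

# 1. Create an S3 dataset (placeholder endpoint, bucket and credentials)
kubectl apply -f - <<'EOF'
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "ACCESS_KEY_ID"
    secretAccessKey: "SECRET_ACCESS_KEY"
    endpoint: "https://s3.example.com"
    bucket: "example-bucket"
EOF

# 2. Create a Pod mounting the PVC that Datashim creates for the dataset
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dataset-consumer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /mnt/dataset
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-dataset
EOF

# 3. Restart the csi-s3 daemonset, then try to use the mount
kubectl -n dlf rollout restart daemonset csi-s3
kubectl exec dataset-consumer -- ls /mnt/dataset
# ls: /mnt/dataset: Transport endpoint is not connected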
Attacher logs:
I0208 14:55:53.304459 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0208 14:59:03.311025 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.VolumeAttachment total 0 items received
I0208 15:02:23.307740 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0208 15:04:48.294177 1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0208 15:04:48.294421 1 controller.go:208] Started VA processing "csi-0cd1a70398bbe7c6ed68a5ed04b9fa487d8ace466600da1be96e21d78b656b6d"
I0208 15:04:48.294433 1 controller.go:223] Skipping VolumeAttachment csi-0cd1a70398bbe7c6ed68a5ed04b9fa487d8ace466600da1be96e21d78b656b6d for attacher blockvolume.csi.oraclecloud.com
I0208 15:04:48.294192 1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0208 15:04:48.294462 1 controller.go:208] Started VA processing "csi-463b3945e8dd840a016d75511db98296afbaba07fbdc54f71d60f3c448afcbde"
I0208 15:04:48.294467 1 controller.go:223] Skipping VolumeAttachment csi-463b3945e8dd840a016d75511db98296afbaba07fbdc54f71d60f3c448afcbde for attacher blockvolume.csi.oraclecloud.com
I0208 15:04:48.294454 1 controller.go:208] Started VA processing "csi-20b7e9e5d47a9eb0c1a350d27f8c7e27c04de6d83e95128189c8eafd0a923fe5"
I0208 15:04:48.294490 1 controller.go:208] Started VA processing "csi-27f260c4f75284c142c5f33aaa4d8ea8a985e82301bb80cfd76b31e8d9433db9"
Can someone please take a look at this?
Verified that this problem exists. To solve this, the CSI-S3 driver would need to be extended to support LIST_VOLUMES and LIST_VOLUMES_PUBLISHED_NODES so that the external attacher can periodically re-sync the volumes. A better option would be to support the external health monitor, but this may involve bumping dependencies to K8s 1.22+ (see #156) as well as extending the driver.
This will be a sizeable development, so I am not sure about the timelines yet.
Tried adding an extra argument --reconcile-sync=10s to the csi-attacher-s3 StatefulSet (a sketch of how to add the flag is below). This resolved the issue to some degree, though it still comes up when writing a large number of files consecutively to the same bucket (PVC).
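For reference, a sketch of how the flag can be appended, assuming the attacher is the first container of the csi-attacher-s3 StatefulSet in the dlf namespace (the container index and names may differ in your deployment):

# Append --reconcile-sync=10s to the attacher container's args
kubectl -n dlf patch statefulset csi-attacher-s3 --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--reconcile-sync=10s"}]'

# Alternatively: kubectl -n dlf edit statefulset csi-attacher-s3 and add
# --reconcile-sync=10s under the attacher container's args by hand.

--reconcile-sync is an external-attacher flag; a shorter interval makes the attacher reconcile VolumeAttachment objects more frequently.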
To solve this, the CSI-S3 driver would need to be extended to support LIST_VOLUMES and LIST_VOLUMES_PUBLISHED_NODES so that the external attacher can periodically re-sync the volumes
RPC_LIST_VOLUMES_PUBLISHED_NODES is officially not a solution :-) https://github.com/kubernetes-csi/external-attacher/issues/374#issuecomment-1250930471
@vitalif Thanks for researching this issue, though the answer is disappointing :-)
CSI-S3 (at least Datashim's fork) uses Bidirectional mount propagation, which has caused some issues, such as the need for privileged containers (#139), and is preventing full support for ephemeral volumes (#164). Unfortunately, we haven't been able to find a way around it yet.
If you do have a workaround, I'll be happy to look into it.
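For context, a rough sketch of what Bidirectional mount propagation looks like in a CSI node-plugin pod spec; the container name, image and paths are illustrative, not the exact Datashim manifest:

  containers:
  - name: csi-s3-node              # illustrative name
    image: example/csi-s3:latest   # placeholder image
    securityContext:
      privileged: true             # Bidirectional propagation requires a privileged container (#139)
    volumeMounts:
    - name: pods-mount-dir
      mountPath: /var/lib/kubelet/pods
      mountPropagation: Bidirectional  # FUSE mounts created in this container propagate back to the host and into workload pods
  volumes:
  - name: pods-mount-dir
    hostPath:
      path: /var/lib/kubelet/pods
      type: Directory

Because the FUSE processes (goofys/s3fs) live inside this container, restarting it kills them, which is why existing mounts go stale until something remounts them.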
Any update on when this will be resolved? We are also facing this issue, and very frequently. We are mounting S3 to 5-6 pods; whenever we do a read or write we get this error, and we have to restart frequently.
@rrehman-hbk Could I ask under what conditions you are getting errors for reads/writes from S3 buckets? This is a different problem from the one above. If you can create an issue and post the logs from your csi-s3 pods there, then I can take a look at them.
@srikumar003 Raised a separate issue: https://github.com/datashim-io/datashim/issues/324
Also of note in my case: if a livenessProbe kills the container, it cannot just pick up from where it left off; the whole pod must be destroyed. The CSI-S3 daemon reports the following:
I0211 18:58:01.984006 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
I0211 18:58:03.164542 1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0211 18:58:03.164743 1 utils.go:98] GRPC request: {"volume_id":"pvc-4c8d7779-a5cc-4d29-897b-f94e9ab6ca9b"}
I0211 18:58:03.164950 1 controllerserver.go:131] Deleting volume pvc-4c8d7779-a5cc-4d29-897b-f94e9ab6ca9b
E0211 18:58:03.165086 1 utils.go:101] GRPC error: failed to initialize S3 client: Endpoint: does not follow ip address or domain name standards.
If the pod is subsequently restarted, the mount succeeds and all is fine again.
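To illustrate the container-restart case, a probe along these lines (illustrative only, not necessarily the original poster's configuration) keeps failing once the FUSE mount breaks; kubelet then restarts only the container, the stale mount is never re-established, and only recreating the whole pod triggers a remount:

    livenessProbe:
      exec:
        command: ["ls", "/mnt/dataset"]  # fails with "Transport endpoint is not connected" once the mount is broken
      periodSeconds: 30
      failureThreshold: 3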
+1
+1 here, I think I am seeing the issue as well.
If for any reason the csi-s3 pod is restarted, the Pod that uses S3 volumes loses connectivity to the mount target and we get a
Transport endpoint is not connected
error. The error is resolved if we restart the pod that uses the volume, which forces the csi-s3 pod to remount it. I think that when csi-s3 restarts, it should check for existing volumes and remount them.
To reproduce this behaviour, just rollout restart the daemonset (see the sketch below). Could you please take a look?
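A short sketch of the reproduce/recover cycle described above, assuming the daemonset is named csi-s3 in the dlf namespace and a pod named dataset-consumer mounts the Dataset PVC at /mnt/dataset (all names are placeholders):

# Reproduce: restart the node-plugin daemonset
kubectl -n dlf rollout restart daemonset csi-s3

# The consuming pod now sees a stale FUSE mount
kubectl exec dataset-consumer -- ls /mnt/dataset
# ls: /mnt/dataset: Transport endpoint is not connected

# Current workaround: recreate the consuming pod so the volume is remounted,
# e.g. delete it and let its Deployment/StatefulSet re-create it
kubectl delete pod dataset-consumer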