datashim-io / datashim

A kubernetes based framework for hassle free handling of datasets
http://datashim-io.github.io/datashim

Transport endpoint is not connected when csi-s3 pod is restarted #153

Open bbenlazreg opened 2 years ago

bbenlazreg commented 2 years ago

If for any reason the csi-s3 pod is restarted, any Pod that uses S3 volumes loses connectivity to the mount target and we get a "Transport endpoint is not connected" error. The error is resolved if we restart the pod that uses the volume, which forces the csi-s3 pod to remount the volume.

I think that when csi-s3 restarts, it should check for existing volumes and remount them.

To reproduce this behaviour, just rollout-restart the daemonset (see the sketch below). Could you please take a look?
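A minimal sketch of the reproduction and the current workaround; the dlf namespace, the csi-s3 daemonset name, and the pod name nginx-s3 are assumptions for illustration and may differ in your deployment:

```sh
# Reproduce: restart the CSI plugin daemonset (assumed: daemonset "csi-s3" in namespace "dlf")
kubectl -n dlf rollout restart daemonset csi-s3

# Pods that already had the S3-backed PVC mounted now fail on access, e.g.:
#   ls: cannot access '/mnt/dataset': Transport endpoint is not connected

# Current workaround: delete the consuming pod (hypothetical name "nginx-s3") so the
# volume is staged and published again against the fresh csi-s3 pod.
kubectl delete pod nginx-s3
```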

raj-katonic commented 2 years ago

Facing the same issue

raj-katonic commented 2 years ago

Rolling out the daemonsets and the dataset-operator in the dlf namespace altogether fixed this issue for me.

bbenlazreg commented 2 years ago

Actually, restarting the operator did not fix the issue for me; the only thing that fixes it is restarting the pod that uses the PVC created by the dataset operator. It would be better if the mount were reconciled when the daemonset or operator restarts; otherwise, every time we update the CSI provider to a new version, connectivity will be lost on all pods.

PS: the issue happens with both the goofys and s3fs mounters.

Scenario to reproduce (manifests sketched below):

1. Create an S3 dataset
2. Create a Pod mounting the PVC created by the dataset
3. Restart the csi-s3 daemonset ==> Transport endpoint is not connected
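For steps 1 and 2, a hedged sketch of the manifests. The field names follow the Datashim examples, but the Dataset apiVersion/group and the placeholder endpoint, credentials, and bucket are assumptions that may differ per release; Datashim creates a PVC with the same name as the Dataset, which the test pod then mounts:

```sh
# Step 1: create an S3-backed Dataset (placeholder endpoint/credentials/bucket)
kubectl apply -f - <<'EOF'
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "ACCESS_KEY_ID"
    secretAccessKey: "SECRET_ACCESS_KEY"
    endpoint: "https://s3.example.com"
    bucket: "example-bucket"
    region: ""
EOF

# Step 2: create a pod that mounts the PVC created for the Dataset
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: s3-test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: dataset
      mountPath: /mnt/dataset
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: example-dataset
EOF

# Step 3: rollout-restart the csi-s3 daemonset (as in the earlier comment); reads under
# /mnt/dataset in s3-test-pod then fail with "Transport endpoint is not connected".
```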

Attacher logs:

I0208 14:55:53.304459 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0208 14:59:03.311025 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.VolumeAttachment total 0 items received
I0208 15:02:23.307740 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0208 15:04:48.294177 1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0208 15:04:48.294421 1 controller.go:208] Started VA processing "csi-0cd1a70398bbe7c6ed68a5ed04b9fa487d8ace466600da1be96e21d78b656b6d"
I0208 15:04:48.294433 1 controller.go:223] Skipping VolumeAttachment csi-0cd1a70398bbe7c6ed68a5ed04b9fa487d8ace466600da1be96e21d78b656b6d for attacher blockvolume.csi.oraclecloud.com
I0208 15:04:48.294192 1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0208 15:04:48.294462 1 controller.go:208] Started VA processing "csi-463b3945e8dd840a016d75511db98296afbaba07fbdc54f71d60f3c448afcbde"
I0208 15:04:48.294467 1 controller.go:223] Skipping VolumeAttachment csi-463b3945e8dd840a016d75511db98296afbaba07fbdc54f71d60f3c448afcbde for attacher blockvolume.csi.oraclecloud.com
I0208 15:04:48.294454 1 controller.go:208] Started VA processing "csi-20b7e9e5d47a9eb0c1a350d27f8c7e27c04de6d83e95128189c8eafd0a923fe5"
I0208 15:04:48.294490 1 controller.go:208] Started VA processing "csi-27f260c4f75284c142c5f33aaa4d8ea8a985e82301bb80cfd76b31e8d9433db9"

Can someone please take a look at this?

srikumar003 commented 2 years ago

Verified that this problem exists. To solve it, the CSI-S3 driver would need to be extended to support LIST_VOLUMES and LIST_VOLUMES_PUBLISHED_NODES so that the external attacher can periodically re-sync the volumes. A better option would be to support the external health monitor, but this may involve changing dependencies to K8s 1.22+ (see #156) as well as extending the driver.

This will be a sizeable development, so not sure about the timelines yet.

nikhil-das-katonic commented 2 years ago

Tried adding an extra argument, --reconcile-sync=10s, to the csi-attacher-s3 StatefulSet. This resolved the issue to some degree, though it still comes up when writing a large number of files consecutively to the same bucket (PVC).
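For reference, a sketch of one way to add that argument; the dlf namespace, the csi-attacher-s3 StatefulSet name, and container index 0 being the external-attacher are assumptions and should be checked against your install:

```sh
# Append --reconcile-sync=10s to the external-attacher's arguments so that it
# re-syncs VolumeAttachments with the driver more frequently.
kubectl -n dlf patch statefulset csi-attacher-s3 --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--reconcile-sync=10s"}]'
```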

vitalif commented 2 years ago

To solve this, the CSI-S3 driver would need to be extended to support LIST_VOLUMES and LIST_VOLUMES_PUBLISHED_NODES so that the external attacher can periodically re-sync the volumes

RPC_LIST_VOLUMES_PUBLISHED_NODES is officially not a solution :-) https://github.com/kubernetes-csi/external-attacher/issues/374#issuecomment-1250930471

srikumar003 commented 2 years ago

@vitalif Thanks for researching this issue, though the answer is disappointing :-)

CSI-S3 (at least Datashim's fork) uses Bidirectional mount propagation, which has caused some issues, such as the need for privileged containers (#139), and is preventing full support for ephemeral volumes (#164). Unfortunately, we haven't been able to find a way around it yet.

If you do have a workaround, I'll be happy to look into it.
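For context, a hedged way to confirm the Bidirectional mount propagation on a deployed plugin; the dlf namespace and the csi-s3 daemonset name are assumptions based on a default install:

```sh
# Show the Bidirectional mountPropagation on the plugin's volume mounts
# (daemonset/namespace names are assumptions).
kubectl -n dlf get daemonset csi-s3 -o yaml | grep -B 3 "mountPropagation: Bidirectional"
```

Because the FUSE mount is served by a process inside the csi-s3 container and only propagated to the host, a plugin restart leaves a mount point whose backing process is gone, which is why consuming pods see "Transport endpoint is not connected" until the volume is re-published.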

rrehman-hbk commented 11 months ago

Any update on when this will be resolved? We are also facing this issue, and very frequently. We are mounting S3 into 5-6 pods, and whenever we do some read or write we get this error and have to restart frequently.

srikumar003 commented 11 months ago

@rrehman-hbk Could I ask under what conditions you are getting errors for reads/writes from S3 buckets? This is a different problem from the one above. If you can create an issue and post the logs from your csi-s3 pods there, I can take a look at them.

rrehman-hbk commented 11 months ago

@srikumar003 I have raised a separate issue: https://github.com/datashim-io/datashim/issues/324

paullryan commented 9 months ago

Also of note in my case: if a livenessProbe kills the container, it cannot just pick up from where it left off; the whole pod must be destroyed. The CSI-S3 daemon reports the following:

I0211 18:58:01.984006       1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
I0211 18:58:03.164542       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0211 18:58:03.164743       1 utils.go:98] GRPC request: {"volume_id":"pvc-4c8d7779-a5cc-4d29-897b-f94e9ab6ca9b"}
I0211 18:58:03.164950       1 controllerserver.go:131] Deleting volume pvc-4c8d7779-a5cc-4d29-897b-f94e9ab6ca9b
E0211 18:58:03.165086       1 utils.go:101] GRPC error: failed to initialize S3 client: Endpoint:  does not follow ip address or domain name standards.

If the pod is subsequently restarted, the mount then succeeds and all is fine again.

4F2E4A2E commented 8 months ago

+1

ehsan310 commented 1 week ago

+1 here, I think I am seeing this issue as well.