freegroup / kube-s3

Kubernetes pods use shared S3 storage
MIT License

Crash loop because of "Transport endpoint is not connected", umount -l #10

Open guysoft opened 3 years ago

guysoft commented 3 years ago

Hey, sometimes I hit an issue that puts the pod in CrashLoopBackOff. The workaround is to ssh to the node and run umount -l (lazy unmount), then delete the pod and let it get recreated. During that time the mount is down.
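For reference, the manual recovery looks roughly like this (node and pod names are placeholders; the path is from my setup):

    ssh <node>
    sudo umount -l /mnt/data-s3-fs/root      # lazy unmount of the dead s3fs mount
    exit
    kubectl delete pod <s3-provider-pod>     # the DaemonSet recreates it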

Debugging results:

freegroup commented 3 years ago

I already unmount in a "preStop" hook. Does your deployment contain this step as well?

          preStop:
            exec:
              command: ["/bin/sh","-c","umount -f /var/s3"]
guysoft commented 3 years ago

Yes, it looks like this here:

          preStop:
            exec:
              command: ["bash", "-c", "umount -f /srv/website-eu/root"]

BTW, I use a sub-folder, because that way the pods don't lose the transport connection when you re-mount: only the sub-folder changes, not the root they are mapping.
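As a rough sketch of that layout (names and paths are just examples, and it assumes mount propagation from the host is in place), the consumer pods map the parent directory and read the sub-folder inside it:

        volumeMounts:
        - name: s3-data
          mountPath: /data               # the app reads /data/root/..., which survives a re-mount
      volumes:
      - name: s3-data
        hostPath:
          path: /srv/website-eu          # the parent directory, not the s3fs mount point itself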

I guess the hook should try something like

["bash", "-c", "umount -f /srv/website-eu/root || umount -l /srv/website-eu/root"]

I am not sure if that syntax works. I can test it unless you have a better suggestion.
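Spelled out as a full lifecycle hook it would be something like this (untested; the || runs the lazy unmount only when the forced one fails):

          lifecycle:
            preStop:
              exec:
                command: ["bash", "-c", "umount -f /srv/website-eu/root || umount -l /srv/website-eu/root"]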

guysoft commented 3 years ago

Found someone getting this on stackoverflow too: https://stackoverflow.com/questions/64710309/error-transport-endpoint-is-not-connected-while-using-s3fs-with-kubernetes-w

guysoft commented 3 years ago

OK, I think I solved it. It seems like sometimes the pod loses some kind of connection, resulting in "Transport endpoint is not connected". The workaround I found is to add an init container that lazily unmounts the folder before the main container starts. That seems to fix the issue. Will let it run and see if it comes back:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: s3-provider
  name: s3-provider
spec:
  selector:
    matchLabels:
      app: s3-provider
  template:
    metadata:
      labels:
        app: s3-provider
    spec:
      initContainers:
      - name: init-myservice
        image: bash
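        # lazily unmount any stale s3fs mount left by a crashed pod;
        # the trailing "; true" keeps the init container from failing when nothing is mounted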
        command: ['bash', '-c', 'umount -l /mnt/data-s3-fs/root ; true']
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        # use ALL entries in the config map as environment variables
        envFrom:
        - configMapRef:
            name: s3-config
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: mntdatas3fs-init
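          # note: the ":shared" suffix (here and on the s3fs mount below) is the old
          # Docker-style mount-propagation trick; newer Kubernetes rejects ":" in
          # mountPath and expects "mountPropagation: Bidirectional" instead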
          mountPath: /mnt:shared
      containers:
      - name: s3fuse
        image: 963341077747.dkr.ecr.us-east-1.amazonaws.com/kube-s3:1.0
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command: ["bash", "-c", "umount -f /srv/s3-mount/root"]
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        # use ALL entries in the config map as environment variables
        envFrom:
        - configMapRef:
            name: s3-config
        env:
        - name: S3_BUCKET
          value: s3-mount
        - name: MNT_POINT
          value: /srv/s3-mount/root
        - name: IAM_ROLE
          value: none
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: mntdatas3fs
          mountPath: /srv/s3-mount/root:shared
      volumes:
      - name: devfuse
        hostPath:
          path: /dev/fuse
      - name: mntdatas3fs
        hostPath:
          type: DirectoryOrCreate
          path: /mnt/data-s3-fs/root
      - name: mntdatas3fs-init
        hostPath:
          type: DirectoryOrCreate
          path: /mnt
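To roll it out and watch the pods recover (the manifest name is just what I saved it as):

    kubectl apply -f s3-daemonset.yaml
    kubectl get pods -l app=s3-provider -w   # watch until the CrashLoopBackOff clears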
gaul commented 3 years ago

Transport endpoint is not connected

This usually means that s3fs exited unexpectedly. I would check whether the process is still running. If not, it would help to gather the logs or attach gdb before the crash to get a backtrace.
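Roughly like this (pod name is a placeholder, and it assumes gdb and pidof exist in the image):

    kubectl exec <s3-provider-pod> -- ps aux | grep s3fs                    # is the s3fs process still alive?
    kubectl logs <s3-provider-pod>                                          # collect whatever it printed
    kubectl exec -it <s3-provider-pod> -- sh -c 'gdb -p "$(pidof s3fs)"'    # attach for a backtrace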

guysoft commented 3 years ago

I saw no logs, both in describe and in the container logs. It might be traceable with gdb, but I found no way to reproduce this other than waiting for it to happen. It's also hard to keep this kind of process under a tracer just before it crashes, because I don't know how to trigger the crash.

fenwuyaoji commented 1 year ago

I think most crashes are caused by resource contention, like lack of CPU or memory. Something like the following should be added to the YAML:

        resources:
          limits:
            cpu: "2"
            memory: 8Gi
          requests:
            cpu: "1"
            memory: 4Gi