ehough / docker-nfs-server

A lightweight, robust, flexible, and containerized NFS server.
https://hub.docker.com/r/erichough/nfs-server/
GNU General Public License v3.0

Getting "Stale file handle" in kubernetes deployment #25

Open flavioschuindt opened 5 years ago

flavioschuindt commented 5 years ago

Hi,

I've been using this image for a while in a Kubernetes deployment and it's working fine. I can create my NFS pod and connect to it from other pods.

However, there is one specific situation that causes trouble. When the Kubernetes node that hosts the NFS server pod starts to run out of disk space, i.e., it reaches 85% usage or more, Kubernetes starts to evict pods. This is fine and is normal Kubernetes behaviour: it evicts the pods and keeps trying to reschedule new ones. Once I clean up files on the node to free space, all the pods become stable again and everything returns to the Running (i.e. normal) state. However, after that, the NFS share starts to return Stale file handle errors in the shared folders that the other pods mount from the NFS pod.

Any idea why it is failing? What I understood from searching this issue on the internet is that a stale NFS file handle indicates that the client has a file open, but the server no longer recognizes the file handle (https://serverfault.com/questions/617610/stale-nfs-file-handle-after-reboot). Shouldn't the NFS server container itself recover from this situation? It's also important to note that the IPs in this case don't change, even after the pod eviction, because they are backed by Kubernetes Services.

ehough commented 5 years ago

Thanks for your question. I'm not really sure if this is a bug in the image or just normal behavior of the NFS protocol. Neither the NFSv4 nor the NFSv3 RFC goes into much detail about stale file handles, other than to say (in RFC 1813) that the error means:

The file referred to by that file handle no longer exists or access to it has been revoked.

I'm also not intimately familiar with Kubernetes, so you might have to help me a bit. But when the kubelet evicts the pod, is it safe to assume that the NFS server container (i.e. this image) is stopped? And then later (re)started after space is freed up?

Do you get this error message after the server is back up again? Or during the outage? Both?

My hunch is that, unfortunately, the only workaround will be to gracefully detect this error on your clients and re-mount the shares :/ But I'm not giving up yet!
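
As a sketch of what I mean (untested, and the mount path /mnt/nfs is just an example): an exec liveness probe on the client pods that lists the mount would at least detect the condition automatically and have kubelet restart the container. Whether a container restart alone clears the stale handle depends on how the volume is mounted, so a full pod recreation may still be needed.

# Hypothetical probe for the client containers (spec.containers[].livenessProbe);
# `ls` returns non-zero on a stale mount, so kubelet restarts the container.
livenessProbe:
  exec:
    command: ["ls", "/mnt/nfs"]
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3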

flavioschuindt commented 5 years ago

Thanks for the answer, @ehough!

Ok, here is how it works in Kubernetes: by default, the cluster is set up so that if the available free storage space on a node drops below 15%, kubelet starts to evict pods. In a nutshell, kubelet periodically monitors each and every pod in the cluster. When it finds a starved resource (storage in this case), a DiskPressure condition is raised and kubelet starts evicting pods to reclaim that resource.
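
For reference, that threshold lives in the kubelet configuration; roughly it is expressed like this (a sketch only, the 15% value mirrors the figure above and is not necessarily your cluster's default):

# Sketch of a kubelet config file with a hard eviction threshold on node disk space.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "15%"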

Answering your question: yes, the pod (and consequently its container) transitions to the Failed pod phase, which means it is terminated. The pod is not automatically restored after the space is freed up; kubelet keeps trying to restart it until the pod gets a chance to reach the Running (i.e. normal) state again, which will probably only happen once you free up space yourself.

I received this error during the outage. After I free the space and the pods are no longer being evicted, I see all of them Running normally, but if I go into any of them and do a simple ls inside the NFS volume mounted in the container, I still get the Stale file handle error. The mount seems to be stuck in an unknown state. The only way to get it back, once the outage is over and space is available again, is to delete the NFS pod and all the pods that connect to it with a kubectl delete pod command. After that, all the pods can read/write properly over NFS and things work. But it's manual work, and I would like to understand the reason behind this behaviour.

shinebayar-g commented 4 years ago

I'd suggest against using NFS inside a k8s cluster. When adding or removing nodes, the NFS server may get scheduled to another node. At that point, pods that were using NFS will no longer work properly, since the NFS server is gone, and they will get stuck in the "Terminating" state if you try to delete them. You can get away with NFS inside k8s as long as you don't add/remove nodes or upgrade the k8s version, or if you bind the NFS server to a specific node, but even then, when the NFS server gets restarted, all of your pods will behave abnormally. At that point your only choice will be to tear everything down and rebuild the deployments from scratch.

This is my experience from a recent incident that caused one hour of downtime in my production environment... Right now I'm moving my NFS server to a dedicated node.

flavioschuindt commented 4 years ago

Hi @shinebayar-g, do you have any idea other than NFS for sharing a volume between two (or more) pods in k8s?

shinebayar-g commented 4 years ago

I think the only option to achieve a ReadWriteMany volume is NFS at the moment (at least the only free one?).

flavioschuindt commented 4 years ago

Yeah, I researched this and came to the same conclusion. You can actually use the same PVC for two different pods. If, by luck, they get scheduled on the same node and share the same PVC, then only one volume is created and shared. But if the scheduler sends each one to a different node, you end up with the same PVC but different volumes. Other than that, I was thinking about a storage cluster etc., but I'm not sure that would actually solve sharing between different pods in the cluster.
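
Just to make concrete what I mean by sharing through NFS, here is a sketch (names and the server IP are placeholders, not my actual manifests) of a ReadWriteMany PersistentVolume backed by the NFS server, claimed once and then mounted by pods on different nodes:

# Sketch only: static RWX PersistentVolume pointing at the NFS Service,
# plus the single PVC that all sharing pods reference.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.10        # placeholder: ClusterIP of the NFS Service
    path: /
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""       # bind statically to the PV above
  volumeName: shared-nfs-pv
  resources:
    requests:
      storage: 10Gi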

marxangels commented 4 years ago

Hello, I used this image to deploy a single-pod NFSv4 service to share a volume among multiple k8s nodes, and it works well so far.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs4-web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs4-web
  template:
    metadata:
      labels:
        app: nfs4-web
    spec:
      nodeSelector:
        nfs4-web: yes1        # pin the NFS server pod to the node carrying this label
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: disk-web
      containers:
      - name: server
        image: erichough/nfs-server
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true    # the image needs privileged mode (or at least CAP_SYS_ADMIN)
        volumeMounts:
          - mountPath: /data
            name: data
        env:
          - name: NFS_DISABLE_VERSION_3
            value: yes1
          - name: NFS_LOG_LEVEL
            value: DEBUG
          - name: NFS_SERVER_THREAD_COUNT
            value: "6"
          - name: NFS_EXPORT_0
            value: /data *(rw,sync,fsid=0,crossmnt,no_subtree_check,no_root_squash)

---
apiVersion: v1
kind: Service
metadata:
  name: nfs4-web
spec:
  selector:
    app: nfs4-web
  type: ClusterIP
  clusterIP: 10.245.249.249   # fixed ClusterIP so clients always mount the same address
  ports:
    - port: 2049
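
On the client side, other pods mount this export through the fixed ClusterIP. A minimal sketch (pod name and image are placeholders, and depending on your nodes' mount defaults you may need to force NFSv4 options via a PersistentVolume instead):

# Example client pod mounting the NFSv4 root export via the Service IP above.
apiVersion: v1
kind: Pod
metadata:
  name: nfs4-client-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: shared
          mountPath: /mnt/data
  volumes:
    - name: shared
      nfs:
        server: 10.245.249.249   # fixed ClusterIP of the nfs4-web Service
        path: /                  # root export (fsid=0)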

What if we change the replicas of the k8s Deployment from 1 to 3, in other words run multiple NFS server pods exporting the same directory? Will this bring any unexpected problems?

shinebayar-g commented 4 years ago

@cpu100 I think you're fine. Multiple replicas and traditional multiple network-mounted users are logically the same. Actually, NFS is the only storage provider that enables you to scale up the replica count.