buildbarn / bb-clientd

Buildbarn client-side FUSE/NFSv4 daemon
Apache License 2.0

Attempt to unmount FUSE on process exit #2

Closed · scele closed this 2 years ago

scele commented 2 years ago

Cleaning up the FUSE mount point on graceful termination can be beneficial. For example, when running bb_clientd under Kubernetes with mountPropagation=Bidirectional, it is advised that "any volume mounts created by containers in pods must be destroyed (unmounted) by the containers on termination" (https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation).

This also improves ergonomics: after a clean termination of bb_clientd (e.g. via Ctrl-C or "docker stop"), manually unmounting the file system is no longer required.
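
For context, the manual cleanup that is otherwise needed looks roughly like this (the mount path depends on the configuration; ~/bb_clientd is just an example):

```sh
# Manually detach a leftover bb_clientd FUSE mount; ~/bb_clientd is an example path.
fusermount -u ~/bb_clientd
```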

scele commented 2 years ago

Accompanying bb-remote-execution PR: https://github.com/buildbarn/bb-remote-execution/pull/95

My use case is that I'm trying to run bb-clientd as a sidecar container in k8s, and trying to make it "well-behaved" with respect to mount propagation. I guess this PR is not strictly sufficient, since the mount would not be cleaned up if bb-clientd crashes, but I thought it might still be a worthwhile UX improvement...? Let me know what you think.

EdSchouten commented 2 years ago

Hi @scele!

Even though I sympathise with the change, I have also observed that a change like this is not sufficient. It is indeed the case that kubelet will lock up if you leave a mount behind in an emptyDir that's shared between containers. This logic only takes care of the graceful case; as you mentioned, in the case of crashes/involuntary terminations it still goes wrong. In my opinion we should file a bug against Kubernetes for that.

That said, I can share with you that if you stick to the following rules, Kubernetes won't lock up:

In addition to that, apply the following rule to make sure you don't end up with stale FUSE mount entries in the host's fstab (i.e., outside Kubernetes):

I'd rather not make a change like this, as there is no way we can get it perfect on the bb_clientd side. In my opinion it should be up to the execution environment/launch service/... to clean up any straggling resources, as it's the only piece of software that can get it right.
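
To make that concrete: whatever launches bb_clientd (an init script, a wrapper baked into the container image, and so on) could for example detach any stale mount left behind by a previous crash before starting a new instance. A rough, untested sketch with an illustrative mount path:

```sh
# Launcher-side cleanup (sketch): detach a stale FUSE mount left behind by a
# crashed bb_clientd before starting a new instance. Paths are illustrative.
MOUNT_PATH="/run/bb_clientd_mounts/fusemount"
if grep -qs " ${MOUNT_PATH} " /proc/mounts; then
  fusermount -u -z "$MOUNT_PATH" || umount -l "$MOUNT_PATH"
fi
exec bb_clientd /configs/bb_clientd.jsonnet
```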

scele commented 2 years ago

Thanks @EdSchouten for these pointers and quick response!

Hmm.. trying to understand how this would work.

  1. I'm trying to have a (Jenkins) Pod that has bazel in one container and bb-clientd in another container. By using a hostPath, how would it work if I have multiple Pods on one Node? Would I need to include the Pod name in the FUSE mount path, something like
    fuse: {
      mountPath: "/run/bb_clientd_mounts/$POD_NAME/fusemount"
    }

    with a hostPath mount of /run/bb_clientd_mounts:/run/bb_clientd_mounts in each Pod? Or do you only have experience with one-pod-per-node configurations?

  2. Something still needs to call unmount even when using a hostPath volume, right? Taking it one layer up from bb-clientd, my next attempt would be to adjust the bb-clientd container CMD to something like `trap EXIT { umount /run/bb_clientd_mounts/$POD_NAME/fusemount } && bb-clientd /configs/bb_clientd.jsonnet` (roughly as sketched after this list), so that the unmount happens even if bb-clientd crashes. That still would not work if somebody forcibly kills the whole container, but I don't see how to do better than that...?
  3. I still need to use mountPropagation=Bidirectional together with hostPath mounts, right?
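
Concretely, I'm imagining a wrapper script along these lines (untested sketch; the mount path, config path and $POD_NAME environment variable are just examples):

```sh
#!/bin/sh
# Untested sketch of a container CMD wrapper: unmount the FUSE file system
# whenever this script exits, even if bb-clientd itself crashed.
MOUNT_PATH="/run/bb_clientd_mounts/${POD_NAME}/fusemount"

cleanup() {
  # Best-effort, lazy unmount so a hung FUSE daemon doesn't block us.
  fusermount -u -z "$MOUNT_PATH" 2>/dev/null || umount -l "$MOUNT_PATH" 2>/dev/null || true
}
trap cleanup EXIT

# Run bb-clientd in the background so the shell can forward SIGTERM/SIGINT to it.
bb-clientd /configs/bb_clientd.jsonnet &
child=$!
trap 'kill -TERM "$child" 2>/dev/null; wait "$child"' INT TERM
wait "$child"
```

Running bb-clientd in the background and wait-ing on it lets the shell (PID 1 in the container) receive the termination signal from e.g. "docker stop", forward it to bb-clientd, and still run the EXIT trap afterwards.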

Don't worry if you don't have answers to these, I can experiment myself. Just asking in case you have experience with a working configuration.

EdSchouten commented 2 years ago

Re. questions 1+2:

With regards to multiple pods on one node, I have no experience with that yet. Using the pod name could work, but that has the downside that your bb_clientd_mounts directory fills up over time. It would be nice if Kubernetes had some kind of node-local number pool/allocation feature, so that the number of directories you'd need to create remains bounded. That would also mean you don't need to care too much about doing unmounts.

Wild idea: Kubernetes already has this feature:

https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-for-extended-resources

Would there be some way for kubelet to report which one of those got allocated?

With regards to question 3: Yes, you still need to use mount propagation.

That said, I would still suggest talking to the Kubernetes developers to ask why kubelet is incapable of cleaning up stale mounts. That would remove the need for hostPath entirely.

scele commented 2 years ago

Thank you Ed for these suggestions! Closing this PR.