Closed scele closed 2 years ago
Accompanying bb-remote-execution PR: https://github.com/buildbarn/bb-remote-execution/pull/95
My use case is that I'm trying to run bb-clientd as a sidecar container in k8s, and trying to make it "well-behaved" for mount propagation. I guess this PR is not strictly sufficient, since the mount would not be cleaned up if bb-clientd crashes, but thought it might still be a worthwhile UX improvement...? Let me know what you think.
Hi @scele!
Even though I sympathise to the change, I have also observed that a change like this is not sufficient. It is indeed the case that kubelet will lock up if you leave a mount behind in an emptyDir
that's shared between containers. This logic will only take care of the graceful case, but as you mentioned, in case of crashes/involuntary terminations, it still goes wrong. In my opinion we should file a bug against Kubernetes for that.
That said, I can share with you that if you stick to the following rules, Kubernetes won't lock up:
hostPath
instead of an emptyDir
to hold the FUSE mount.hostPath
volume; don't place it at the root.subPath
, not on the bb_clientd side, nor the Bazel side.In addition to that, apply the following rule to make sure you don't end up with stale FUSE mount entries in the host's fstab (i.e., outside Kubernetes):
hostPath
identical to the original path on the host file system.I'd rather not make a change like this, as there is no way we can get it perfect on the bb_clientd side. In my opinion it should be up to the execution environment/launch service/... to clean up any straggling resources, as it's the only piece of software that can get it right.
Thanks @EdSchouten for these pointers and quick response!
Hmm.. trying to understand how this would work.
hostPath
, how would it work if I have multiple Pods on one Node? Would I need to include the Pod name in the FUSE mount path, something like
fuse: {
mountPath: "/run/bb_clientd_mounts/$POD_NAME/fusemount"
}
with a hostPath
mount of /run/bb_clientd_mounts:/run/bb_clientd_mounts
mounted to each Pod? Or do you only have experience from one-pod-per-node configurations?
unmount
even when using hostPath
volume, right? Taking it one layer up from bb-clientd, my next attempt would be to adjust the bb-clientd container CMD
to be something like trap EXIT { umount /run/bb_clientd_mounts/$POD_NAME/fusemount } && bb-clientd /configs/bb_clientd.jsonnet
so that the unmount would happen even if bb-clientd crashes. That still would not work if somebody forcibly kills the whole container, but I don't see how to do better than that...?mountPropagation=Bidirectional
together with hostPath
mounts, right?Don't worry if you don't have answers to these, I can experiment myself. Just in case you have experience about some working configuration.
Re. questions 1+2:
With regards to multiple pods on one node, I have no experience with that yet. Using the pod name could work, but that has the downside that your bb_clientd_mounts directory fills up over time. It would be nice if Kubernetes had some kind of node-local number pool/allocation feature, so that the number of directories you'd need to create remains bounded. That also means that you don't need to care too much about doing unmounts.
Wild idea: Kubernetes already has this feature:
https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-for-extended-resources
Would there be some way for kubelet to report with n'th of those got allocated?
With regards to question 3: Yes, you still need to use mount propagation.
That said, I would still suggest talking to the Kubernetes developers to ask them why Kubelet is incapable of cleaning up stale mounts. That would prevent the need for hostPath
entirely.
Thank you Ed for these suggestions! Closing this PR.
Cleaning up the FUSE mount point on graceful termination can be beneficial. For example, when running bb_clientd under Kubernetes with mountPropagation=Bidirectional, it is advised that "any volume mounts created by containers in pods must be destroyed (unmounted) by the containers on termination" (https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation).
This also improves user ergonomics, since manual unmounting after clean (e.g. "Ctrl-C" or "docker stop") termination of bb_clientd is no longer required.