ccremer / kubernetes-zfs-provisioner

Dynamic ZFS persistent volume provisioner for Kubernetes
Apache License 2.0

Automatically use hostpath if pod is on same host as zfs pool #85

Open morganchristiansson opened 1 year ago

morganchristiansson commented 1 year ago

This has been bugging me.

I'm currently using hostpath-provisioner and have /nfs/hostpath on all nodes. It's an NFS mount on every node except the ZFS node, and it works elegantly. But I would rather the provisioner ran `zfs create` for every PV/PVC.

Not sure if performance is worse when using NFS on localhost, but nonetheless it would be nice if it automatically switched to hostpath.

Maybe using type: auto (default?)

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
parameters:
  type: auto
```
ccremer commented 1 year ago

Hi. I'm not sure I understand your problem fully. Am I reading correctly that you have a single ZFS host that is also part of the cluster as a worker node, and all other worker nodes don't have ZFS but instead mount the NFS export that the single ZFS host exposes?

In general, the provisioner has no idea of any pods that might or might not use a PVC. It doesn't know which node a pod gets scheduled on; rather, it creates PVs (when PVCs get created) that may add restrictions on where pods can even be scheduled to begin with (the case with hostpath).
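To illustrate what I mean by restrictions, here's a rough Go sketch (not this project's actual code; the function name, size and labels are placeholders) of the kind of PV object a hostpath-style provisioner hands back:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostPathPV builds a PV the way a hostpath-style provisioner might:
// the volume itself pins pods to the node that owns the path.
func hostPathPV(pvName, nodeName, path string) *v1.PersistentVolume {
	return &v1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: pvName},
		Spec: v1.PersistentVolumeSpec{
			Capacity: v1.ResourceList{
				v1.ResourceStorage: resource.MustParse("1Gi"), // size is a placeholder
			},
			AccessModes: []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			PersistentVolumeSource: v1.PersistentVolumeSource{
				HostPath: &v1.HostPathVolumeSource{Path: path},
			},
			// This is the restriction: pods using this PV can only be
			// scheduled onto nodeName.
			NodeAffinity: &v1.VolumeNodeAffinity{
				Required: &v1.NodeSelector{
					NodeSelectorTerms: []v1.NodeSelectorTerm{{
						MatchExpressions: []v1.NodeSelectorRequirement{{
							Key:      "kubernetes.io/hostname",
							Operator: v1.NodeSelectorOpIn,
							Values:   []string{nodeName},
						}},
					}},
				},
			},
		},
	}
}
```

Kubernetes itself then enforces that node affinity when scheduling any pod that uses the claim.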

I don't know if there's a performance impact if you mount a volume over NFS instead of bind-mounting it on the same machine. You'd have to benchmark it yourself with your application.

Why is it not good enough to just create a storage class of type `nfs`? Is it just the performance concern?

morganchristiansson commented 1 year ago

Yes, you understand correctly. The ZFS host is the Kubernetes master and also runs pods. The workers are diskless Raspberry Pis using NFS.

I guess it would be specific to pods mounting the PV/PVC; creation shouldn't be different depending on type.

Yes, just performance, and it's maybe cleaner to mount directly without NFS.

ccremer commented 1 year ago

Thanks for the explanation. Unfortunately, I don't think that's possible. A PVC doesn't provide information about specific nodes, so the provisioner can't really determine whether a volume should be hostpath, NFS or something else; it will have a hard time differentiating. Sure, PVCs can have annotations about the node they're supposed to be provisioned on, but then you might as well use a different storage class. If the mount approach is a concern for you, you're left with having a specific PVC and deployment that are only schedulable on the master, while the other pods use the NFS type.
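To sketch that workaround in Go (all names and the image are placeholders, nothing from this project): the one pod that needs the hostpath-class volume gets pinned to the ZFS master via a nodeSelector, while every other pod simply uses PVCs from the NFS class and schedules anywhere.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinnedPod shows the workaround: the pod using the hostpath-class PVC is
// pinned to the ZFS master, so the bind mount is always local.
func pinnedPod(masterName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "app-on-zfs-host"}, // hypothetical name
		Spec: v1.PodSpec{
			// Pin to the ZFS master by hostname label.
			NodeSelector: map[string]string{"kubernetes.io/hostname": masterName},
			Containers: []v1.Container{{
				Name:  "app",
				Image: "example/app:latest", // placeholder image
				VolumeMounts: []v1.VolumeMount{{
					Name:      "data",
					MountPath: "/data",
				}},
			}},
			Volumes: []v1.Volume{{
				Name: "data",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
						ClaimName: "data-hostpath", // PVC from the hostpath storage class
					},
				},
			}},
		},
	}
}
```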

Besides, I'm a bit hesitant to implement a feature out of a performance concern when I don't have any actual numbers. Please try it out with NFS on the same machine. It's possible that we are talking about a non-issue from a practical PoV.

morganchristiansson commented 1 year ago

Fair enough.

Maybe the info is available at the point where the pod is mounting the PVC? At PVC creation time it wouldn't be available, agreed.

Some quick googling suggests there have been problems, but it may be working fine now. I'll need to test it and come back to be sure...

Quote (from https://www.suse.com/support/kb/doc/?id=000018709):

> Traditionally, the practice of nfs loopback mounting has not been recommended or supported in any Linux environment. There are known problems with nfs loopback mounts. The problems deal with deadlocks which can occur due to conflicts that arise between memory allocation, memory freeing, and memory write out. Because of the potential for deadlocks, loopback mounting has been generally considered unsupported by all of the Linux community.
>
> On SLES 12 and 15, improvements to NFSv3 allow loopback mounts to be supported. Note that this support does not apply to NFSv4.

ccremer commented 1 year ago

So I was quickly looking at the code again to see available options. It seems the provision controller library may actually pass the scheduled node:

https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/a2f2cebc05acc2a003772096c46965cf7ad2ee4e/controller/volume.go#L115-L133
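From memory, the relevant part of that struct looks roughly like this (the pinned commit above is the authoritative version):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	storage "k8s.io/api/storage/v1"
)

// ProvisionOptions, paraphrased from the linked volume.go: the options the
// library hands to a provisioner's Provision call.
type ProvisionOptions struct {
	// StorageClass is the storage class the PVC requested.
	StorageClass *storage.StorageClass
	// PVName is the name the provisioned PV should get.
	PVName string
	// PVC is the claim being provisioned for.
	PVC *v1.PersistentVolumeClaim
	// SelectedNode is only set when the storage class uses
	// volumeBindingMode: WaitForFirstConsumer, i.e. when provisioning is
	// delayed until a pod using the claim has been scheduled.
	SelectedNode *v1.Node
}
```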

I'm not sure how that's going to help in practice, but it may be worth testing out.

morganchristiansson commented 1 year ago

I did not realise that no code runs when creating new pods. Interesting. I found createVolumeSource() and see that there is no code execution at the point of a pod mounting the PVC: https://github.com/ccremer/kubernetes-zfs-provisioner/blob/master/pkg/provisioner/provision.go#L84-L105

I'm using a simple solution where I have /nfs/hostpath mounted on all nodes in my cluster and am using https://github.com/rimusz/hostpath-provisioner/ (based on sig-storage-lib-external-provisioner/examples/hostpath-provisioner), which creates a directory per PV under a root path and uses hostPath. But I want `zfs create` to make a dataset per PV, for stats, snapshotting and management. A similar solution could work here: mount parentDataset on every node and then use hostPath for all PVs? If it fits with this project..

Wish I could suggest more... :smile:

ccremer commented 1 year ago

> I did not realise that no code runs when creating new pods.

Yes, that's what I tried to explain in earlier comments, but apparently I didn't do a good job :)

> A similar solution could work here by mounting parentDataset on every node and then use hostPath for all PVs

From what I know, this one doesn't fit. It means that the provisioner has to connect to the node and mount the parentDataset via NFS first, and then provide a VolumeSource that is hostPath. Kubernetes would be "unaware" that the path is actually mounted via NFS. If, for any reason after a reboot, the NFS mount cannot be re-established, Kubernetes just creates the hostPath directory anew and you start with an empty directory...

This is a mechanism that I honestly don't want to maintain in this project with my already limited spare time. There are just too many moving parts.

The only thing I could consider is my earlier suggestion: check if we have the node information in the options, and if it matches the node configured in the storage class, return a hostPath instead of an NFS volume source.
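An untested sketch of that idea, given the SelectedNode from the library's ProvisionOptions (the parameter names `nodeName`, `nfsHost` and `path` are hypothetical; they would come from the storage class parameters):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// pickVolumeSource returns a hostPath source when the pod was scheduled
// onto the ZFS host itself, and an NFS source otherwise.
func pickVolumeSource(selectedNode *v1.Node, nodeName, nfsHost, path string) v1.PersistentVolumeSource {
	// SelectedNode is nil unless the storage class delays binding
	// (volumeBindingMode: WaitForFirstConsumer).
	if selectedNode != nil && selectedNode.Name == nodeName {
		// Pod landed on the ZFS host: bind-mount the dataset directly.
		return v1.PersistentVolumeSource{
			HostPath: &v1.HostPathVolumeSource{Path: path},
		}
	}
	// Any other node (or no node info): mount the NFS export.
	return v1.PersistentVolumeSource{
		NFS: &v1.NFSVolumeSource{Server: nfsHost, Path: path},
	}
}
```

Note that this would only ever pick hostPath if the storage class also uses volumeBindingMode: WaitForFirstConsumer, since SelectedNode stays nil otherwise.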