jjkr opened 9 months ago
We have the same issue and attempted to work around it with an init container that creates the cache directory on the node, as in the following example. (I didn't include the PV config here, but it was configured to use /tmp/s3-cache as the cache directory.)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  initContainers:
    - name: create-cache-dir
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "mkdir -p /cache-dir/s3-cache; echo 'hi' > /cache-dir/s3-cache/test.txt"]
      volumeMounts:
        - name: cache-location
          mountPath: /cache-dir
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "ls -lR /data; sleep 99"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-pvc
    - name: cache-location
      hostPath:
        path: /tmp/
```
This example DOES NOT work: Kubernetes attempts to mount the S3 volume before the init container runs.
Based on the example here: https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/examples/kubernetes/static_provisioning/caching.yaml
I worked around this issue by using a hostPath mount to create the directory on the host if it doesn't already exist.
Regardless of the order of volumeMounts or volumes, Kubernetes will automatically retry until everything is mounted, but I put the hostPath mount before the PVC mount in case mounts happen in the order specified. From my testing, the pod comes up immediately.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "echo 'Hello from the container!' >> /data/$(date -u).txt; tail -f /dev/null"]
      volumeMounts:
        - name: cache-location
          mountPath: /tmp/pv
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: cache-location
      hostPath:
        path: /tmp/s3-pv1-cache
        type: DirectoryOrCreate
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim
```
I'm working around this using a k8s job.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-cache-create
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: busybox
          image: busybox
          command:
            - mkdir
            - "-p"
            - /host/var/tmp/s3-cache
          volumeMounts:
            - name: host-var-tmp
              mountPath: /host/var/tmp
      volumes:
        - name: host-var-tmp
          hostPath:
            path: /var/tmp
      restartPolicy: Never
```
A job per volume is needed - and you should modify the path so that it is unique per volume.
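If you script the Job creation, one way to keep the path unique per volume is to derive it deterministically from the bucket name. This is a sketch of my own, not part of the driver; the `/var/tmp/s3-cache-*` naming scheme and the `bucket` value are made-up examples:

```shell
#!/bin/sh
# Sketch: derive a unique, deterministic host cache path per volume.
# The naming scheme below is a hypothetical convention, not a driver feature.
bucket="my-review-env-bucket"
suffix=$(printf '%s' "$bucket" | sha256sum | cut -c1-12)
cache_dir="/var/tmp/s3-cache-${suffix}"
mkdir -p "$cache_dir"
echo "$cache_dir"
```

The same derivation can then be templated into both the Job's `mkdir` argument and the PV's cache mount option, so the two never drift apart.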
This worked for me
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: s3-cache-dir-setup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: s3-cache-dir-setup
  template:
    metadata:
      labels:
        app: s3-cache-dir-setup
    spec:
      initContainers:
        - name: create-s3-cache-dir
          image: busybox
          command:
            - sh
            - -c
            - |
              mkdir -p /tmp/s3-local-cache && \
              chmod 0700 /tmp/s3-local-cache
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-mount
              mountPath: /tmp/s3-local-cache
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.1
      volumes:
        - name: host-mount
          hostPath:
            path: /tmp/s3-local-cache
```
From the documentation:

> The cache directory is not reusable by other Mountpoint processes and will be cleaned at mount time and exit. When running multiple Mountpoint processes concurrently on the same host, you should use unique cache directories to avoid different processes interfering with the others' cache content.
If this is the case, we'd need a unique cache directory per pod - say, when more than one pod of the same deployment is scheduled on the same node. None of the workarounds suggested above supports this scenario.
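One partial mitigation for the directory side of this (an untested sketch of mine, not from the driver's docs) is to mount the hostPath with `subPathExpr` keyed on the pod name via the downward API, so each pod gets its own host directory. Note the caveat: the cache path Mountpoint actually uses is still the one fixed in the PV's mount options, so this alone does not isolate caches between Mountpoint processes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep infinity"]
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumeMounts:
        - name: cache-location
          mountPath: /tmp/pv
          # expands to a per-pod subdirectory of /tmp/s3-cache on the host
          subPathExpr: $(POD_NAME)
  volumes:
    - name: cache-location
      hostPath:
        path: /tmp/s3-cache
        type: DirectoryOrCreate
```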
I worked through different ways of creating a host path for the project I work on. I came across a few interesting constraints that are worth sharing.
- Using a node provisioner, such as setup scripts in a Karpenter EC2NodeClass, only works if the paths are known ahead of time. If you scale buckets up and down independently of the lifetime of the node, it is impossible to create all the directories ahead of time this way.
- Using a DaemonSet to mount the hostPath on every node does work, but you quickly hit pod limits per node if you mount a scaling number of buckets. For example, we create review environments for every PR, each with its own bucket, so as PRs grow the number of DaemonSets grows with them, leaving less and less room for other pods on the nodes.
- Using hostPath volumes directly on the workloads works, but if you are using Knative or a similar wrapper for your workloads, hostPath might not be exposed. To get around this (which is ultimately the solution I used) I create a PersistentVolume configured specifically for hostPath and then create a PersistentVolumeClaim to map it to my workloads.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-bucket-cache
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tmp/cache-default-bucket
    type: DirectoryOrCreate
  capacity:
    storage: 500Mi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    namespace: default
    name: bucket-cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: default
  name: bucket-cache
spec:
  storageClassName: manual
  resources:
    requests:
      storage: 500Mi
  volumeName: default-bucket-cache # must match the PV name above
  accessModes:
    - ReadWriteMany
```
It would be nice if there were a way to simply specify a PV for Mountpoint to use as cache. I am thinking specifically of putting the cache somewhere other than the host, so that it could be reused across multiple nodes.
To add to these points, I would have liked the driver to expose k8s-specific caching configuration that it interprets before starting a Mountpoint process. This would include creating a cache directory on the node specifically for the mount being prepared, at the path given in the configuration.
Even more ideal would be the ability to use an EBS-backed (or other) volume as the cache, so that normal node operations can't be compromised by low disk space, but this poses some implementation questions. Perhaps #279 can offer a solution to this by running Mountpoint in a sidecar.
/feature
**Is your feature request related to a problem? Please describe.**
Caching is supported today by adding a `cache` option to a persistent volume configuration and passing in a directory on the node's filesystem. This works, but comes with a couple of sharp edges: creating the directory on the node is not done automatically, so it has to be created manually ahead of time.

**Describe the solution you'd like in detail**
Caching configuration should be possible without manually making changes to the nodes, and should make it easy to define different types of storage to use as cache, such as a ramdisk.

**Describe alternatives you've considered**
One potential solution is to reference other persistent volumes or mounts as cache, which could make for nice composability of the k8s constructs.

**Additional context**
Mountpoint's documentation on caching: https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#caching-configuration
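For reference, today's node-filesystem caching is configured through the PV's mount options, following the static provisioning example linked earlier in this thread. A minimal sketch - the bucket name, region, and sizes are placeholders, and the cache directory must already exist on the node (which is the sharp edge this issue is about):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi # ignored by the driver, but required by Kubernetes
  accessModes:
    - ReadWriteMany
  mountOptions:
    - region us-east-1          # placeholder region
    - cache /tmp/s3-pv1-cache   # node-local cache directory; must exist on the node
    - metadata-ttl 300          # seconds
    - max-cache-size 500        # MiB
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: amzn-s3-demo-bucket # placeholder bucket name
```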