Closed: ashish01987 closed this issue 1 month ago.

I have a local folder (backup) with ~100GB of files. If I directly tar the folder onto the bucket, e.g. tar -cf /tmp/bucketmount/backup.tar /backup/, will there be any issues with the CSI driver? I see that the gcsfuse CSI driver depends on emptyDir: {} (a temp directory) for staging files before they are uploaded to the bucket.
I think the CSI driver should work in this use case. @ashish01987, do you see any errors?
As you mentioned, gcsfuse uses a temp directory for staging files. Please consider increasing the sidecar container's ephemeral-storage limit so that gcsfuse has enough space to stage the files.
See the GKE documentation for more information.
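For example, annotations along these lines on the workload Pod should give gcsfuse more room to stage large files (the 50Gi value, pod name, and bucket name below are only illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gcs-fuse-staging-example                  # illustrative name
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/ephemeral-storage-limit: "50Gi"   # size this to fit your largest staged file
spec:
  serviceAccountName: gcs-csi                     # KSA with access to the bucket (via Workload Identity)
  containers:
    - name: workload
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: gcs-fuse-csi-ephemeral
          mountPath: /data
  volumes:
    - name: gcs-fuse-csi-ephemeral
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeAttributes:
          bucketName: my-bucket                   # illustrative bucket name
```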
Thanks for the quick response. I created a 30GB tar file "backup.tar" directly on the GCS bucket mounted by the CSI sidecar and did not find any issues with it. Just one question here: when "backup.tar" (30GB) is being created on the mounted bucket, will the CSI sidecar wait for the complete "backup.tar" (30GB) file to be created on ephemeral storage (emptyDir: {}) and only then copy it to the actual bucket in Cloud Storage?
If yes, I am a bit concerned about the case where the "backup.tar" size keeps increasing (maybe to 100GB or more due to regular backups) and sufficient node ephemeral storage is not available. In that case, one may have to increase the nodes' ephemeral storage manually, which might cause downtime for the cluster (probably?).
I see that the CSI sidecar uses the "gke-gcsfuse-tmp" mount point, backed by emptyDir: {}, for staging files before uploading.
It would be great if allocating storage for gke-gcsfuse-tmp from a regular persistent disk (or an NFS share) were supported here. That way we could allocate any amount of storage without changing the nodes' ephemeral storage (and avoid cluster downtime).
I tried something like this in the Pod spec, with gke-gcsfuse-tmp backed by a PVC instead of the emptyDir (snippet below):
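```yaml
# gke-gcsfuse-tmp pointed at my PVC instead of emptyDir (approximate snippet)
volumes:
  - name: gke-gcsfuse-tmp
    persistentVolumeClaim:
      claimName: my-pvc-backup
```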
However, it did not work: the deployment did not start and was not able to find the CSI sidecar. Probably some validation is in place to check that "gke-gcsfuse-tmp" uses emptyDir: {} only?
Maybe supporting both emptyDir: {} and storage allocated from a PVC (as above) for "gke-gcsfuse-tmp" would be beneficial, if the implementation is feasible.
@songjiaxun Let me know your thoughts on this
@songjiaxun any thoughts on this ?
Hi @ashish01987, thanks for testing out the staging file system.
To answer your question: yes, in the current design the volume gke-gcsfuse-tmp has to be an emptyDir; see the validation logic code.
The GCS FUSE team is working on write-through features, which means the staging volume may not be needed in a future release. @sethiay and @Tulsishah, could you share more information about the write-through feature? And will the write-through feature support this "tar file" use case?
Meanwhile, @judemars FYI as you may need to add a new volume to the sidecar container for the read caching feature.
Thanks @songjiaxun for looping us in. Currently, we are evaluating support for a write-through feature in GCSFuse, i.e. allowing users to write directly to GCS without buffering on local disk. Given that tar works with GCSFuse now, we expect it to work with the write-through feature as well.
What is the expected timeline for the write-through feature?
@ashish01987 Currently, we don't have any timelines to share.
@songjiaxun since we don't know the timeline for the write-through feature, as a workaround can we disable this validation logic and support allocating storage for gke-gcsfuse-tmp from any PVC?
i.e. the storage could be allocated from a persistent disk instead of the node's ephemeral storage?
That way, customers using the gcsfuse CSI driver would never face issues like "insufficient ephemeral storage".
I am not sure, but such issues can arise in clusters where multiple pods each have their own gcs-csi sidecar instance.
@ashish01987 , thanks for the suggestion.
As more and more customers report "insufficient ephemeral storage" issues, we are exploring the possibility of allowing users to use volumes backed by other media, rather than emptyDir, for the write staging.
FYI @judemars .
For the time being, is it possible to make this validation https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/b0d0325299800ea9494da9b03ba991f06830cb24/pkg/webhook/sidecar_spec.go#L113 optional, so that I can define the sidecar with gke-gcsfuse-tmp pointing to a PVC?

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-test
spec:
  serviceAccountName: gcs-csi
  containers:
    - name: busybox
      image: busybox
      resources:
        limits:
          cpu: 250m
          ephemeral-storage: 1Gi
          memory: 256Mi
        requests:
          cpu: 250m
          ephemeral-storage: 1Gi
          memory: 256Mi
      command:
    - name: gke-gcsfuse-sidecar
      image: gke.gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v0.1.4-gke.1@sha256:442969f1e565ba63ff22837ce7a530b6cbdb26330140b7f9e1dc23f53f1df335
      imagePullPolicy: IfNotPresent
      args:
  volumes:
    - name: gcs-fuse-csi-ephemeral
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeAttributes:
          bucketName:
    - name: gke-gcsfuse-tmp
      persistentVolumeClaim:
        claimName: my-pvc-backup
```
I see that writing very large files, like a 70GB .tar file, will fail if that much ephemeral storage is not available on the node.
@ashish01987 , thanks for the suggestion, and reporting the issue. I am actively working on skipping the validation and will keep you posted.
Thanks for looking into this. Maybe it would be great if the claimName (e.g. my-pvc-backup) for gke-gcsfuse-tmp could be passed as a parameter through an annotation on the pod.
I am working on a new feature to allow you to specify a separate volume for the write buffering. I will keep you posted.
To avoid cross-posting: I think that in the meantime we could still better notify the user about the sidecar-specific temp dir pressure/occupancy.
See more at https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/21#issuecomment-1910095925
@songjiaxun In the meantime, is there a temporary workaround to monitor ephemeral storage occupancy with a kubectl exec command on the gcsfuse sidecar container?
I have a pod that was evicted for exceeding ephemeral storage, but I cannot account for the excess, and I want to investigate/monitor the gcsfuse ephemeral disk occupancy.
Hi @bhack, because the gcsfuse sidecar container is a distroless container, you cannot run any shell commands in it using kubectl exec.
We are rolling out the feature to support custom volumes for write buffering. The new feature should be available soon.
Meanwhile, if you are experiencing ephemeral storage limit issues, consider setting the pod annotation gke-gcsfuse/ephemeral-storage-limit: "0". It will unset any ephemeral storage limit on the sidecar container.
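For example, in the Pod metadata:

```yaml
metadata:
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/ephemeral-storage-limit: "0"   # unsets the sidecar's ephemeral storage limit
```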
gke-gcsfuse/ephemeral-storage-limit: "0"
Is this OK in Autopilot, or will it be rejected?
Oh sorry, I forgot the context of Autopilot. Unfortunately, no, gke-gcsfuse/ephemeral-storage-limit: "0" only works on Standard clusters.
Is your application writing large files back to the bucket?
Not so large. It is a classical ML workload, with regular checkpoints + TB logs. Everything goes OK, but at some point, after many K steps, the pod starts to be regularly evicted for ephemeral storage, also after restarting from the last checkpoint (e.g. I've also tested this with a restarting job and with spot instances).
The main pod is quite complex, but it seems it is not writing anything other than to the csi-gcsfuse mounted volumes. With the current tools, though, it is hard to debug.
That is why we need something to monitor the sidecar's pressure/occupancy on the ephemeral storage (and likely on CPU and memory, too).
It is both to debug when things fail and to keep a residual margin in resource planning. The latter is also important on Autopilot, as we cannot just set the limits to 0.
Just to focus on the ephemeral storage point: in the sidecar log I see
`sidecar_mounter.go:86] gcsfuse mounting with args [gcsfuse --temp-dir <volume-name-temp-dir>`
Can you monitor that temp-dir occupancy in Go, so that we can start to have some warnings in the logs?
Thanks for the information @bhack. Yes, we do plan to add more logs and warnings to make the ephemeral storage usage more observable.
Can I know your node type? What compute class or hardware configuration are you using?
On Autopilot, for most of the compute classes, the maximum ephemeral storage you can use is 10Gi, so you can use the annotation gke-gcsfuse/ephemeral-storage-limit: 10Gi to specify it. Please note that the container image and logging also use ephemeral storage.
Is the container image part of the node's ephemeral storage? I don't think it is part of the pod's ephemeral storage request, is it?
Because if the image is not part of the pod request, and we request 4Gi or 5Gi via gke-gcsfuse/ephemeral-storage-limit and get evicted for surpassing that 4Gi or 5Gi limit, then the eviction could not have been caused by the image size, right?
Edit: the hardware configuration in this test was nvidia-tesla-a100 in the 16-GPU config, so it is a2-megagpu-16g.
Hi @bhack,
gke-gcsfuse/ephemeral-storage-limit: 10Gi

The main problem is still auditing the gcsfuse sidecar vs. the pod. If I run a job for 4-5 hours doing exactly the same thing, e.g. a training job/loop, and then the pod is evicted, I need to understand what is happening to the sidecar's ephemeral storage. Is there a problem with the sidecar driver accumulating too many files in the temp dir at some point? Is there a bug? If I don't know the specific ephemeral storage pressure of the sidecar temp dir, it is impossible to investigate.
@bhack, yes, it makes sense. I will let you know when the warning logs are ready.
Thanks, I hope we can also add this for CPU and memory later, especially since on Autopilot we cannot set the sidecar resources to "0".
Hi @bhack, I wanted to use the same fs metrics collection approach Kubernetes uses, for example the SYS_STATFS system call: https://github.com/kubernetes/kubernetes/blob/dbd3f3564ac6cca9a152a3244ab96257e5a4f00c/pkg/volume/util/fs/fs.go#L40-L63
However, I believe the SYS_STATFS system call does the calculation at the device level. This means that if the buffer volume is an emptyDir, which is the default setting in our case, the returned volume usage/availability is the underlying boot disk's usage/availability.
I am exploring other approaches to calculate just the buffer volume usage.
Aren't they using the same unix.Statfs for emptyDir?
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/emptydir/empty_dir_linux.go#L94-L100
What do you think about https://github.com/kubernetes/kubernetes/pull/121489 ?
Relevant info is now covered in https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#prepare-mount
As the doc https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#buffer-volume mentions, for large file write operations, please consider increasing the size of the write buffer volume.
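Per that doc, the configuration is roughly shaped like the sketch below: a pre-provisioned PVC plus a Pod volume named gke-gcsfuse-buffer, which the sidecar then uses for write staging instead of the node's ephemeral storage (all other names, the StorageClass, and the sizes here are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcsfuse-write-buffer            # placeholder PVC name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: premium-rwo         # any suitable StorageClass
  resources:
    requests:
      storage: 200Gi                    # size to fit the largest staged file
---
apiVersion: v1
kind: Pod
metadata:
  name: backup-pod                      # placeholder
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  containers:
    - name: workload
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: gcs-fuse-csi-ephemeral
          mountPath: /data
  volumes:
    - name: gke-gcsfuse-buffer          # used by the sidecar as the write buffer volume
      persistentVolumeClaim:
        claimName: gcsfuse-write-buffer
    - name: gcs-fuse-csi-ephemeral
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeAttributes:
          bucketName: my-backup-bucket  # placeholder
```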
Closing this case.