GoogleCloudPlatform / gcs-fuse-csi-driver

The Google Cloud Storage FUSE Container Storage Interface (CSI) Plugin.

ephemeral storage for very large tar ~100GB #57

Closed ashish01987 closed 1 month ago

ashish01987 commented 1 year ago

I have a local folder (backup) with ~100GB of files. If I tar the folder directly onto the bucket, e.g. `tar -cf /tmp/bucketmount/backup.tar /backup/`, will there be any issues with the CSI driver? I see that the gcsfuse CSI driver depends on emptyDir: {} or some temp directory for staging files before they are uploaded to the bucket.

songjiaxun commented 1 year ago

I think the CSI driver should work in this use case. @ashish01987 do you see any errors?

As you mentioned, gcsfuse uses a temp directory for staging files. As a result, please consider increasing the sidecar container ephemeral-storage-limit so that gcsfuse has enough space for staging the files.

See the GKE documentation for more information.
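For context, a minimal sketch of the kind of Pod this thread is discussing, combining a Cloud Storage FUSE CSI ephemeral volume with the ephemeral storage annotation (the names, image, and the 150Gi value are placeholders, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tar-backup                    # placeholder name
  annotations:
    gke-gcsfuse/volumes: "true"
    # Give the injected gcsfuse sidecar enough staging space for the tar file.
    gke-gcsfuse/ephemeral-storage-limit: "150Gi"   # placeholder value
spec:
  containers:
  - name: backup
    image: ubuntu:22.04               # placeholder image
    command: ["bash", "-c", "tar -cf /tmp/bucketmount/backup.tar /backup/"]
    volumeMounts:
    - name: gcs-bucket
      mountPath: /tmp/bucketmount
    - name: backup-source
      mountPath: /backup
  volumes:
  - name: gcs-bucket
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-backup-bucket  # placeholder bucket name
  - name: backup-source
    emptyDir: {}                      # placeholder for the local /backup data
```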

ashish01987 commented 1 year ago

Thanks for the quick response. I created a 30GB tar file, "backup.tar", directly on the GCS bucket mounted by the CSI sidecar and did not find any issues with it. Just one question: while "backup.tar" (30GB) is being created on the mounted bucket, will the CSI sidecar wait for the complete 30GB file to be written to ephemeral storage (emptyDir: {}) and only then copy it to the actual bucket in Cloud Storage?

If yes, I am a bit concerned about the case where the size of "backup.tar" keeps increasing (maybe to 100GB or more due to regular backups) and sufficient node ephemeral storage is not available. In that case, one may have to increase the node's ephemeral storage manually, which might cause downtime for the cluster (probably?).

I see that the CSI sidecar uses the "gke-gcsfuse-tmp" mount point, backed by emptyDir: {}, for staging files before uploading.

It would be great if allocating storage from a regular persistent disk (or an NFS share) were supported for gke-gcsfuse-tmp. That way we could allocate any amount of storage without changing the node's ephemeral storage (and avoid cluster downtime).

I tried something like this, overriding the gke-gcsfuse-tmp volume in the Pod's volumes: section:
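Roughly along these lines (a hypothetical reconstruction, since the original snippet was not preserved; the claim name echoes the my-pvc-backup mentioned later in this thread):

```yaml
volumes:
- name: gke-gcsfuse-tmp
  persistentVolumeClaim:
    claimName: my-pvc-backup   # PVC instead of the emptyDir the webhook injects
```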

However, it did not work: the deployment did not start and was not able to find the CSI sidecar. Probably some validation is in place to check that "gke-gcsfuse-tmp" uses emptyDir: {} only?

Maybe supporting both emptyDir: {} and storage allocated from a PVC (as above) for "gke-gcsfuse-tmp" would be beneficial (if the implementation is feasible).

@songjiaxun Let me know your thoughts on this

ashish01987 commented 1 year ago

@songjiaxun any thoughts on this?

songjiaxun commented 1 year ago

Hi @ashish01987, thanks for testing out the staging file system.

To answer your question: yes, in the current design the volume gke-gcsfuse-tmp has to be an emptyDir; see the validation logic code.

The GCS FUSE team is working on write-through features, which means the staging volume may not be needed in a future release. @sethiay and @Tulsishah, could you share more information about the write-through feature? And will the write-through feature support this "tar file" use case?

Meanwhile, @judemars FYI as you may need to add a new volume to the sidecar container for the read caching feature.

sethiay commented 1 year ago

Thanks @songjiaxun for looping us in. Currently, we are evaluating support for a write-through feature in GCSFuse, i.e. allowing users to write directly to GCS without buffering on local disk. Given that tar works with GCSFuse now, we expect it to work with the write-through feature as well.

ashish01987 commented 1 year ago

What is the expected timeline for the write-through feature?

sethiay commented 1 year ago

@ashish01987 Currently, we don't have any timelines to share.

ashish01987 commented 1 year ago

@songjiaxun Since we don't know the timeline for the write-through feature, as a workaround can we disable this validation logic code and support allocating storage from any PVC for gke-gcsfuse-tmp?

That is, could the storage be allocated from a persistent disk instead of the node's ephemeral storage?

That way, customers using the gcsfuse CSI driver would never face issues like "insufficient ephemeral storage".

I am not sure, but such issues can arise in a cluster where multiple pods each have their own gcsfuse CSI sidecar instance.

songjiaxun commented 1 year ago

@ashish01987 , thanks for the suggestion.

As more and more customers report "insufficient ephemeral storage" issues, we are exploring possibilities to allow users to use volumes backed by other media, rather than emptyDir, for write staging.

FYI @judemars .

ashish01987 commented 1 year ago

For the time being, is it possible to make this validation https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/b0d0325299800ea9494da9b03ba991f06830cb24/pkg/webhook/sidecar_spec.go#L113 optional, so that I can define the sidecar with gke-gcsfuse-tmp pointing to a PVC? The Pod spec I tried started along the lines of `apiVersion: v1 kind: Pod metadata: name: sidecar-test spec: serviceAccountName: gcs-csi containers: ...`
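The PVC referenced in that idea might look something like this (hypothetical; the my-pvc-backup name comes from the later comment, and the size is a placeholder):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc-backup
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi   # placeholder size, large enough to stage the tar file
```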

ashish01987 commented 1 year ago

I see that writing very large files, such as a 70GB .tar file, will fail if that much ephemeral storage is not present on the node.

songjiaxun commented 1 year ago

@ashish01987 , thanks for the suggestion, and reporting the issue. I am actively working on skipping the validation and will keep you posted.

ashish01987 commented 1 year ago

Thanks for looking into this. Maybe it would be great if "claimName: my-pvc-backup" for gke-gcsfuse-tmp could be passed as a parameter through an annotation on the pod.

songjiaxun commented 11 months ago

I am working on a new feature to allow you to specify a separate volume for the write buffering. I will keep you posted.

bhack commented 9 months ago

To avoid cross-posting: I think that in the meantime we could still do a better job of notifying the user about the sidecar-specific tempdir pressure/occupancy. See more at https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/21#issuecomment-1910095925

bhack commented 9 months ago

@songjiaxun In the meantime, is there a temporary workaround to monitor ephemeral occupancy with a kubectl exec command on the gcsfuse sidecar container? I have a pod that was evicted for exceeding ephemeral storage, but I cannot tell why, and I want to investigate/monitor the gcsfuse ephemeral disk occupancy.

songjiaxun commented 9 months ago

Hi @bhack, because the gcsfuse sidecar container is a distroless container, you cannot run any bash commands in it using kubectl exec.

We are rolling out the feature to support custom volumes for write buffering. The new feature should be available soon.

Meanwhile, if you are experiencing ephemeral storage limit issues, consider setting the pod annotation gke-gcsfuse/ephemeral-storage-limit: "0". It will unset any ephemeral storage limit on the sidecar container.
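For example, a minimal sketch of where that annotation goes (the Pod name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-workload               # placeholder
  annotations:
    gke-gcsfuse/volumes: "true"
    # Removes the ephemeral storage limit on the gcsfuse sidecar container.
    gke-gcsfuse/ephemeral-storage-limit: "0"
```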

bhack commented 9 months ago

gke-gcsfuse/ephemeral-storage-limit: "0"

Is this ok in Autopilot or will it be rejected?

songjiaxun commented 9 months ago

gke-gcsfuse/ephemeral-storage-limit: "0"

Is this ok in Autopilot or will it be rejected?

Oh sorry, I forgot the context of Autopilot. Unfortunately, no: gke-gcsfuse/ephemeral-storage-limit: "0" only works on Standard clusters.

Is your application writing large files back to the bucket?

bhack commented 9 months ago

Is your application writing large files back to the bucket?

Not so large. It is a classic ML workload with regular checkpoints + TensorBoard logs. Everything goes fine, but at some point, after many thousands of steps, the pod starts to be regularly evicted for ephemeral storage, even after restarting from the last checkpoint (e.g. I have also tested this with a restarting job and with spot instances).

The main pod is quite complex, but it seems it is not writing anything other than to the csi-gcsfuse mounted volumes. With the current tools, though, it is hard to debug.

That is why we need something to monitor the sidecar's pressure/occupancy on ephemeral storage (and likely on CPU and memory).

This is both to debug when things fail and to keep a residual margin for resource planning. The latter is also important on Autopilot, as we cannot just set the limits to 0.

Just to focus on the ephemeral storage point: in the sidecar log I see sidecar_mounter.go:86] gcsfuse mounting with args [gcsfuse --temp-dir <volume-name-temp-dir>

Can you monitor that temp-dir occupancy in Go, so that we can start to have some warnings in the logs?

songjiaxun commented 9 months ago

Thanks for the information, @bhack. Yes, we do plan to add more logs and warnings to make the ephemeral storage usage more observable.

Can I ask your node type? What compute class or hardware configuration are you using?

On Autopilot, for most of the compute classes, the maximum ephemeral storage you can use is 10Gi, so you can use the annotation gke-gcsfuse/ephemeral-storage-limit: 10Gi to specify it. Please note that the container image and logging also use ephemeral storage.

bhack commented 9 months ago

Please note that the container image and logging also use ephemeral storage.

Is the container image part of the node's ephemeral storage? I don't think it is part of the pod's ephemeral storage request, or is it?

Because if the image is not part of the pod's request, and we request 4Gi or 5Gi with gke-gcsfuse/ephemeral-storage-limit: and get evicted because we surpassed that 4Gi or 5Gi limit, then it could not be caused by the image size, right?

Edit: the hardware configuration in this test was nvidia-tesla-a100 in the 16-GPU configuration, so it is a2-megagpu-16g.

songjiaxun commented 9 months ago

Hi @bhack,

  1. Yes, the container image is a part of the node's ephemeral storage; see the documentation Ephemeral storage consumption management.
  2. The ephemeral storage limit calculation is different from CPU or memory: the CPU or memory limit is at the container level, while the ephemeral storage limit is at the Pod level. See the documentation How Pods with ephemeral-storage requests are scheduled. This means that even though the ephemeral storage limit is applied on the gcsfuse sidecar container, all the containers in the Pod are subject to that limit.
  3. On Autopilot clusters, the maximum ephemeral storage is 10Gi; see Minimum and maximum resource requests.
  4. Combining all the factors, here is my suggestion:
    • Change the Pod annotation to gke-gcsfuse/ephemeral-storage-limit: 10Gi (see the sketch after this list).
    • Audit your application to see if the container image is too large.
    • Wait for the custom buffer volume support, which should be available very soon.
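A minimal sketch of that annotation change (the Pod name is a placeholder; remember the limit is accounted at the Pod level):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job              # placeholder
  annotations:
    gke-gcsfuse/volumes: "true"
    # Autopilot caps Pod ephemeral storage at 10Gi; the limit applies Pod-wide.
    gke-gcsfuse/ephemeral-storage-limit: 10Gi
```
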
bhack commented 9 months ago

The main problem is still auditing the gcsfuse sidecar vs. the pod. If I am running a job for 4-5 hours doing exactly the same things, e.g. a training job/loop, and then the pod is evicted, I need to understand what is happening to the sidecar's ephemeral storage. Is there a problem with the sidecar driver accumulating too many files in the temp dir at some point? Is there a bug? If I don't know the specific ephemeral pressure of the sidecar temp dir, it is impossible to investigate.

songjiaxun commented 9 months ago

@bhack, yes, it makes sense. I will let you know when the warning logs are ready.

bhack commented 9 months ago

@bhack, yes, it makes sense. I will let you know when the warning logs are ready.

Thanks. I hope we can add this for CPU and memory later as well, especially since on Autopilot we cannot set the sidecar resources to "0".

songjiaxun commented 9 months ago

Hi @bhack, I wanted to use the same fs metrics collection approach Kubernetes uses, for example the SYS_STATFS system call: https://github.com/kubernetes/kubernetes/blob/dbd3f3564ac6cca9a152a3244ab96257e5a4f00c/pkg/volume/util/fs/fs.go#L40-L63

However, I believe the SYS_STATFS system call does the calculation at the device level. That means if the buffer volume is an emptyDir, which is the default setting in our case, the returned volume usage/availability is the underlying boot disk's usage/availability.

I am exploring other approaches to calculate just the buffer volume usage.

bhack commented 9 months ago

Are they not using the same unix.Statfs in emptydir? https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/emptydir/empty_dir_linux.go#L94-L100

bhack commented 9 months ago

What do you think about https://github.com/kubernetes/kubernetes/pull/121489 ?

cnjr2 commented 4 months ago

Relevant info is now covered in https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#prepare-mount

songjiaxun commented 1 month ago

As the doc https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#buffer-volume mentions, for large file write operations, please consider increasing the size of the write buffer volume.
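For reference, a sketch of such a custom write buffer volume, assuming the gke-gcsfuse-buffer volume name described in the linked documentation (the PVC name, size, and other names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: large-tar-writer               # placeholder
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  containers:
  - name: workload
    image: ubuntu:22.04                # placeholder
    volumeMounts:
    - name: gcs-bucket
      mountPath: /data
  volumes:
  - name: gcs-bucket
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-backup-bucket   # placeholder
  # Custom write buffer volume used by the sidecar instead of the default emptyDir.
  - name: gke-gcsfuse-buffer
    persistentVolumeClaim:
      claimName: my-buffer-pvc         # placeholder PVC sized for the largest writes
```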

Closing this case.