GoogleCloudPlatform / gcs-fuse-csi-driver

The Google Cloud Storage FUSE Container Storage Interface (CSI) Plugin.
Apache License 2.0

Optimization Needed for File Listing Operations on gcs-fuse-csi Mounted Volumes in Training Jobs #200

Open bhack opened 5 months ago

bhack commented 5 months ago

We've hit a significant performance bottleneck in our training jobs when using file listing calls such as Path.rglob to enumerate training assets stored on gcs-fuse-csi mounted volumes. The problem is already evident with datasets of typical size and causes considerable cold-start delays before training can begin.

This latency not only slows the initial start-up of our training jobs but is also a substantial problem on GKE spot instances. Each time a job is preempted and restarts from the last saved checkpoint, it pays the cold-start penalty again because the data loaders must be prepared from scratch.

This recurring overhead directly impacts cost-efficiency and resource utilization, particularly in a dynamic scaling environment where jobs are frequently interrupted and resumed. Addressing this file listing performance issue could significantly reduce start-up times and improve the overall efficiency of training jobs on GKE spot instances.

songjiaxun commented 5 months ago

I believe we will need the following new features or improvements to solve this issue:

  1. Pre-fetch or cache the object metadata so that listing is fast.
  2. Persist this metadata across pod or node lifecycles.
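
Until something like this exists in the driver, the same two ideas can be approximated at the application level. The following is only a minimal sketch, not a driver feature: it lists the dataset once and saves the result to a manifest file that later restarts reuse. The function names and the manifest location are hypothetical, and the sketch assumes the dataset does not change between runs (a stale manifest would otherwise have to be deleted by hand).

```python
import json
import os
from pathlib import Path


def build_manifest(root):
    """Walk root once and return sorted file paths relative to root."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            files.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(files)


def load_or_build_manifest(root, manifest_path):
    """Reuse a saved listing if one exists; otherwise list once and save it.

    manifest_path is a hypothetical location chosen by the caller. It should
    live on storage that survives pod restarts and outside root, so the
    manifest does not end up listing itself.
    """
    manifest_path = Path(manifest_path)
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())
    files = build_manifest(root)
    # Write to a temporary file first so a preempted job never leaves a
    # half-written manifest behind.
    tmp_path = Path(str(manifest_path) + ".tmp")
    tmp_path.write_text(json.dumps(files))
    os.replace(tmp_path, manifest_path)
    return files
```

With this, a restarted job would pay the full listing cost only when the manifest is missing; every other start reads a single file.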

We are working on our roadmap, and will share more information.

FYI @sethiay

bhack commented 5 months ago

This seems like a good plan, but I also think Path.rglob carries specific extra overhead compared with other types of listing.
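
One way to check whether Path.rglob really costs more than other listing approaches is to time it against os.walk on the same mounted directory. The snippet below is a rough sketch for that comparison; the helper names are made up for illustration, and the mount path you pass in is whatever your pod uses. Results will depend on the Python version and on any metadata caching already in effect, so a second run over the same tree may be much faster than the first.

```python
import os
import time
from pathlib import Path


def list_with_rglob(root):
    """Collect regular files under root using Path.rglob."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())


def list_with_walk(root):
    """Collect non-directory entries under root using os.walk."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(found)


def timed(fn, root):
    """Run fn(root) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(root)
    return result, time.perf_counter() - start


def compare(root):
    """Print the wall-clock time of both listing approaches for root."""
    for fn in (list_with_rglob, list_with_walk):
        files, seconds = timed(fn, root)
        print(f"{fn.__name__}: {len(files)} files in {seconds:.3f}s")
```

Calling `compare()` with the volume's mount path prints one line per approach. Because the first approach to run also warms any caches, it is worth repeating the measurement with the order swapped.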