Open bhack opened 8 months ago
I believe we will need the following new features or improvements to solve this issue:
We are working on our roadmap, and will share more information.
FYI @sethiay
This seems like a good plan, but I also think there is specific extra overhead with Path.rglob compared with other types of listing.
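One way to check whether Path.rglob really adds overhead on a given mount is to time it against another listing method over the same directory tree. The snippet below is only a minimal sketch: the helper names (list_with_rglob, list_with_walk, compare) are made up for illustration, and os.walk is just one possible baseline, not something the maintainers have suggested. It makes no claim about which approach is faster on a gcsfuse-backed volume.

```python
import os
import time
from pathlib import Path


def list_with_rglob(root):
    """Enumerate files with Path.rglob; is_file() may cost one stat per entry."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())


def list_with_walk(root):
    """Enumerate files with os.walk, which uses os.scandir internally.

    Note: os.walk reports every non-directory entry as a file, so results can
    differ from list_with_rglob when broken symlinks or special files exist.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(found)


def compare(root):
    """Return {method name: (file count, elapsed seconds)} for both listings."""
    results = {}
    for name, fn in (("rglob", list_with_rglob), ("walk", list_with_walk)):
        start = time.perf_counter()
        files = fn(root)
        results[name] = (len(files), time.perf_counter() - start)
    return results
```

Calling compare() on the mounted dataset directory from a freshly started pod (so that no metadata is cached yet) should give numbers closest to the cold start case described in this issue.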
We've encountered a significant performance bottleneck in our training jobs, specifically when using file listing calls such as Path.rglob to enumerate trainable assets stored on gcs-fuse-csi mounted volumes. The issue is already evident with datasets of typical size and leads to considerable cold start delays before training can begin. This latency not only slows the initial start-up of our training jobs but also poses a substantial challenge when using GKE spot instances: each time a job is preempted and restarts from the last saved checkpoint, it incurs the cold start penalty again because the data loaders have to be prepared again.
This recurring overhead directly impacts cost-efficiency and resource utilization, particularly in a dynamic scaling environment where jobs are frequently interrupted and resumed. Addressing this file listing performance issue could significantly reduce start-up times and improve the overall efficiency of training jobs on GKE spot instances.