GoogleCloudPlatform / dataflux-pytorch

The Dataflux Accelerated Dataloader for PyTorch with GCS aims to improve ML training efficiency when the training dataset is stored in GCS. Training with the Dataflux Accelerated Dataloader is up to 3X faster when the dataset consists of many small files (e.g., 100-500 KB).
Apache License 2.0

add upload and download improvements to multinode #141

Closed by jdnurme 1 day ago

jdnurme commented 1 day ago

Updated GCSFileSystem to use multipart upload and faster download. Updated the associated tests and ran them locally and against a real GPU cluster to validate.
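
For readers unfamiliar with the idea, here is a minimal sketch of the general multipart pattern: split a payload into numbered parts, send the parts concurrently, then assemble them in order. It is not the code from this PR. The names `split_into_parts`, `upload_part`, and `multipart_upload` are hypothetical, and an in-memory dict stands in for the remote store, so no GCS or Dataflux API is called.

```python
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 4  # bytes per part; tiny for illustration, real parts are far larger


def split_into_parts(data: bytes, part_size: int = PART_SIZE) -> list[tuple[int, bytes]]:
    """Return (part_number, chunk) pairs that cover data in order."""
    return [
        (i // part_size + 1, data[i:i + part_size])
        for i in range(0, len(data), part_size)
    ]


def upload_part(store: dict[int, bytes], part_number: int, chunk: bytes) -> int:
    """Hypothetical stand-in for one part-upload request; writes to a dict."""
    store[part_number] = chunk
    return part_number


def multipart_upload(data: bytes, part_size: int = PART_SIZE, workers: int = 4) -> bytes:
    """Upload parts concurrently, then assemble them by part number."""
    store: dict[int, bytes] = {}
    parts = split_into_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload_part, store, n, c) for n, c in parts]
        for future in futures:
            future.result()  # re-raise any per-part failure here
    # "Complete" step: concatenate the parts in part-number order.
    return b"".join(store[n] for n in sorted(store))
```

Because each part carries its own number, parts can finish in any order and still be assembled correctly, which is what lets the transfers run in parallel.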