foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

Enable asynchronous dataloading #75

Closed: daviswer closed this 1 month ago

daviswer commented 2 months ago

The current dataloader still causes gradual asymptotic slowdowns, likely because we have n_workers fixed to 0 in the dataloader. This forces the main process to also handle dataloading synchronously, so the two tasks interfere. We fix it to 0 because the main process performs checkpointing, and if dataloading occurs in a separate worker process, the master cannot access the relevant state information from the worker.

This PR adds support for n_workers set to 1 by letting the worker checkpoint itself at set intervals, separately from the model/optimizer checkpointing that occurs in the master process. This is accomplished via a new Checkpoint_Dataset wrapper. The training script and other peripherals are updated to set n_workers to 1.
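To illustrate the idea of a dataset that checkpoints itself, here is a minimal sketch. It is not the PR's Checkpoint_Dataset: the names `SelfCheckpointingDataset` and `CountingDataset`, the pickle file format, and the item-count interval are all assumptions made for the example. It is written in plain Python so it runs without PyTorch; in a real pipeline the wrapper would presumably subclass `torch.utils.data.IterableDataset` so that it executes inside the worker process.

```python
import itertools
import os
import pickle
import tempfile


class CountingDataset:
    """Toy resumable data source (hypothetical): yields 0, 1, 2, ... forever."""

    def __init__(self):
        self.position = 0

    def __iter__(self):
        while True:
            value = self.position
            self.position += 1
            yield value

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]


class SelfCheckpointingDataset:
    """Wraps a resumable dataset and saves its state every `interval` items.

    Assumption: the wrapped dataset exposes state_dict()/load_state_dict().
    """

    def __init__(self, dataset, ckpt_path, interval):
        self.dataset = dataset
        self.ckpt_path = ckpt_path
        self.interval = interval

    def __iter__(self):
        # Resume from the last saved state, if there is one.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path, "rb") as f:
                self.dataset.load_state_dict(pickle.load(f))
        seen = 0
        for item in self.dataset:
            seen += 1
            if seen % self.interval == 0:
                self._save()
            yield item

    def _save(self):
        # Write to a temp file first so a crash never leaves a partial checkpoint.
        tmp_path = self.ckpt_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(self.dataset.state_dict(), f)
        os.replace(tmp_path, self.ckpt_path)


# Usage: consume 5 items (checkpoint written after item 2), then "restart".
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_ckpt.pkl")
    run1 = SelfCheckpointingDataset(CountingDataset(), path, interval=3)
    first = list(itertools.islice(iter(run1), 5))
    run2 = SelfCheckpointingDataset(CountingDataset(), path, interval=3)
    resumed = list(itertools.islice(iter(run2), 3))
```

In this sketch the restarted run resumes from the last saved state rather than from the last item consumed, so items seen after the most recent checkpoint are replayed. Aligning the data checkpoint interval with the model/optimizer checkpoint interval would keep the two in step.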

Note that while the Checkpoint_Dataset class has been correctness-checked via the new unit test, the main training script has not yet been tested with these changes. We do not yet know if this PR will fix the throughput issue, and this should not be merged until we do.

lchu-ibm commented 2 months ago

can we add some prints/logging in the new checkpointer?

  1. when no data ckpt is found, print something to indicate that (including the path where it looked for the ckpt), like what we did in the older checkpointer.
  2. when loading, also print the path (i.e. where it found the data ckpt).
  3. when saved, also print how much time it took, like what we did.
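The three status messages requested above could look roughly like the sketch below. The function names `load_data_ckpt` and `save_data_ckpt`, the message wording, and the pickle format are assumptions for illustration, not the checkpointer's actual API.

```python
import contextlib
import io
import os
import pickle
import tempfile
import time


def load_data_ckpt(path):
    """Load dataset state from `path`, reporting what happened (hypothetical helper)."""
    if not os.path.exists(path):
        # Case 1: nothing to load, say where we looked.
        print(f"No data checkpoint found at {path}, starting from scratch.")
        return None
    with open(path, "rb") as f:
        state = pickle.load(f)
    # Case 2: report where the checkpoint was found.
    print(f"Loaded data checkpoint from {path}.")
    return state


def save_data_ckpt(state, path):
    """Save dataset state to `path` and report the elapsed time (hypothetical helper)."""
    start = time.time()
    with open(path, "wb") as f:
        pickle.dump(state, f)
    elapsed = time.time() - start
    # Case 3: report the destination and how long the save took.
    print(f"Saved data checkpoint to {path} in {elapsed:.2f}s.")
    return elapsed


# Usage: exercise all three messages and capture them for inspection.
log = io.StringIO()
with tempfile.TemporaryDirectory() as tmp_dir, contextlib.redirect_stdout(log):
    ckpt_path = os.path.join(tmp_dir, "data_ckpt.pkl")
    missing = load_data_ckpt(ckpt_path)
    save_data_ckpt({"position": 42}, ckpt_path)
    loaded = load_data_ckpt(ckpt_path)
messages = log.getvalue().splitlines()
```

A real implementation would probably route these through the training script's existing logger, and restrict them to a single rank, rather than calling `print` directly.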

once everything looks good, we should also clean up the old checkpointer to completely remove the data part.

daviswer commented 2 months ago

Added the requested status reports. I figure we'll clean up the checkpointer utility once we have this tested and working to our satisfaction.

lchu-ibm commented 1 month ago

@daviswer I just merged the latest main into this branch.

lchu-ibm commented 1 month ago

all local tests passed and perf is better.