foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and the SDPA implementation of Flash Attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

Clean up training configs #7

Closed: lchu-ibm closed this issue 4 months ago

lchu-ibm commented 5 months ago

A list of minor issues around the training configs that should be fixed later:

  1. The stopping condition should really be `> num_step` rather than `== num_step`, since our `batch_idx` starts from 1 rather than 0. The current implementation skips the required last step, so the final checkpoint is never written (see the loop sketch after this list).
  2. We should remove the `cfg.` here.
  3. Remove `sharding_group_size` from the training configs, since we dropped support for both SSDP and TP in the open source version.
  4. Re-order the training configs into a more meaningful grouping (a config sketch covering items 3 and 4 follows this list).
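
To illustrate item 1, here is a minimal, self-contained sketch of the off-by-one; the names (`num_steps`, `save_checkpoint`, `checkpoint_interval`) are illustrative, not the repo's actual identifiers:

```python
def save_checkpoint(step: int) -> None:
    # Stand-in for the real checkpoint writer.
    print(f"checkpoint written at step {step}")

def train(loader, num_steps: int, checkpoint_interval: int) -> None:
    # batch_idx is 1-based, matching the behavior described in item 1.
    for batch_idx, _batch in enumerate(loader, start=1):
        # Buggy form: `if batch_idx == num_steps: break` exits at the top of
        # iteration `num_steps`, before that step runs, so the final
        # checkpoint below is never written.
        # Fixed form: `>` lets step `num_steps` complete, then exits.
        if batch_idx > num_steps:
            break
        # ... forward / backward / optimizer step would go here ...
        if batch_idx % checkpoint_interval == 0 or batch_idx == num_steps:
            save_checkpoint(batch_idx)

train(range(10), num_steps=5, checkpoint_interval=2)
# -> checkpoints at steps 2, 4, and 5; with `==` the step-5 checkpoint is lost
```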
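
And a speculative sketch of what items 3 and 4 could look like together: `sharding_group_size` dropped, and the remaining fields grouped by concern. The field names and defaults are examples only, not the repo's actual config schema:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # model
    model_variant: str = "7b"
    # dataset and dataloader
    data_path: str = "/path/to/data"
    seq_length: int = 4096
    # fsdp policies
    sharding_strategy: str = "fsdp"
    mixed_precision: bool = True
    fsdp_activation_checkpointing: bool = False
    # sharding_group_size removed: SSDP and TP are unsupported here (item 3)
    # training spec
    batch_size: int = 2
    num_steps: int = 1000
    learning_rate: float = 3e-4
    # checkpointing
    checkpoint_interval: int = 5000
    save_folder: str = "/path/to/checkpoints"
```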