🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
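As a minimal sketch (not this repo's code), PyTorch's SDPA entry point is a single fused call that dispatches to a FlashAttention-2 kernel on supported GPUs and falls back to the math or memory-efficient backends elsewhere; the tensor shapes below are chosen arbitrarily for illustration:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) — arbitrary illustrative shapes
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Causal attention in one fused call; no attention matrix is materialized
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

On CPU (or older GPUs) the same call still works, just without the FlashAttention-2 backend.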
A list of minor issues around training configs that should be fixed later:

- The checkpoint condition should really be `> num_step` rather than `== num_step`, as our `batch_idx` starts from 1 rather than 0. The current implementation skips the required last step, so the last checkpoint won't be written.
- Remove `cfg.sharding_group_size` here in training configs, as we dropped the support for both SSDP and TP in the open source version.
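The off-by-one above can be sketched in plain Python; `run` and its checkpoint logic are hypothetical, purely to illustrate why `== num_step` exits one step too early when `batch_idx` starts from 1:

```python
def run(num_step: int, stop_when_greater: bool) -> list[int]:
    """Toy training loop; returns the steps at which a checkpoint was written."""
    checkpoints = []
    batch_idx = 1  # starts from 1, not 0
    while True:
        if stop_when_greater:
            if batch_idx > num_step:   # fixed condition: step num_step still runs
                break
        else:
            if batch_idx == num_step:  # buggy condition: exits before step num_step runs
                break
        # ... training step would happen here ...
        if batch_idx % 2 == 0:         # pretend we checkpoint every 2 steps
            checkpoints.append(batch_idx)
        batch_idx += 1
    return checkpoints

print(run(4, stop_when_greater=False))  # [2]     — step-4 checkpoint skipped
print(run(4, stop_when_greater=True))   # [2, 4]  — last checkpoint written
```

With `num_step = 4`, the buggy `==` check runs steps 1–3 only, so the final checkpoint is never written; the `>` check lets step 4 complete first.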