🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
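As a minimal sketch (not this repo's code), PyTorch's SDPA entry point is a single fused call that dispatches to a FlashAttention-2 kernel on supported GPUs and falls back to the math or memory-efficient backends elsewhere; the tensor shapes below are chosen arbitrarily for illustration:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) — arbitrary illustrative shapes
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Causal attention in one fused call; no attention matrix is materialized
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

On CPU (or older GPUs) the same call still works, just without the FlashAttention-2 backend.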
A list of minor issues around training configs that should be fixed later:

- The checkpoint condition should really be `> num_step` rather than `== num_step`, as our `batch_idx` starts from 1 rather than 0. The current implementation skips the required last step, so the last checkpoint won't be written.
- Remove `cfg.sharding_group_size` here in training configs, as we dropped the support for both SSDP and TP in the open source version.
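The off-by-one above can be sketched in plain Python; `run` and its checkpoint logic are hypothetical, purely to illustrate why `== num_step` exits one step too early when `batch_idx` starts from 1:

```python
def run(num_step: int, stop_when_greater: bool) -> list[int]:
    """Toy training loop; returns the steps at which a checkpoint was written."""
    checkpoints = []
    batch_idx = 1  # starts from 1, not 0
    while True:
        if stop_when_greater:
            if batch_idx > num_step:   # fixed condition: step num_step still runs
                break
        else:
            if batch_idx == num_step:  # buggy condition: exits before step num_step runs
                break
        # ... training step would happen here ...
        if batch_idx % 2 == 0:         # pretend we checkpoint every 2 steps
            checkpoints.append(batch_idx)
        batch_idx += 1
    return checkpoints

print(run(4, stop_when_greater=False))  # [2]     — step-4 checkpoint skipped
print(run(4, stop_when_greater=True))   # [2, 4]  — last checkpoint written
```

With `num_step = 4`, the buggy `==` check runs steps 1–3 only, so the final checkpoint is never written; the `>` check lets step 4 complete first.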