🚀 Efficiently (pre)train foundation models with native PyTorch features, including FSDP for distributed training and the SDPA implementation of FlashAttention-2.
Currently, the transformer block is defined and used in several places, including the ac_handler and fsdp_wrapper modules. This PR centralizes these definitions in main (where the model is defined), making it easier to switch from one model to another.
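A minimal sketch of the pattern this PR moves toward: the block class is defined (or looked up) once in main and passed to the downstream handlers, instead of each handler hard-coding it. All names here (`TransformerBlock`, `get_block_class`, `apply_fsdp`, `apply_ac`) are illustrative assumptions, not the repo's actual API, and `torch` is deliberately omitted to keep the sketch self-contained.

```python
# Hypothetical sketch of centralizing the transformer-block class in one place.
# None of these names are taken from the actual codebase.

class TransformerBlock:
    """Stand-in for a model's transformer block."""

class LlamaBlock(TransformerBlock):
    """Stand-in for one concrete model's block."""

# main defines the model and exposes its block class exactly once...
_BLOCK_REGISTRY: dict[str, type] = {"llama": LlamaBlock}

def get_block_class(model_name: str) -> type:
    """Single source of truth for which block class a model uses."""
    return _BLOCK_REGISTRY[model_name]

# ...and the wrapping/checkpointing helpers receive it as an argument,
# rather than importing or redefining the block themselves.
def apply_fsdp(block_cls: type) -> str:
    # In the real code this would build an FSDP auto-wrap policy
    # keyed on the block class.
    return f"FSDP auto-wrap policy keyed on {block_cls.__name__}"

def apply_ac(block_cls: type) -> str:
    # In the real code this would apply activation checkpointing
    # to every instance of the block class.
    return f"activation checkpointing applied to {block_cls.__name__}"

block_cls = get_block_class("llama")
print(apply_fsdp(block_cls))
print(apply_ac(block_cls))
```

Switching models then means changing a single registry entry, with ac_handler and fsdp_wrapper following along automatically.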