🚀 Efficiently (pre)train foundation models using native PyTorch features, including FSDP for distributed training and the SDPA implementation of Flash Attention v2.
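As a minimal illustration of the SDPA path mentioned above (the tensor shapes and values here are illustrative, not from this repo): PyTorch's built-in `scaled_dot_product_attention` dispatches to a fused kernel — Flash Attention v2 on supported GPUs — with no third-party dependency.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, num_heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# SDPA picks the best available backend automatically
# (Flash Attention v2 on supported CUDA devices, math fallback on CPU).
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

On CUDA hardware that supports it, the Flash Attention backend is selected automatically; no code change is needed compared to the CPU fallback.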
Related issues:
https://github.com/foundation-model-stack/fms-fsdp/issues/6 and https://github.com/foundation-model-stack/fms-fsdp/issues/15

Both could technically be solved in a different (and potentially better) way, but neither solution is trivial. We therefore made a balanced decision to resolve both for now with a slightly less optimal approach. The corresponding issues remain open for a future revisit.