🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for distributed training and the SDPA implementation of Flash Attention v2.
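For reference, a minimal sketch of routing attention through SDPA's Flash Attention backend. The tensor shapes are illustrative, and the `sdpa_kernel` context manager assumes a recent PyTorch release and a CUDA device:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes: (batch, heads, seq_len, head_dim).
# The Flash Attention backend requires a CUDA device and fp16/bf16 inputs.
q = torch.randn(4, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(4, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(4, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)

# Pin dispatch to the Flash Attention kernel; is_causal=True applies the
# causal mask without materializing it.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```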
We turned off torch.compile support a while ago due to (1) a compile accuracy issue and (2) a graph break when compiling RoPE.
Now that both are fixed, we should turn it back on to support compile.
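A minimal sketch of re-enabling compile per transformer block, assuming the model exposes its blocks as an `nn.ModuleList` named `model.layers` (this structure is an assumption for illustration, not this repo's exact code):

```python
import torch

# Compile each transformer block individually rather than the whole model;
# per-block compilation keeps graphs small and composes with activation
# checkpointing wrappers applied at the block level.
for i, block in enumerate(model.layers):
    model.layers[i] = torch.compile(block)
```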
Initial experiments show consistent loss curves across different runs (non-compile vs. compile-with-ac vs. compile-without-ac vs. compile-with-selective-ac).
We also need to raise `accumulated_cache_size_limit` to make the 70B model compilable; otherwise Dynamo throws `torch._dynamo hit config.accumulated_cache_size_limit (64)` and breaks graph compilation.
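A sketch of the config bump, using the `torch._dynamo.config.accumulated_cache_size_limit` knob named in the error above; the value 128 is illustrative, not a tuned number:

```python
import torch._dynamo

# The default limit (64) is too low for the 70B model's layer count when
# compiling per block; raising it avoids the
# "torch._dynamo hit config.accumulated_cache_size_limit (64)" graph break.
torch._dynamo.config.accumulated_cache_size_limit = 128
```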