🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and the SDPA implementation of Flash Attention v2.
Current behavior does not align with torchtitan, and the negative index has started to cause issues. This change sets the default prefix length to 0 instead of 1, so that no tokens are masked rather than the first token.
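To make the default change concrete, here is a minimal sketch of prefix masking; the helper name `make_prefix_mask` and its `prefix_len` parameter are hypothetical illustrations, not the repo's actual API:

```python
# Hypothetical sketch of prefix masking; not the repo's actual function.
import torch

def make_prefix_mask(seq_len: int, prefix_len: int = 0) -> torch.Tensor:
    """Boolean mask of shape (seq_len,) where True marks prefix tokens to exclude.

    prefix_len=0 (new default): no tokens masked.
    prefix_len=1 (old default): the first token is masked.
    """
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[:prefix_len] = True
    return mask

print(make_prefix_mask(4))                 # tensor([False, False, False, False])
print(make_prefix_mask(4, prefix_len=1))   # tensor([ True, False, False, False])
```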