foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

Add proper BOS support to dataloader #81

Closed daviswer closed 1 month ago

daviswer commented 1 month ago

Previous BOS/EOS support operated at the line level, as in encoder-decoder training. With Llama 3's BOS/EOS usage we want to extend this behavior to the document level. This PR adds support for a bos_token flag in training and dataloading. A new unit test ensures that chunking behavior remains consistent and correct.
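For illustration, document-level BOS/EOS insertion followed by fixed-length chunking can be sketched as below. This is a minimal sketch, not the fms-fsdp dataloader itself: the function names, the placeholder token ids, and the flat-list representation of tokenized documents are all assumptions made for the example.

```python
# Placeholder special-token ids for the sketch; real tokenizers define their own.
BOS, EOS = 1, 2


def wrap_documents(docs, add_bos=True, add_eos=True):
    """Prepend BOS / append EOS to each *document* (not each line)."""
    out = []
    for doc in docs:
        wrapped = list(doc)
        if add_bos:
            wrapped.insert(0, BOS)
        if add_eos:
            wrapped.append(EOS)
        out.append(wrapped)
    return out


def chunk(docs, seq_len):
    """Concatenate wrapped documents into one stream and split into
    fixed-length chunks; document boundaries are visible via BOS/EOS."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]


docs = [[10, 11, 12], [20, 21]]       # two tokenized documents
wrapped = wrap_documents(docs)        # [[1, 10, 11, 12, 2], [1, 20, 21, 2]]
chunks = chunk(wrapped, seq_len=4)    # chunks may straddle document boundaries
```

The key point the PR's unit test guards is that turning BOS insertion on or off changes only where the special tokens appear, not how the stream is chunked into fixed-length sequences.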