epfLLM / Megatron-LLM

distributed trainer for LLMs

Prepend bos token #54

Closed panx27 closed 12 months ago

panx27 commented 1 year ago

In the original Llama repository, a BOS token is prepended during inference, as seen in this code snippet.

Given this, should we also prepend a BOS token to each document during the 2nd stage of pretraining, to stay aligned with the original model's practice?

In prior models such as GPT-2 and BLOOM, an <|endoftext|> token is typically used to delineate separate documents, e.g. doc1 <eos> doc2 <eos> .... I'm not sure how Llama-2 handles this exactly, but perhaps something like <bos> doc1 <eos> <bos> doc2 <eos> ...?
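
For concreteness, here is a minimal sketch of the two packing layouts I have in mind. It assumes a hypothetical tokenizer object exposing `bos_id`, `eos_id`, and `encode()`; this is illustrative only, not the repository's actual preprocessing code:

```python
def pack_eos_only(docs, tokenizer):
    """GPT-2 / BLOOM style packing: doc1 <eos> doc2 <eos> ..."""
    ids = []
    for doc in docs:
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_id)
    return ids


def pack_bos_eos(docs, tokenizer):
    """Proposed layout: <bos> doc1 <eos> <bos> doc2 <eos> ..."""
    ids = []
    for doc in docs:
        ids.append(tokenizer.bos_id)  # prepend BOS to each document
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_id)  # EOS still delimits documents
    return ids
```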

panx27 commented 12 months ago

Based on this recent work, always adding sink tokens like BOS at the beginning might be helpful. I will close this issue.