In the original Llama repository, a BOS token is prepended during inference, as seen in this code snippet.
Given this, should we also prepend a BOS token for each document during the 2nd stage of pretraining to ensure alignment with the original model's practices?
In prior models such as GPT-2 and BLOOM, an <|endoftext|> token is typically used to delineate separate documents, e.g. doc1 <eos> doc2 <eos> .... I'm not sure how Llama-2 handles this exactly; perhaps something like <bos> doc1 <eos> <bos> doc2 <eos> ...?
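To make the question concrete, here is a minimal sketch of the two packing variants I have in mind, assuming the Hugging Face tokenizer for Llama-2 (the model name and the actual 2nd-stage pretraining pipeline are just placeholders here):

```python
# Sketch only: compares eos-only packing vs. bos+eos packing per document.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint

def pack_eos_only(docs):
    # doc1 <eos> doc2 <eos> ...  (GPT-2 / BLOOM style delimiter)
    ids = []
    for doc in docs:
        ids += tok.encode(doc, add_special_tokens=False) + [tok.eos_token_id]
    return ids

def pack_bos_eos(docs):
    # <bos> doc1 <eos> <bos> doc2 <eos> ...  (matches the inference-time BOS prepending)
    ids = []
    for doc in docs:
        ids += [tok.bos_token_id] + tok.encode(doc, add_special_tokens=False) + [tok.eos_token_id]
    return ids

docs = ["First document.", "Second document."]
print(pack_eos_only(docs))
print(pack_bos_eos(docs))
```

The second variant would keep every training document aligned with how the original repo feeds prompts at inference; the question is whether that is what was done (or should be done) during continued pretraining.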