Closed: jzhang38 closed this issue 5 months ago
It seems that you directly concatenate two documents without using the EOS token?

No, I did not include the EOS token, but note that there is a token at the beginning, which already serves the purpose of separating two sentences.

Could you elaborate on what "there is a token at the beginning" means? Are you saying the Llama tokenizer automatically prepends the BOS token to each sequence being tokenized?

I checked the dataset, and there is indeed a BOS token at the front of every document. Closing this issue.
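For readers following along, the packing scheme discussed in this thread can be sketched as below. This is a minimal illustration, not the project's actual data pipeline: `BOS_ID = 1` mirrors the usual Llama tokenizer convention, and the token ids in `docs` are toy values standing in for real tokenized documents.

```python
BOS_ID = 1  # assumed BOS id, matching the common Llama tokenizer convention


def pack_documents(docs, bos_id=BOS_ID):
    """Concatenate tokenized documents with no EOS between them.

    Each document starts with a BOS token (as the Llama tokenizer
    prepends by default), so BOS alone marks document boundaries.
    """
    packed = []
    for doc in docs:
        packed.append(bos_id)  # BOS acts as the document separator
        packed.extend(doc)
    return packed


# Toy example: two "documents" of arbitrary token ids.
docs = [[5, 6, 7], [8, 9]]
print(pack_documents(docs))  # [1, 5, 6, 7, 1, 8, 9]
```

The second document begins right after the first, with only the BOS token `1` separating them, which is the behavior confirmed when inspecting the dataset above.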