FranxYao / Long-Context-Data-Engineering

Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"

Did you use an eos token in between two documents? #9

Closed: jzhang38 closed this issue 5 months ago

jzhang38 commented 6 months ago
[Screenshot: data loading code concatenating documents, with no eos token visible between them]

It seems that you directly concatenate two documents without using an eos token?

FranxYao commented 6 months ago

No, I did not include the eos token, but note that there is a token at the beginning of each document, which already serves the purpose of separating two documents.
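
For illustration, here is a minimal sketch of this setup, assuming the HuggingFace `transformers` Llama tokenizer (the checkpoint name is just an example, not necessarily what the repo uses):

```python
from transformers import AutoTokenizer

# Sketch, not the repo's pipeline: the HuggingFace Llama tokenizer
# prepends a bos token (<s>, id 1) to every sequence by default
# (add_bos_token=True), so concatenating per-document token ids
# still leaves a boundary marker at each document start.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

docs = ["First document.", "Second document."]
ids_per_doc = [tokenizer(d)["input_ids"] for d in docs]

# Direct concatenation, with no eos inserted between documents.
concatenated = [tok for ids in ids_per_doc for tok in ids]
print(tokenizer.decode(concatenated))
# Roughly: "<s> First document.<s> Second document."
```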

jzhang38 commented 6 months ago

Could you elaborate on what "there is a token at the beginning" means? Are you saying the Llama tokenizer automatically prepends a bos token to each sequence being tokenized?

jzhang38 commented 5 months ago

I checked the dataset and there is indeed a bos token at the front of every document. Closing this issue.
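
For completeness, the kind of check described here could look like the sketch below. The field name `input_ids` and the data layout are assumptions for illustration, not the repo's actual format:

```python
# Sketch of a boundary check: count how many tokenized documents
# begin with Llama's bos token id. The field name "input_ids" is
# an assumption about the dataset schema.
BOS_ID = 1  # Llama's <s> token id

def count_bos_prefixed(examples):
    """Return (documents starting with bos, total documents)."""
    with_bos = sum(1 for ex in examples if ex["input_ids"][:1] == [BOS_ID])
    return with_bos, len(examples)

# If the tokenizer prepended bos during preprocessing, every
# document should start with it:
# with_bos, total = count_bos_prefixed(dataset)
# assert with_bos == total
```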