FranxYao / Long-Context-Data-Engineering

Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"

Did you use an eos token in between two documents? #9

Closed: jzhang38 closed this issue 5 months ago

jzhang38 commented 6 months ago
[Screenshot: data loading code concatenating documents, with no eos token visible between them]

It seems that you directly concatenate two documents without using an eos token?

FranxYao commented 6 months ago

No, I did not include the eos token, but note that there is a token at the beginning of each document, which already serves the purpose of separating two documents.
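
For illustration, here is a minimal sketch of this setup, assuming the HuggingFace `transformers` Llama tokenizer (the checkpoint name is just an example, not necessarily what the repo uses):

```python
from transformers import AutoTokenizer

# Sketch, not the repo's pipeline: the HuggingFace Llama tokenizer
# prepends a bos token (<s>, id 1) to every sequence by default
# (add_bos_token=True), so concatenating per-document token ids
# still leaves a boundary marker at each document start.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

docs = ["First document.", "Second document."]
ids_per_doc = [tokenizer(d)["input_ids"] for d in docs]

# Direct concatenation, with no eos inserted between documents.
concatenated = [tok for ids in ids_per_doc for tok in ids]
print(tokenizer.decode(concatenated))
# Roughly: "<s> First document.<s> Second document."
```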

jzhang38 commented 6 months ago

Could you elaborate on what "there is a token at the beginning" means? Are you saying the Llama tokenizer automatically prepends a bos token to each sequence being tokenized?

jzhang38 commented 5 months ago

I checked the dataset and there is indeed a bos token at the front of every document. Closing this issue.
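
For completeness, the kind of check described here could look like the sketch below. The field name `input_ids` and the data layout are assumptions for illustration, not the repo's actual format:

```python
# Sketch of a boundary check: count how many tokenized documents
# begin with Llama's bos token id. The field name "input_ids" is
# an assumption about the dataset schema.
BOS_ID = 1  # Llama's <s> token id

def count_bos_prefixed(examples):
    """Return (documents starting with bos, total documents)."""
    with_bos = sum(1 for ex in examples if ex["input_ids"][:1] == [BOS_ID])
    return with_bos, len(examples)

# If the tokenizer prepended bos during preprocessing, every
# document should start with it:
# with_bos, total = count_bos_prefixed(dataset)
# assert with_bos == total
```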