Open lucacampanella opened 4 years ago

Hi, thanks a lot for sharing the code with us. Interesting work! I have a question about tokenization for GPT-2. I've seen that you add an EOS token at the end of every sentence in each text example. Here:

Why do you do this? Wouldn't a single EOS token at the end of the whole text be sufficient? Thanks a lot for your input :)

Hi, thanks for reaching out. You are not wrong: adding one at the end should be sufficient. I just wanted to mark sentence boundaries more clearly. I would suggest you take a small subset of the data, fix the random seed, test both variants, and pick whichever performs better for your use case. It's been a long time since I did this work, but I think I tested both.

I understand the logic now. Thanks for the suggestions and the quick reply! :)
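For anyone comparing the two strategies discussed in this thread, here is a minimal sketch of the difference. `<|endoftext|>` is GPT-2's EOS token; the regex-based sentence splitter is a naive stand-in for illustration only, not this repo's actual preprocessing.

```python
import re

EOS = "<|endoftext|>"  # GPT-2's end-of-text special token

def eos_per_sentence(text: str) -> str:
    # Variant 1: append EOS after every sentence to mark sentence boundaries.
    # Naive splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s + EOS for s in sentences)

def eos_at_end(text: str) -> str:
    # Variant 2: append a single EOS at the end of the whole example.
    return text.strip() + EOS

text = "First sentence. Second sentence."
print(eos_per_sentence(text))
# First sentence.<|endoftext|> Second sentence.<|endoftext|>
print(eos_at_end(text))
# First sentence. Second sentence.<|endoftext|>
```

As suggested above, the cleanest way to choose between them is an A/B run on a small data subset with a fixed random seed, keeping everything else identical.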