cozek / OffensEval2020-code


Why add EOS token after every sentence? #1

Open lucacampanella opened 4 years ago

lucacampanella commented 4 years ago

Hi, thanks a lot for sharing the code with us; interesting work! I have a question regarding tokenization for GPT-2. I've seen that you add an EOS token at the end of every sentence in each text example, here:

def add_eos_tokens(self, text):
    # Pad the EOS token with spaces so it is tokenized as a separate unit
    eos_token = " " + self.transformer_tokenizer.eos_token + " "
    # Split the text into individual sentences
    sentences = self.sentence_detector.tokenize(text)
    # Join the sentences with EOS and append a final EOS after the last one
    eos_added_text = (
        eos_token.join(sentences) + " " + self.transformer_tokenizer.eos_token
    )
    return eos_added_text

Why do you do this? Wouldn't a single EOS token at the end of the whole text be sufficient? Thanks a lot for your input :)

cozek commented 4 years ago

Hi, thanks for reaching out. You are not wrong: adding one at the end should be sufficient. I just wanted to mark sentence boundaries more explicitly. I would suggest you take a small subset of the data, fix the random seed, test both variants, and pick whichever performs better for your test case. It's been a long time since I did this work, but I think I tested both.
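For reference, a minimal sketch of the two variants under discussion, assuming Hugging Face's GPT2Tokenizer and NLTK's Punkt sentence model as stand-ins for self.transformer_tokenizer and self.sentence_detector (neither is confirmed by the snippet above):

import nltk
from transformers import GPT2Tokenizer

nltk.download("punkt", quiet=True)  # Punkt model, assumed equivalent of sentence_detector
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # eos_token is "<|endoftext|>"

text = "This is the first sentence. Here is the second one."

# Variant A: EOS after every sentence, as in add_eos_tokens above
sentences = nltk.sent_tokenize(text)
eos = " " + tokenizer.eos_token + " "
per_sentence = eos.join(sentences) + " " + tokenizer.eos_token
# -> "This is the first sentence. <|endoftext|> Here is the second one. <|endoftext|>"

# Variant B: a single EOS at the end of the whole text
single_eos = text + " " + tokenizer.eos_token
# -> "This is the first sentence. Here is the second one. <|endoftext|>"

Variant A gives the model explicit sentence boundaries at the cost of a few extra tokens per example; with a fixed seed, the two can be compared on a small subset as suggested above.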

lucacampanella commented 4 years ago

I understand the logic now. Thanks for the suggestions and the quick reply! :)