codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0

Should random sampling be re-done every epoch? #2

codertimo closed this issue 5 years ago

codertimo commented 5 years ago

Is it okay to use randomly sampled data that was saved before training? I mean, does it have to be re-sampled every epoch?

DSKSD commented 5 years ago

Hi, Junseong. I think you should re-sample every batch to prevent overfitting. And I have a question: how do you tokenize your dataset? Their vocab size is only 30,000 (subword level) because of their internal tool (WPM).
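
A minimal sketch of what "re-sample per batch" could look like, assuming the random masking is moved into the dataset's `__getitem__` so every batch of every epoch sees fresh masks instead of one pre-generated, saved corpus. The class name `MaskedLMDataset` and the 80/10/10 split are illustrative, not the repo's actual implementation.

```python
import random
import torch
from torch.utils.data import Dataset

class MaskedLMDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, token_id_lines, mask_id, vocab_size, mask_prob=0.15):
        self.lines = token_id_lines      # list of lists of token ids
        self.mask_id = mask_id
        self.vocab_size = vocab_size
        self.mask_prob = mask_prob

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Masking happens here, so every call (every batch) draws new randomness.
        tokens = list(self.lines[idx])
        labels = [-100] * len(tokens)    # -100 = position ignored by the loss
        for i, t in enumerate(tokens):
            if random.random() < self.mask_prob:
                labels[i] = t
                r = random.random()
                if r < 0.8:              # 80%: replace with [MASK]
                    tokens[i] = self.mask_id
                elif r < 0.9:            # 10%: replace with a random token
                    tokens[i] = random.randrange(self.vocab_size)
                # remaining 10%: keep the original token
        return torch.tensor(tokens), torch.tensor(labels)
```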

codertimo commented 5 years ago

Thank you for asking @DSKSD, good to see you in the open-source community.

Sampling Issue

Hmm, I'm worried about overfitting too if we use the generated (randomly sampled) corpus for every epoch. I'm not sure how the Google folks did it; maybe I can email the authors?

Tokenization issue

The paper says they used the WordPiece model that Google AI developed (it might be different from SentencePiece), so the input corpus should be tokenized accordingly. In my code, we only tokenize on the space character. (Added: I used an enterprise-internal tokenization model for the Korean corpus.)
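
For comparison, a tiny sketch (an assumption, not the repo's actual code) of the whitespace-only approach described above: every distinct surface form becomes its own vocab entry, which is why such a vocabulary grows far beyond WordPiece's ~30k subwords.

```python
from collections import Counter

def build_vocab(corpus_lines, max_size=None):
    counter = Counter()
    for line in corpus_lines:
        counter.update(line.split())     # split on whitespace only
    words = [w for w, _ in counter.most_common(max_size)]
    return {w: i for i, w in enumerate(words)}

vocab = build_vocab(["the quick brown fox", "the lazy dog"])
```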

MarkWuNLP commented 5 years ago

Hi, Junseong. As far as I know, WordPiece can be obtained easily with learn_bpe and apply_bpe from machine-translation open-source code (OpenNMT and tensor2tensor). You can try it! However, the effect of BPE on Chinese is not as good as on English, and I believe a similar phenomenon will be observed in Korean as well.
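
A hedged sketch of the approach above using the subword-nmt package (`pip install subword-nmt`), which provides the learn_bpe / apply_bpe tools mentioned; the file names are illustrative, and this yields BPE rather than Google's internal WordPiece, but it gives a comparable ~30k subword vocabulary.

```python
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 30k merge operations from a raw text corpus (file names are assumptions).
with codecs.open("corpus.txt", encoding="utf-8") as fin, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=30000)

# Load the learned codes and segment new text into subwords.
with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

print(bpe.process_line("unbelievable results"))  # e.g. "un@@ believ@@ able results"
```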

codertimo commented 5 years ago

@DSKSD version 0.0.1a3 has been updated with random sampling per batch.