[x] 1. Preprocess the padding tokens ('w_pad' and 't_pad'). It is fine to drop them altogether: the evaluation code already filters them out, and the WordPiece tokenizer used for the BERT model adds its own zero padding. (See the first sketch after this list.)
[x] 2. Account for BERT's WordPiece tokenizer: each token in the training set can be split into multiple sub-tokens when fed into BERT. (See the second sketch after this list.)
    2.1. Decide how to handle "valid ids": either add a separate "[INV]" label like the Naver implementation, or mask the extra sub-tokens so they do not introduce noise.
[x] 3. Check the maximum sequence length in the original data and make sure 128 is sufficient. (See the last sketch after this list.)
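
For item 1, a minimal sketch of stripping the padding markers before tokenization. It assumes each sentence is a list of (word, tag) pairs and that padding is marked by the literal strings 'w_pad' / 't_pad'; the actual data layout may differ.

```python
# Sketch for item 1: drop padding tokens before handing data to BERT.
# Assumes a sentence is a list of (word, tag) pairs (hypothetical layout)
# and that padding uses the literal markers 'w_pad' / 't_pad'.

def strip_padding(sentence):
    """Remove (word, tag) pairs that exist only as padding."""
    return [(w, t) for w, t in sentence if w != 'w_pad' and t != 't_pad']

sentence = [('John', 'B-PER'), ('lives', 'O'), ('w_pad', 't_pad'), ('w_pad', 't_pad')]
print(strip_padding(sentence))  # [('John', 'B-PER'), ('lives', 'O')]
```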
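For items 2 and 2.1, a sketch of building a "valid ids" mask with the Hugging Face `BertTokenizer`. It implements the masking option (only the first sub-token of each word carries a label); the alternative "[INV]" label from item 2.1 is not shown. The model name and example words are placeholders.

```python
# Sketch for item 2 / 2.1: mark which WordPiece positions carry a label.
# Only the first piece of each original word is "valid" (valid_id = 1);
# continuation pieces get 0 and can be masked out of the loss.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def tokenize_with_valid_ids(words):
    tokens, valid_ids = [], []
    for word in words:
        pieces = tokenizer.tokenize(word) or [tokenizer.unk_token]
        tokens.extend(pieces)
        # Only the first sub-token keeps the original word's label.
        valid_ids.extend([1] + [0] * (len(pieces) - 1))
    return tokens, valid_ids

tokens, valid_ids = tokenize_with_valid_ids(['Johanson', 'lives', 'in', 'Gyeongju'])
print(tokens)     # e.g. ['Johan', '##son', 'lives', 'in', 'G', '##ye', '##ong', '##ju']
print(valid_ids)  # e.g. [1, 0, 1, 1, 1, 0, 0, 0]
```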
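For item 3, a sketch that checks whether `max_seq_length=128` covers the data after WordPiece splitting. `train_sentences` is a placeholder for the real (already de-padded) word lists; the `+2` accounts for the `[CLS]` and `[SEP]` tokens BERT adds.

```python
# Sketch for item 3: verify that a max sequence length of 128 is enough
# once every word has been split into WordPieces.

from transformers import BertTokenizer

def max_wordpiece_length(sentences, tokenizer):
    """Longest sequence after WordPiece splitting, including [CLS] and [SEP]."""
    longest = 0
    for words in sentences:
        n_pieces = sum(len(tokenizer.tokenize(w)) for w in words)
        longest = max(longest, n_pieces + 2)  # +2 for [CLS] and [SEP]
    return longest

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
train_sentences = [['John', 'lives', 'in', 'Gyeongju']]  # placeholder for the real data
assert max_wordpiece_length(train_sentences, tokenizer) <= 128
```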