[x] 1. Preprocess the padding tokens ('w_pad' and 't_pad'). It is fine to drop them altogether: the evaluation code already filters them out, and the WordPiece tokenizer used for the BERT model adds its own zero padding. (See the first sketch after this list.)
[x] 2. Account for BERT's WordPiece tokenizer: each token in the training set can be split into multiple sub-tokens when fed into BERT. (See the second sketch after this list.)
    2.1. Decide how to handle "valid ids": either add a separate "[INV]" label like the Naver implementation, or mask the extra sub-tokens so they do not introduce noise.
[x] 3. Check the maximum sequence length in the original data and make sure 128 is sufficient. (See the last sketch after this list.)
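
For item 1, a minimal sketch of stripping the padding markers before tokenization. It assumes each sentence is a list of (word, tag) pairs and that padding is marked by the literal strings 'w_pad' / 't_pad'; the actual data layout may differ.

```python
# Sketch for item 1: drop padding tokens before handing data to BERT.
# Assumes a sentence is a list of (word, tag) pairs (hypothetical layout)
# and that padding uses the literal markers 'w_pad' / 't_pad'.

def strip_padding(sentence):
    """Remove (word, tag) pairs that exist only as padding."""
    return [(w, t) for w, t in sentence if w != 'w_pad' and t != 't_pad']

sentence = [('John', 'B-PER'), ('lives', 'O'), ('w_pad', 't_pad'), ('w_pad', 't_pad')]
print(strip_padding(sentence))  # [('John', 'B-PER'), ('lives', 'O')]
```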
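For items 2 and 2.1, a sketch of building a "valid ids" mask with the Hugging Face `BertTokenizer`. It implements the masking option (only the first sub-token of each word carries a label); the alternative "[INV]" label from item 2.1 is not shown. The model name and example words are placeholders.

```python
# Sketch for item 2 / 2.1: mark which WordPiece positions carry a label.
# Only the first piece of each original word is "valid" (valid_id = 1);
# continuation pieces get 0 and can be masked out of the loss.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def tokenize_with_valid_ids(words):
    tokens, valid_ids = [], []
    for word in words:
        pieces = tokenizer.tokenize(word) or [tokenizer.unk_token]
        tokens.extend(pieces)
        # Only the first sub-token keeps the original word's label.
        valid_ids.extend([1] + [0] * (len(pieces) - 1))
    return tokens, valid_ids

tokens, valid_ids = tokenize_with_valid_ids(['Johanson', 'lives', 'in', 'Gyeongju'])
print(tokens)     # e.g. ['Johan', '##son', 'lives', 'in', 'G', '##ye', '##ong', '##ju']
print(valid_ids)  # e.g. [1, 0, 1, 1, 1, 0, 0, 0]
```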
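For item 3, a sketch that checks whether `max_seq_length=128` covers the data after WordPiece splitting. `train_sentences` is a placeholder for the real (already de-padded) word lists; the `+2` accounts for the `[CLS]` and `[SEP]` tokens BERT adds.

```python
# Sketch for item 3: verify that a max sequence length of 128 is enough
# once every word has been split into WordPieces.

from transformers import BertTokenizer

def max_wordpiece_length(sentences, tokenizer):
    """Longest sequence after WordPiece splitting, including [CLS] and [SEP]."""
    longest = 0
    for words in sentences:
        n_pieces = sum(len(tokenizer.tokenize(w)) for w in words)
        longest = max(longest, n_pieces + 2)  # +2 for [CLS] and [SEP]
    return longest

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
train_sentences = [['John', 'lives', 'in', 'Gyeongju']]  # placeholder for the real data
assert max_wordpiece_length(train_sentences, tokenizer) <= 128
```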