digitalepidemiologylab / covid-twitter-bert

Pretrained BERT model for analysing COVID-19 Twitter data
MIT License

Logic behind [UNK] token for pronouns #5

Closed ogencoglu closed 4 years ago

ogencoglu commented 4 years ago

Thanks a lot for the nice work!

What is the logic behind mapping pronouns to the unknown token [UNK]? This seems like a major deviation from standard BERT models.

For example:

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=False)
tokenizer.tokenize('She is cool!')

outputs ['[UNK]', 'is', 'cool', '!']

while

tokenizer2 = AutoTokenizer.from_pretrained("bert-large-cased", do_lower_case=False)
tokenizer2.tokenize('She is cool!')

outputs ['She', 'is', 'cool', '!']

mar-muel commented 4 years ago

Hi @ogencoglu, thanks for your post! We are using the uncased version of BERT-large, so you need to set do_lower_case to True:

>>> tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=True)
>>> tokenizer.tokenize('She is cool!')
['she', 'is', 'cool', '!']
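
One way to see why the cased form falls back to [UNK] is to look up the vocabulary directly; a minimal sketch, assuming a transformers version that exposes get_vocab():

>>> # The vocab is lowercase-only, so "She" has no entry and becomes [UNK]
>>> # unless do_lower_case=True maps it to "she" first.
>>> vocab = tokenizer.get_vocab()
>>> 'she' in vocab
True
>>> 'She' in vocab
False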

Let me know if it works!

ogencoglu commented 4 years ago

Thanks for the quick reply. My lack of attention :).