EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

Tokenizer is not cased #26

Closed hjian42 closed 4 years ago

hjian42 commented 4 years ago

In your script at https://github.com/EmilyAlsentzer/clinicalBERT/blob/master/lm_pretraining/create_pretraining_data.py, do_lower_case is actually set to "True".

So I went to load the model. When I checked your vocabulary, it is a mix of cased and uncased words, since it is inherited from BioBERT. However, when I used your tokenizer to tokenize a sentence, I found that the words are lowercased.

Do you mind clarifying this a bit? Thanks a lot.
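A quick way to confirm the mixed casing in the vocabulary (a minimal sketch; the True output assumes the cased BioBERT-derived vocab described above):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
>>> # True if any entry in the WordPiece vocab contains uppercase characters
>>> any(w != w.lower() for w in tokenizer.get_vocab())
True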

gunvantc commented 4 years ago

The "True" on line 41 seems to be a default. do_lower_case is set to False when calling this file in create_pretrain_data.sh

hjian42 commented 4 years ago

Right, the bash script sets do_lower_case=False. And the vocab is indeed cased when I load it and look it up, because the tokenizer is inherited from cased BioBERT.

However, when I load the tokenizer and run it on a made-up sentence, I see the following:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
>>> s = "Abnormality is found in the left lobe of the lung."
>>> tokenizer.tokenize(s)
['abnormal', '##ity', 'is', 'found', 'in', 'the', 'left', 'lobe', 'of', 'the', 'lung', '.']

Basically, the tokenizer still lowercases by default, so I am a bit confused here. @EmilyAlsentzer
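One way to see the setting directly (assuming the slow BertTokenizer is loaded, which exposes the flag on its basic_tokenizer attribute):

>>> tokenizer.basic_tokenizer.do_lower_case
True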

gunvantc commented 4 years ago

do_lower_case is an attribute you can set when initializing a tokenizer with transformers. It seems to default to True. Try: tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', do_lower_case=False)
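Spelled out as a runnable snippet (a sketch; assumes a transformers version where the do_lower_case kwarg is honored when loading this checkpoint):

from transformers import AutoTokenizer

# Keep the input cased so it can match the cased vocabulary entries
# inherited from BioBERT (the exact subwords depend on the vocab).
tokenizer = AutoTokenizer.from_pretrained(
    'emilyalsentzer/Bio_ClinicalBERT', do_lower_case=False)
print(tokenizer.tokenize("Abnormality is found in the left lobe of the lung."))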

hjian42 commented 4 years ago

Thanks! That is very helpful!