The "True" on line 41 seems to be a default. do_lower_case is set to False when calling this file in create_pretrain_data.sh
Right, the bash script does seem to set do_lower_case=False. Also, the vocab is cased when I load it and look entries up, which makes sense because the tokenizer is inherited from cased BioBERT.
However, when I load the tokenizer and run it on a made-up sentence, I see the following:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
>>> s = "Abnormality is found in the left lobe of the lung."
>>> tokenizer.tokenize(s)
['abnormal', '##ity', 'is', 'found', 'in', 'the', 'left', 'lobe', 'of', 'the', 'lung', '.']
Basically, the tokenizer still lowercases the input by default, so I am a bit confused here. @EmilyAlsentzer
do_lower_case is an attribute you can set when initializing a tokenizer using transformers. It seems to default to True. Try:
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', do_lower_case=False)
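For example, a minimal sketch (assuming a transformers version where do_lower_case is forwarded to the underlying BERT tokenizer; the exact word pieces you get back depend on the vocab):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', do_lower_case=False)
>>> tokenizer.tokenize("Abnormality is found in the left lobe of the lung.")
With lowercasing disabled, "Abnormality" should be split into cased word pieces from the inherited BioBERT vocab rather than being lowercased first.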
Thanks! That is very helpful!
In your script at https://github.com/EmilyAlsentzer/clinicalBERT/blob/master/lm_pretraining/create_pretraining_data.py, do_lower_case is actually set to True.
So I went to load the model. When I checked your vocabulary, it is a mix of cased and uncased words, since it is inherited from BioBERT. However, when I used your tokenizer to tokenize a sentence, I found that the words are lowercased.
Do you mind clarifying this a bit? Thanks a lot.
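For reference, this is roughly the kind of check I did (a sketch; "Lung" is just an example word, and get_vocab() is the standard transformers accessor):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
>>> vocab = tokenizer.get_vocab()
>>> sum(1 for w in vocab if w != w.lower())  # count vocab entries containing uppercase characters
>>> tokenizer.tokenize("Lung")               # compare with how a cased word is actually tokenized
The first check shows cased entries in the vocab, while the second shows the tokenizer lowercasing the input before lookup, which is the mismatch I am asking about.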