EmilyAlsentzer / clinicalBERT

repository for Publicly Available Clinical BERT Embeddings
MIT License

Vocabulary for the pre-trained model is not updated? Any reason why? #31

Closed NeverInAsh closed 3 years ago

NeverInAsh commented 3 years ago

Thanks for making such a comprehensive BERT model.

I am puzzled by the actual words I find in the model's vocabulary, though. The author mentions that "The Bio_ClinicalBERT model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC […]". I assumed this would mean that the vocabulary would also be updated.

But when I look at the vocabulary, I don't see medical concepts:

from transformers import BertTokenizerFast

# Load the pre-trained tokenizer, which carries the model's vocabulary
tokenizer = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
list(tokenizer.vocab.keys())

['Cafe', 'locomotive', 'sob', 'Emilio', 'Amazing', '##ired', 'Lai', 'NSA', 'counts', '##nius', 'assumes', 'talked', 'ク', 'rumor', 'Lund', 'Right', 'Pleasant', 'Aquino', 'Synod', 'scroll', '##cope', 'guitarist', 'AB', '##phere', 'resulted', 'relocation', 'ṣ', 'electors', '##tinuum', 'shuddered', 'Josephine', '"', 'nineteenth', 'hydroelectric', '##genic', '68', '1000', 'offensive', 'Activities', '##ito', 'excluded', 'dictatorship', 'protruding', '1832', 'perpetual', 'cu', '##36', 'outlet', 'elaborate', '##aft', 'yesterday', '##ope', 'rockets', 'Eduard', 'straining', '510', 'passion', 'Too', 'conferred', 'geography', '38', 'Got', 'snail', 'cellular', '##cation', 'blinked', 'transmitted', 'Pasadena', 'escort', 'bombings', 'Philips', '##cky', 'sacks', '##Ñ', 'jumps', 'Advertising', 'Officer', '##ulp', 'potatoes', 'concentration', 'existed', '##rrigan', '##ier', 'Far', 'models', 'strengthen', 'mechanics'...]

Am I missing something here?

Also, is there an uncased version of this model?

EmilyAlsentzer commented 3 years ago

You are correct that the clinicalBERT models use the exact same vocabulary as the original BERT models. This is because we first initialized the models with the BERT base parameters and then further trained the masked LM & next sentence prediction heads on MIMIC data. While training BERT from scratch on clinical data with a clinical vocabulary would certainly be better, training from scratch is very expensive (i.e. requires extensive GPU resources & time).
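
(For readers curious what such continued pretraining looks like in code, here is a minimal sketch using the Hugging Face Trainer. It runs only the masked-LM objective, whereas the clinicalBERT work also trained the next-sentence-prediction head, and it is not the script actually used for Bio_ClinicalBERT. "mimic_notes.txt" is a placeholder for note text you have licensed access to.)

from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from a general-domain checkpoint and keep its original WordPiece vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Placeholder file: one clinical note (or note chunk) per line.
dataset = load_dataset("text", data_files={"train": "mimic_notes.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of word pieces, the standard BERT masking scheme.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical_bert_continued",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()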

That being said, BERT uses word pieces for its vocabulary, rather than just whole words. Traditionally in NLP, any words not found in the vocabulary are represented as an UNKNOWN token. This makes it difficult to generalize to new domains. However, because BERT uses word pieces, this problem is not as severe. If a word does not appear in the BERT vocabulary during preprocessing, then the word is broken down to its word pieces. For example, penicillin may not be in the BERT vocabulary, but perhaps the word pieces "pen", "i", and "cillin" are present. In this example, the word piece "pen" would then likely have a very different contextual embedding in clinicalBERT compared to general domain BERT because it is frequently found in the context of a drug. In the paper, we show that the nearest neighbors of embeddings of disease & operations-related words make more sense when the words are embedded by clinicalBERT compared to bioBERT & general BERT.
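
(As a concrete illustration of the word-piece behaviour described above, here is a small snippet using the released tokenizer; the exact pieces it produces may differ from the illustrative "pen" / "i" / "cillin" split.)

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')

# An out-of-vocabulary clinical word is split into known word pieces
# ("##"-prefixed pieces continue the previous piece) instead of becoming [UNK].
print(tokenizer.tokenize("The patient was started on penicillin."))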

Unfortunately, we don't have an uncased version of the model at this time.

Hope this helps!

NeverInAsh commented 3 years ago

Thanks for a very crisp reply. One question though: when you say

"nearest neighbors of embeddings of disease & operations-related words make more sense when the words are embedded by clinicalBERT compared to bioBERT"

Is there any non-empirical explanation for this? BioBERT seems to have a custom vocabulary that covers many medical concepts. I am attaching an image showing the top 20 UMLS concepts by count in its vocabulary. [image attached]
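
(For anyone who wants to probe this empirically, here is a rough sketch, not the paper's methodology: mean-pool each word's contextual word-piece vectors under both models and compare cosine similarities. The word list is arbitrary and "bert-base-cased" stands in for general-domain BERT.)

import torch
from transformers import AutoModel, AutoTokenizer

def word_vector(word, tokenizer, model):
    # Mean-pool the contextual vectors of the word's pieces (skip [CLS]/[SEP]).
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, 1:-1].mean(dim=0)

words = ["pneumonia", "bronchitis", "appendectomy", "guitar"]
for name in ["bert-base-cased", "emilyalsentzer/Bio_ClinicalBERT"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    vecs = {w: word_vector(w, tokenizer, model) for w in words}
    sims = {w: float(torch.nn.functional.cosine_similarity(
        vecs["pneumonia"], vecs[w], dim=0)) for w in words if w != "pneumonia"}
    print(name, sims)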

EmilyAlsentzer commented 3 years ago

I think BioBERT updated their model recently (or at least after clinicalBERT was published). The model we compared against in our paper used the same vocabulary as BERT. Check out the issue on their GitHub where someone had a similar question to yours.

I do agree with you that a custom vocabulary would likely be better. I don't currently have the bandwidth to train it, but if you end up doing so, let us know!