EmilyAlsentzer / clinicalBERT

repository for Publicly Available Clinical BERT Embeddings
MIT License
658 stars 134 forks

inquiry about cosine similarity between tokenized sentences #33

Closed johnsonpeng5 closed 3 years ago

johnsonpeng5 commented 3 years ago

Hi there; you're honestly doing God's work here by sharing this on Hugging Face.

I am, however, quite confused about how to use this tool appropriately. I was originally trying to embed sentences with the clinicalBERT model trained on discharge summaries, to see whether it could recognize similar medical terminology and cluster it together, or return high-similarity words. So far, base BERT seems to perform better. Is there any chance your work gets extended into an STS-B (Semantic Textual Similarity Benchmark) style model?
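For reference, a common way to compare sentences with a BERT-style model is to mean-pool the token embeddings (masking out padding) and take the cosine similarity of the pooled vectors. The sketch below shows just that pooling-and-similarity step in NumPy; the `hidden` array is a random stand-in for what `model(**inputs).last_hidden_state` would return after loading something like `emilyalsentzer/Bio_ClinicalBERT` with Hugging Face `transformers` (the model loading itself is omitted here and the array shapes are assumptions).

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)                      # avoid div by zero
    return summed / counts

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock hidden states standing in for model(**inputs).last_hidden_state:
# batch of 2 sentences, 5 tokens each, hidden size 768 (BERT-base).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 768))
mask = np.array([[1, 1, 1, 0, 0],   # first sentence: 3 real tokens, 2 pads
                 [1, 1, 1, 1, 1]])  # second sentence: no padding

emb = mean_pool(hidden, mask)
sim = cosine_similarity(emb[0], emb[1])
assert -1.0 <= sim <= 1.0
```

Note that [CLS]-token embeddings from a model that was not fine-tuned for similarity often compare poorly; mean pooling is the usual baseline, and a model fine-tuned on STS data (as the question asks about) would do better still.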

EmilyAlsentzer commented 3 years ago

Apologies for the delayed response! Unfortunately, there aren't any immediate plans to do any Semantic Textual Similarity work. I am surprised to hear that BERT-base is performing better. Are you seeing large performance differences? This likely isn't the cause, but I will note that the clinicalBERT model is cased, so in the off chance you lowercased all of your text, that could contribute.
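Since the casing point is easy to trip over, a cheap guard in a preprocessing pipeline is to fail loudly if mixed-case input came out all-lowercase before reaching a cased model. This is a hypothetical helper (not part of the clinicalBERT codebase), just illustrating the check:

```python
def assert_case_preserved(original: str, processed: str) -> None:
    """Guard for cased models like clinicalBERT: if the original text had
    uppercase characters but the processed text has none, something
    upstream (e.g. a stray .lower()) stripped the casing."""
    if original != original.lower() and processed == processed.lower():
        raise ValueError(
            "text appears to have been lowercased; "
            "clinicalBERT is a cased model and expects original casing"
        )

assert_case_preserved("Pt c/o SOB", "Pt c/o SOB")      # fine: casing intact
try:
    assert_case_preserved("Pt c/o SOB", "pt c/o sob")  # lowercased upstream
except ValueError:
    print("caught accidental lowercasing")
```

With Hugging Face `transformers`, the equivalent sanity check is simply making sure you did not pass `do_lower_case=True` (or call `.lower()` yourself) anywhere before tokenization.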

EmilyAlsentzer commented 3 years ago

Closing this for now, but feel free to reopen if you have any follow up questions.