EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

Question: Better to use regular BioBERT on a dataset without marked PHI? #37

caleb-lindgren closed this issue 3 years ago

caleb-lindgren commented 3 years ago

Thanks for this cool resource. I'm just trying to figure out if it's the best model for my project. In the results section of your paper, it says:

De-ID challenge data presents a different data distribution than MIMIC text. In MIMIC, PHI is identified and replaced with sentinel PHI markers, whereas in the de-ID task, PHI is masked with synthetic, but realistic PHI. This data drift would be problematic for any embedding model, but will be especially damaging to contextual embedding models like BERT because the underlying sentence structure will have changed: in raw MIMIC, sentences with PHI will universally have a sentinel PHI token. In contrast, in the de-ID corpus, all such sentences will have different synthetic masks, meaning that a canonical, nearly constant sentence structure present during BERT’s training will be non-existent at task-time. For these reasons, we think it is sensible that clinical BERT is not successful on the de-ID corpora.

I'm working with EHR data for patients with multiple myeloma. The records are not de-identified in any way--they're just the regular doctors' notes, lab reports, etc. with real place names, person names, and dates. So to me, it sounds like my data is more like the de-ID dataset than the MIMIC dataset, since the PHI isn't tagged or replaced with sentinel markers. Would I possibly be better off just using the regular BioBERT model then, since that model performed better on the de-ID dataset?

EmilyAlsentzer commented 3 years ago

I think clinicalBERT should be fine to use as long as you're not explicitly doing a de-identification task. It outperformed on all of the other i2b2 challenges, and I know folks have used clinicalBERT with notes containing PHI.
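
As a rough sketch of what that looks like in practice (this assumes the HuggingFace hub ID `emilyalsentzer/Bio_ClinicalBERT`, which is where this repo's checkpoint is published; the note text below is synthetic):

```python
# Minimal sketch: embed a raw, non-de-identified note with Bio_ClinicalBERT.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

# A synthetic note fragment with realistic (fake) PHI, like the de-ID corpus.
note = "Pt John Smith seen at Mercy General on 03/14 for multiple myeloma follow-up."

inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into one note-level embedding (shape 1 x 768).
embedding = outputs.last_hidden_state.mean(dim=1)
```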

That being said, you could also try Microsoft's more recent PubMedBERT for comparison. It isn't trained on clinical text, which is a big downside, but it has the benefit of being pretrained from scratch with a domain-specific vocabulary.
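
If it helps, here's a hedged sketch of setting up that comparison; the BioBERT and PubMedBERT hub IDs below are the commonly used HuggingFace names, not identifiers from this thread, so double-check them before relying on this:

```python
# Hedged sketch: compare candidate checkpoints by swapping only the hub ID.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CANDIDATES = {
    "clinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "BioBERT": "dmis-lab/biobert-v1.1",
    "PubMedBERT": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
}

def load_for_finetuning(name: str, num_labels: int = 2):
    """Load a tokenizer/model pair ready to fine-tune on a classification task."""
    hub_id = CANDIDATES[name]
    tokenizer = AutoTokenizer.from_pretrained(hub_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        hub_id, num_labels=num_labels
    )
    return tokenizer, model

tokenizer, model = load_for_finetuning("PubMedBERT")
```

Since only the hub ID changes, you can run the same fine-tuning loop over all three checkpoints and pick whichever does best on a held-out slice of your myeloma notes.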

Hope this helps. Feel free to reopen if you have any more questions.