Closed caleb-lindgren closed 3 years ago
I think it should be fine to use as long as you're not explicitly doing a de-identification task. Clinical BERT came out ahead on all of the other i2b2 challenges, and I know folks have used Clinical BERT with notes containing PHI.
That being said, you could also try Microsoft's more recent PubMedBERT for comparison. It isn't trained on clinical text, which is a big downside, but it has the benefit of being pretrained from scratch with its own domain-specific vocabulary.
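If it helps, here is a rough sketch of how such a side-by-side comparison could be set up. It assumes a token-labelling task and a small hand-labelled sample of your own notes; `micro_f1`, `compare_models`, and the predictor callables are hypothetical names made up for illustration, not part of any released code. Each predictor would wrap one fine-tuned candidate (Clinical BERT, BioBERT, PubMedBERT) and return one label per token.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types for this sketch: an example is (tokens, gold_labels),
# and a predictor maps a token sequence to one predicted label per token.
Example = Tuple[Sequence[str], Sequence[str]]
Predictor = Callable[[Sequence[str]], Sequence[str]]


def micro_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Token-level micro F1 over every label except the `outside` label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # True positive: a non-outside gold label predicted exactly.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    # False positive: a non-outside prediction that does not match gold.
    fp = sum(1 for g, p in zip(gold, pred) if p != outside and g != p)
    # False negative: a non-outside gold label that was not predicted.
    fn = sum(1 for g, p in zip(gold, pred) if g != outside and g != p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compare_models(
    candidates: Dict[str, Predictor], examples: List[Example]
) -> List[Tuple[str, float]]:
    """Score each candidate on the same examples; best score first."""
    gold_all = [label for _, labels in examples for label in labels]
    scores = []
    for name, predict in candidates.items():
        pred_all = [label for tokens, _ in examples for label in predict(tokens)]
        scores.append((name, micro_f1(gold_all, pred_all)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for fine-tuned models, only to show the calling pattern.
    sample = [(["IgG", "kappa", "myeloma", "today"], ["B", "I", "I", "O"])]
    toy_candidates = {
        "always_outside": lambda tokens: ["O"] * len(tokens),
        "oracle": lambda tokens: ["B", "I", "I", "O"],
    }
    for name, score in compare_models(toy_candidates, sample):
        print(f"{name}: {score:.3f}")
```

The point of scoring every candidate on the same held-out sample of your own notes is that published numbers on i2b2 or MIMIC may not transfer to records that still contain real names, places, and dates.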
Hope this helps. Feel free to reopen if you have any more questions.
Thanks for this cool resource. I'm just trying to figure out if it's the best model for my project. In the results section of your paper, it says:
I'm working with EHR for patients with multiple myeloma. The records are not de-identified in any way--they're just the regular doctors' notes, lab reports, etc. with real place names, person names, and dates. So to me, it sounds like my data is more like the de-ID dataset than the MIMIC dataset, since PHI aren't tagged in any way. Would I possibly be better off just using the regular BioBERT model then, since that model performed better on the de-ID dataset?