EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

Question: Better to use regular BioBERT on a dataset without marked PHI? #37

caleb-lindgren closed this issue 3 years ago

caleb-lindgren commented 3 years ago

Thanks for this cool resource. I'm just trying to figure out if it's the best model for my project. In the results section of your paper, it says:

De-ID challenge data presents a different data distribution than MIMIC text. In MIMIC, PHI is identified and replaced with sentinel PHI markers, whereas in the de-ID task, PHI is masked with synthetic, but realistic PHI. This data drift would be problematic for any embedding model, but will be especially damaging to contextual embedding models like BERT because the underlying sentence structure will have changed: in raw MIMIC, sentences with PHI will universally have a sentinel PHI token. In contrast, in the de-ID corpus, all such sentences will have different synthetic masks, meaning that a canonical, nearly constant sentence structure present during BERT’s training will be non-existent at task-time. For these reasons, we think it is sensible that clinical BERT is not successful on the de-ID corpora.

I'm working with EHR data for patients with multiple myeloma. The records are not de-identified in any way--they're just the regular doctors' notes, lab reports, etc. with real place names, person names, and dates. So to me, it sounds like my data is more like the de-ID dataset than the MIMIC dataset, since the PHI isn't tagged or replaced with sentinel markers. Would I possibly be better off just using the regular BioBERT model then, since that model performed better on the de-ID dataset?

EmilyAlsentzer commented 3 years ago

I think clinicalBERT should be fine to use as long as you're not explicitly doing a de-identification task. It outperformed on all of the other i2b2 challenges, and I know folks have used clinicalBERT with notes containing PHI.
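
As a rough sketch of what that looks like in practice (this assumes the HuggingFace hub ID `emilyalsentzer/Bio_ClinicalBERT`, which is where this repo's checkpoint is published; the note text below is synthetic):

```python
# Minimal sketch: embed a raw, non-de-identified note with Bio_ClinicalBERT.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

# A synthetic note fragment with realistic (fake) PHI, like the de-ID corpus.
note = "Pt John Smith seen at Mercy General on 03/14 for multiple myeloma follow-up."

inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into one note-level embedding (shape 1 x 768).
embedding = outputs.last_hidden_state.mean(dim=1)
```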

That being said, you could also try Microsoft's more recent PubMedBERT for comparison. It isn't trained on clinical text, which is a big downside, but it has the benefit of being pretrained from scratch with a domain-specific vocabulary.
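
If it helps, here's a hedged sketch of setting up that comparison; the BioBERT and PubMedBERT hub IDs below are the commonly used HuggingFace names, not identifiers from this thread, so double-check them before relying on this:

```python
# Hedged sketch: compare candidate checkpoints by swapping only the hub ID.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CANDIDATES = {
    "clinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "BioBERT": "dmis-lab/biobert-v1.1",
    "PubMedBERT": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
}

def load_for_finetuning(name: str, num_labels: int = 2):
    """Load a tokenizer/model pair ready to fine-tune on a classification task."""
    hub_id = CANDIDATES[name]
    tokenizer = AutoTokenizer.from_pretrained(hub_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        hub_id, num_labels=num_labels
    )
    return tokenizer, model

tokenizer, model = load_for_finetuning("PubMedBERT")
```

Since only the hub ID changes, you can run the same fine-tuning loop over all three checkpoints and pick whichever does best on a held-out slice of your myeloma notes.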

Hope this helps. Feel free to reopen if you have any more questions.