EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

Multi-label classification of clinical text #21

Closed SaiTeja390 closed 4 years ago

SaiTeja390 commented 4 years ago

I'm trying to do multi-label classification (MLC) using the pre-trained weights (the model trained on all notes, as described in the paper). The data is somewhat imbalanced, i.e., some classes occur more frequently than others. After applying the ML-ROS oversampling technique, the mean IRLbl (imbalance ratio per label) decreased, but the data is still skewed, so the model predicts the most frequently occurring labels every time (for any random input). Do you have any suggestions here?
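
For concreteness, here is a minimal sketch (not from this repo, and not the author's method) of the kind of multi-label setup described above, using per-label positive weights in `BCEWithLogitsLoss` as one common complement to oversampling for imbalanced label sets. The Hugging Face checkpoint name, label counts, and label count are assumptions for illustration only.

```python
# Sketch: clinicalBERT encoder + linear multi-label head, with per-label
# pos_weight so rare labels are penalized more when missed.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 10  # hypothetical number of labels

class ClinicalBertMultiLabel(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        # Assumed Hugging Face release of the "all notes" clinicalBERT weights.
        self.encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] representation
        return self.classifier(cls)         # raw logits, one per label

# pos_weight = (#negatives / #positives) per label, computed on the training set.
label_counts = torch.tensor([500., 120., 30., 900., 60., 15., 300., 45., 10., 200.])  # hypothetical
n_examples = 1000.
pos_weight = (n_examples - label_counts) / label_counts
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

model = ClinicalBertMultiLabel(NUM_LABELS)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
batch = tokenizer(["pt admitted with chest pain", "discharged home in stable condition"],
                  padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
targets = torch.zeros(2, NUM_LABELS)        # hypothetical multi-hot label matrix
loss = criterion(logits, targets)
```
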

EmilyAlsentzer commented 4 years ago

If I'm understanding your question correctly, it sounds like this issue is occurring because the labels of your downstream task are biased. I think this question is out of scope for this repo, and I would recommend you ask it on a website like Cross Validated.

If you're concerned about bias in the data sources for the language model, remember that the clinicalBERT models are trained on either all notes in MIMIC or only discharge summaries. The models trained on all notes in MIMIC will likely be biased towards Nursing/other and Radiology notes since those are most frequent (see the table in the Appendix in the clinicalBERT paper for distributions of note types).
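
For reference, a quick sketch of loading the two variants mentioned above, assuming the Hugging Face checkpoint names for the "all notes" and "discharge summaries only" releases; if you obtained the weights from this repo's download links instead, point `from_pretrained` at the local directory.

```python
from transformers import AutoModel, AutoTokenizer

# Trained on all MIMIC note types (skewed toward Nursing/other and Radiology notes).
all_notes_model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
all_notes_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Trained on discharge summaries only.
discharge_model = AutoModel.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
discharge_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
```
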

Feel free to reopen if you think that this question is more directly related to the clinicalBERT models.