Datasets - Githubissues

phaniram-sayapaneni commented 4 years ago

Hello @EmilyAlsentzer,

This is a great contribution to the open source community! I have read your paper thoroughly: https://www.aclweb.org/anthology/W19-1909.pdf

I have a few questions:

I would love to try both clinicalBERT and BioBERT on a few downstream tasks (disease identification); however, I do not have much training data (in fact, I have none). Could you please point me to open-source data repositories that already provide a notes --> disease mapping?
I see you used the standard BERT pretraining approach (MLM); however, I would like to explore other pretraining strategies, such as Replaced Token Detection from ELECTRA. I also see that you used the MIMIC-III dataset for pretraining. I don't have access to this dataset to evaluate on. What would you suggest as pretraining datasets?
I would also love to try new transformer variants (larger ones, low-parameter ones) and do multitask learning, so datasets (de-identified, PHI-free, and sufficiently large) seem to be the bottleneck. How can I overcome this?

I would open-source all my work in PyTorch if I could find a tangible data source. Please let me know. Thanks!

tnaumann commented 4 years ago

A few questions for clarification:

For additional publicly-available datasets, you may want to take a look at the Clinical Natural Language Processing Workshop resources page: https://clinical-nlp.github.io/2020/resources.html. Since disease is often a characteristic of a patient, can you clarify what you mean by a notes --> disease mapping?
Great! There are a number of new strategies that have been developed over the last year, and it would be fantastic to see an exploration of the impact of these strategies for clinical notes. The MIMIC-III dataset is publicly-available and you can apply for access if you would like to use the same underlying data: https://mimic.mit.edu/. Are you specifically looking for a different dataset than MIMIC?
It would also be fantastic to see an exploration of additional transformers. Is the question here, how do we enable access to large clinical corpora?
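
For readers comparing the two pretraining objectives mentioned in this thread, here is a minimal sketch of how their training targets differ. It is deliberately simplified and is not the clinicalBERT or ELECTRA implementation: the function names `mlm_example` and `rtd_example` are hypothetical, MLM here always substitutes `[MASK]` (BERT also sometimes keeps or randomly swaps the selected token), and the replacement tokens are drawn at random rather than sampled from a small generator network as in ELECTRA.

```python
import random

MASK = "[MASK]"


def mlm_example(tokens, positions):
    """Masked language modeling: hide the selected tokens and predict them.

    Targets exist only at the masked positions.
    """
    inputs = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]  # the model must recover the original token
        inputs[i] = MASK
    return inputs, targets


def rtd_example(tokens, positions, vocab, rng):
    """Replaced token detection: swap the selected tokens and classify every
    position as original (0) or replaced (1).

    Assumption: replacements are random draws from `vocab`; ELECTRA samples
    them from a generator model instead.
    """
    inputs = list(tokens)
    for i in positions:
        inputs[i] = rng.choice(vocab)
    # A replacement that happens to equal the original counts as original.
    labels = [int(new != old) for new, old in zip(inputs, tokens)]
    return inputs, labels


# Synthetic tokens, not taken from any clinical dataset.
tokens = ["patient", "denies", "chest", "pain"]
mlm_inputs, mlm_targets = mlm_example(tokens, positions=[2])
rtd_inputs, rtd_labels = rtd_example(
    tokens, positions=[2], vocab=["fever", "cough", "chest"], rng=random.Random(0)
)
```

The practical difference is that MLM produces a learning signal only at the masked positions, while replaced token detection produces a label for every token in the sequence.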

phaniram-sayapaneni commented 4 years ago

Hi @tnaumann,

Thanks, https://clinical-nlp.github.io/2020/resources.html is very helpful. Answering your questions:

By notes --> disease, I was referring to datasets that have clinical notes labeled with symptoms and diseases (ICD-10, etc.).
I am in the process of requesting access from https://mimic.mit.edu/. I'm mostly looking for clinical notes related to symptoms and diagnoses, for downstream tasks (diagnosis).
It would be nice to know what pretraining data was used for clinicalBERT (or how it was assembled from MIMIC), so that I can replicate the same data conditions with a different pretraining technique.

Thanks!

tnaumann commented 4 years ago

Thanks for the clarifications.

You can likely do this in MIMIC. Since ICD is a characteristic at the level of a patient-stay, you could imagine taking the discharge summary for a patient-stay and the corresponding ICD codes for that stay in order to provide such a mapping. That being said, you should be careful about the differences between ICD (billing) and disease (physiology). You might also look to some of the n2c2 (previously i2b2) tasks for symptom, disease, treatment span annotations.
Per above, you can likely grab these from MIMIC, but you might also want to check out the n2c2 NLP Research Data Sets (https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/) for the tasks that you're targeting.
Take a look at the lm_pretraining directory of this repository, in particular format_mimic_for_BERT.py, to see how MIMIC data were processed prior to pretraining.
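
The discharge-summary-to-ICD pairing described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from this repository: it assumes rows shaped like the MIMIC-III `NOTEEVENTS` and `DIAGNOSES_ICD` tables (columns `HADM_ID`, `CATEGORY`, `TEXT`, `ICD9_CODE`), assumes discharge summaries carry the category value `Discharge summary`, and uses a hypothetical helper name `build_note_to_icd`. Note that MIMIC-III records ICD-9 codes, and the rows below are synthetic.

```python
from collections import defaultdict


def build_note_to_icd(note_rows, diagnosis_rows):
    """Pair each discharge summary with the ICD codes recorded for the same stay.

    Both arguments are iterables of dicts, e.g. as yielded by csv.DictReader.
    """
    # Collect the codes for each hospital admission (patient-stay).
    codes_by_stay = defaultdict(list)
    for row in diagnosis_rows:
        if row["ICD9_CODE"]:
            codes_by_stay[row["HADM_ID"]].append(row["ICD9_CODE"])

    examples = []
    for row in note_rows:
        # Keep only discharge summaries that are linked to a stay.
        if row["CATEGORY"].strip() != "Discharge summary" or not row["HADM_ID"]:
            continue
        codes = sorted(set(codes_by_stay.get(row["HADM_ID"], [])))
        if codes:  # skip stays with no recorded codes
            examples.append(
                {"hadm_id": row["HADM_ID"], "text": row["TEXT"], "labels": codes}
            )
    return examples


# Synthetic rows for illustration only; no real patient data.
notes = [
    {"HADM_ID": "100", "CATEGORY": "Discharge summary", "TEXT": "Synthetic note A."},
    {"HADM_ID": "100", "CATEGORY": "Nursing", "TEXT": "Synthetic note B."},
    {"HADM_ID": "200", "CATEGORY": "Discharge summary", "TEXT": "Synthetic note C."},
]
diagnoses = [
    {"HADM_ID": "100", "ICD9_CODE": "4280"},
    {"HADM_ID": "100", "ICD9_CODE": "25000"},
    {"HADM_ID": "100", "ICD9_CODE": "4280"},
]
examples = build_note_to_icd(notes, diagnoses)
```

Each resulting example is a multi-label instance (one note, several codes). As noted above, these labels reflect billing codes for the stay, so they should be treated as a proxy for disease rather than ground truth.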

EmilyAlsentzer commented 4 years ago

Closing this issue due to lack of activity, but feel free to reopen it if you have any additional questions.

EmilyAlsentzer / clinicalBERT