EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

Question about mimic data preparation process #25

Closed: passing2961 closed this issue 4 years ago

passing2961 commented 4 years ago

Hi Emily,

After acquiring access to the MIMIC-III database, I preprocessed the data following your procedure (i.e., format_mimic_for_BERT.py).

However, I'm not confident about the results below. Are they correct?

(after format_mimic_for_BERT.py)

Thanks Young-Jun

EmilyAlsentzer commented 4 years ago

Hi Young-Jun,

Those results look reasonable. Remember that the job of that script is just to split the text into sentences; the BERT LM pretraining scripts take care of the tokenization.
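For anyone landing here later, the expected output is roughly one sentence per line. Here is a minimal sketch of that step, assuming a plain spaCy pipeline; the actual script's model choice and sentence-boundary handling may differ, and the note text below is synthetic, not MIMIC data:

```python
# Minimal sketch of the sentence-splitting step (assumption: plain spaCy;
# the repo's script may use a different model or custom boundary rules).
import spacy

nlp = spacy.load("en_core_web_sm")  # hypothetical model choice

# Synthetic example text -- NOT MIMIC data, which is covered by a DUA.
note = "Patient admitted with chest pain. EKG unremarkable. Started on aspirin."
doc = nlp(note)

# Emit one sentence per line, the format the BERT LM pretraining scripts
# expect before they apply WordPiece tokenization themselves.
for sent in doc.sents:
    print(sent.text.strip())
```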

Since the MIMIC data is protected by a DUA, I ask that you please remove the examples from your question above ASAP. Data from MIMIC (even if only small paragraphs) should not be posted publicly on the web without permission. Thanks.

passing2961 commented 4 years ago

Dear Emily,

Thanks for the kind reply. As you asked, I've removed the examples.

Sorry about that; I'll be more careful next time.

EmilyAlsentzer commented 4 years ago

Thanks for removing. I'm closing this issue, but feel free to reopen if your question wasn't fully addressed.