Open sumedhasingla opened 6 years ago
Results on 7800 reports (we have matching images for a subset of these reports: RAD-ALL-Findings-Impressions.csv).
Most frequent words:
Current size of vocabulary: 7495 distinct words.
The length of FINDINGS varies from 0 to 651 words and from 0 to 66 sentences. The length of IMPRESSIONS varies from 0 to 327 words and from 0 to 30 sentences.
FYI @pyadolla
Thanks @sumedhasingla. This word count figure is very informative. The most frequent words are "no", "right" and "left", which shows that we have to find a good negation pipeline.
Also, what do you think about this:
If we get a radiologist intern (I am working on it), do you think it makes sense to draw an "attention" box for each sentence?
@sumedhasingla can you show the word counts for impressions and findings separately?
Results on 548,369 reports. These are all the reports in folder: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/Reports'. It contains reports from Lung (2013-2017), spine (2016-2017), neck (2016-2017), head (2017), abdomen (2017).
The impression and finding sections were extracted from all the reports and saved at location: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/singla/Processed_Reports/Text_To_Train_Glove'
Distribution of words: The length of FINDINGS varies from 1 to 1586 words and from 1 to 127 sentences. The length of IMPRESSIONS varies from 1 to 693 words and from 1 to 67 sentences.
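For reference, a minimal sketch of how such length ranges could be computed. The whitespace tokenization and the naive sentence splitter are assumptions; the thread does not say which tokenizer was used, and the function names and sample texts are hypothetical.

```python
import re

def count_words(text):
    # Whitespace tokenization (an assumption, not necessarily what was used above).
    return len(text.split())

def count_sentences(text):
    # Naive split on ., ! or ? followed by whitespace or end of text.
    # Decimals such as "1.5 cm" are not split because no whitespace follows the dot.
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])

def length_range(sections):
    # Returns (min, max) for word counts and sentence counts over all sections.
    words = [count_words(s) for s in sections]
    sents = [count_sentences(s) for s in sections]
    return {"words": (min(words), max(words)), "sentences": (min(sents), max(sents))}

# Toy FINDINGS sections, made up for illustration only.
findings = [
    "No pleural effusion. The heart is normal in size.",
    "Stable nodule in the right upper lobe.",
]
stats = length_range(findings)
```

A real run would likely want a clinical-text-aware sentence splitter, since abbreviations in reports can confuse a regex like this one.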
Most frequent words in findings
Most frequent words in impressions
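A rough sketch of a per-section word count, assuming lowercased alphabetic tokens. The helper name `top_words` and the sample texts are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

def top_words(texts, k=3):
    # Count lowercased alphabetic tokens across all texts of one section type.
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(k)

# Toy data, made up for illustration only.
findings = ["No pleural effusion.", "No mass in the right lung."]
impressions = ["No acute findings.", "No change."]

top_findings = top_words(findings, k=1)
top_impressions = top_words(impressions, k=1)
```

Keeping the two sections in separate counters makes it easy to compare their vocabularies side by side.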
Thanks @sumedhasingla
To calculate statistics about the most frequent tags, I ran the NOBLE tool on 6K reports and computed the inverse document frequency. Top 20 concepts and semantic types:
Concept | idf | Semantic type | idf
---|---|---|---
impression (attitude) | 0.996964 | Finding | 1
Radiologic Impression | 0.996964 | Qualitative Concept | 0.999831
Unremarkable | 0.718455 | Body Part, Organ, or Organ Component | 0.999325
Mass | 0.647267 | Mental Process | 0.997132
Breast Lump | 0.589575 | Spatial Concept | 0.994433
Focus (area of enhancement) | 0.56444 | Disease or Syndrome | 0.994433
Pericardial effusion | 0.550101 | Quantitative Concept | 0.986336
Mild | 0.538124 | Functional Concept | 0.976889
Lesion | 0.529352 | Pathologic Function | 0.962719
Pericardial effusion body substance | 0.512146 | Body Location or Region | 0.947031
Pleural Effusion | 0.505735 | Intellectual Product | 0.93303
Lymphatic Diseases | 0.486842 | Idea or Concept | 0.898617
Lymphadenopathy | 0.486842 | Temporal Concept | 0.842611
Mild Adverse Event | 0.443826 | Body Substance | 0.763158
Nodule | 0.433536 | Conceptual Entity | 0.708333
centimeter | 0.432692 | Therapeutic or Preventive Procedure | 0.637314
No status change | 0.431174 | Neoplastic Process | 0.615385
Curium | 0.430837 | Acquired Abnormality | 0.61471
Small | 0.430162 | Tissue | 0.612854
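As a point of comparison, here is a minimal sketch of the textbook inverse document frequency, idf(t) = log(N / df(t)), computed over per-report tag lists. This is an assumption about the general technique only; the exact normalization behind the numbers in the table is not stated in this thread and may differ. The function name and toy tags are hypothetical.

```python
import math

def inverse_document_frequency(doc_tags):
    # doc_tags: one list of tags (concepts or semantic types) per report.
    n_docs = len(doc_tags)
    df = {}
    for tags in doc_tags:
        for tag in set(tags):  # count each tag at most once per report
            df[tag] = df.get(tag, 0) + 1
    # Textbook idf; rarer tags get larger values.
    return {tag: math.log(n_docs / count) for tag, count in df.items()}

# Toy reports, made up for illustration only.
reports = [
    ["Mass", "Pleural Effusion"],
    ["Mass"],
    ["Mass", "Nodule"],
    ["Nodule"],
]
idf = inverse_document_frequency(reports)
```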
@sumedhasingla hmmm... many of those terms are too general and hence non-informative. Interesting things to look into:
Pleural Effusion
Lymphadenopathy
Pericardial effusion --- why is this intellectual product?!
Pericardial effusion body substance
What are those? Would you please take a look.
I also wonder what examples there are of the following concepts: Mental Process, Intellectual Product, Conceptual Entity, Neoplastic Process.
There are semantic tags that we should be able to predict; for example, see below. I think these should have a salient visual representation:
@sumedhasingla could you please eyeball which concepts/semantic types co-occur? Given that we are going to do multi-label prediction, knowing this will be helpful.
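One possible way to go beyond eyeballing is to count how often each pair of tags appears in the same report. The sketch below assumes each report has already been reduced to a list of tags; the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(doc_tags):
    # Count unordered tag pairs that appear together in the same report.
    pairs = Counter()
    for tags in doc_tags:
        # sorted(set(...)) gives each pair a single canonical ordering.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy reports, made up for illustration only.
reports = [
    ["Pleural Effusion", "Lymphadenopathy", "Nodule"],
    ["Pleural Effusion", "Lymphadenopathy"],
    ["Nodule"],
]
pairs = cooccurrence_counts(reports)
most_common_pair = pairs.most_common(1)[0]
```

For multi-label prediction, pairs with high counts hint at labels that are strongly correlated and may not need separate outputs.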
Thanks @sumedhasingla
We have been using Noble Coder to identify SNOMED-CT concepts in CT reports of patients with trauma. The goal is to identify sentences that are incidental findings in trauma CT reports and the work is being done by Gaurav Trivedi in ISP. We noticed that the versions of vocabularies that come with Noble are quite old.
You can get the latest UMLS in NLM format, unzip and concatenate MR*.RRF files (cat MRCONSO.RRF.a > MRCONSO.RRF) and use Noble Coder to import as RRF. If you need the concept type hierarchy, download the latest NCI Metathesaurus in RRF format. Contact Eugene Tseytlin at eugene.tseytlin@gmail.com if you need more information about Noble Coder.
The UMLS semantic types are at https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. If useful you can selectively annotate only those types that are of interest.
In our experiments so far it seems that bag of words with and without embeddings is not improved by adding UMLS or SNOMED-CT concepts.
Thanks @sumedhasingla for the comment.
@sumedhasingla @pyadolla FYI.
Result of the NOBLE tool on 6k reports. To get a better understanding of the semantic types and concepts, I collected stats on how many times any word is tagged with a given (semantic type, concept) pair. Below are some examples of concepts within a semantic type:

Mental Process: 20 (number of concepts in this semantic type)
Regression - mental defense mechanism Content impression (attitude) Psychological habituation Suspicion Persistence Interpretation Attention Concentration Psychological fixation Body Image

Intellectual Product: 165 (number of concepts in this semantic type)
Uncertain Observation Interpretation - intermediate Group Object Financial Account International Classification of Diseases Participation Type - device attestation Leaflet Severe Extremity Pain Pattern of Bowel Movements

Conceptual Entity: 80 (number of concepts in this semantic type)
Upper Limit of Normal Study Terminated Combination Computer Configuration Fusiform shape Content Background dictated - ParticipationMode Spicule Low Risk

Neoplastic Process: 69 (number of concepts in this semantic type)
Breast Carcinoma Esophageal Neoplasms Thyroid Gland Nodule Metastatic Malignant Neoplasm in the Pleura Cyst Metastatic Malignant Neoplasm in the Lymph Nodes Minimal Residual Disease Mass of thyroid gland Leiomyoma Plasma Cell Myeloma
The interesting semantic types are: 'Acquired Abnormality', 'Anatomical Abnormality', 'Congenital Abnormality', 'Sign or Symptom', 'Finding', 'Pathologic Function', 'Disease or Syndrome', 'Disease or Syndrome, Anatomical Abnormality'
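A small sketch of how the (semantic type, concept) counts could be restricted to the types listed above. The annotation tuples `(word, concept, semantic_type)` are a hypothetical format chosen for illustration, not the actual NOBLE output format, and the toy annotations are made up.

```python
from collections import Counter

# Semantic types copied from the list above.
INTERESTING_TYPES = {
    "Acquired Abnormality",
    "Anatomical Abnormality",
    "Congenital Abnormality",
    "Sign or Symptom",
    "Finding",
    "Pathologic Function",
    "Disease or Syndrome",
    "Disease or Syndrome, Anatomical Abnormality",
}

def count_interesting(annotations):
    # annotations: iterable of (word, concept, semantic_type) tuples (assumed format).
    counts = Counter()
    for word, concept, semantic_type in annotations:
        if semantic_type in INTERESTING_TYPES:
            counts[(semantic_type, concept)] += 1
    return counts

# Toy annotations, made up for illustration only.
annotations = [
    ("effusion", "Pleural Effusion", "Disease or Syndrome"),
    ("effusions", "Pleural Effusion", "Disease or Syndrome"),
    ("impression", "impression (attitude)", "Mental Process"),
    ("cyst", "Cyst", "Neoplastic Process"),
]
counts = count_interesting(annotations)
```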
[x] Extract just the finding and impression sections from the reports
[x] Get an idea of distribution of length (number of words, sentences) of finding and impression section in different reports.
[x] Perform the word-count. Find the most frequent words.
[ ] Find the most frequent sentences.
Group the similar sentences
Fix some metric to define sentence similarity
[x] Train word2vec on the large text-report dataset (500K reports-spine, chest CT, head CT, others)
Train different dimensions for word embedding, d = 100 to 300
[x] Get statistics about the most frequent tags (tags are obtained from the NOBLE tool, etc.)
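For the open item on frequent and similar sentences, a minimal sketch follows. It counts sentences after a simple normalization and uses Jaccard overlap of token sets as one possible similarity measure. Both choices are assumptions, not something decided in this thread, and all names and sample sentences are hypothetical.

```python
import re
from collections import Counter

def normalize(sentence):
    # Lowercase and drop punctuation so trivially different sentences match.
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    # Jaccard similarity between the token sets of two sentences.
    sa, sb = set(normalize(a).split()), set(normalize(b).split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def most_frequent_sentences(sentences, k=1):
    # Count normalized sentences and return the k most common.
    return Counter(normalize(s) for s in sentences).most_common(k)

# Toy sentences, made up for illustration only.
sentences = [
    "No pleural effusion.",
    "no pleural effusion",
    "No pericardial effusion.",
]
top = most_frequent_sentences(sentences)
similarity = jaccard(sentences[0], sentences[2])
```

Token overlap treats "pleural" and "pericardial" as unrelated words; a measure built on the trained word embeddings might group such sentences more sensibly.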