batmanlab / BatmanLabWiki

Documents and Wiki of the lab
Apache License 2.0

[Multi-Modal Image+Text] Preprocess the text reports #55

Open sumedhasingla opened 6 years ago

sumedhasingla commented 6 years ago

Results on 7800 reports (we have matching images for a subset of these reports: RAD-ALL-Findings-Impressions.csv).

Most frequent words: [image]

Current size of vocabulary: 7495 distinct words

The length of FINDINGS varies from 0 to 651 words.
The length of IMPRESSIONS varies from 0 to 327 words.
The length of FINDINGS varies from 0 to 66 sentences.
The length of IMPRESSIONS varies from 0 to 30 sentences.

[image]
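For anyone who wants to reproduce these numbers, here is a minimal sketch of how the word counts, vocabulary size, and length ranges could be computed for one report section. The tokenizer, the sentence splitter, and the function names are assumptions made for illustration; the preprocessing actually used for the figures above is not described in this thread.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic runs only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def split_sentences(text):
    """Naive split on ., ! or ? followed by whitespace (assumed splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def section_stats(texts):
    """Word frequencies, vocabulary size, and min/max lengths for one section."""
    word_counts = Counter()
    words_per_report = []
    sentences_per_report = []
    for text in texts:
        tokens = tokenize(text)
        word_counts.update(tokens)
        words_per_report.append(len(tokens))
        sentences_per_report.append(len(split_sentences(text)))
    return {
        "word_counts": word_counts,
        "vocab_size": len(word_counts),
        "words_min_max": (min(words_per_report), max(words_per_report)),
        "sentences_min_max": (min(sentences_per_report), max(sentences_per_report)),
    }

# Toy FINDINGS texts, not taken from the dataset.
findings = [
    "No pleural effusion. The right lung is clear.",
    "",
    "Left lower lobe nodule, unchanged.",
]
stats = section_stats(findings)
```

`stats["word_counts"].most_common(20)` would then give the data for a word-frequency plot. An empty report contributes 0 words and 0 sentences, which matches the lower bounds reported above.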

kayhan-batmanghelich commented 6 years ago

FYI @pyadolla

Thanks @sumedhasingla. This word count figure is very informative. The most frequent words are "no", "right", and "left", which shows that we need a good negation pipeline.
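To make the negation point concrete, here is a rough sketch of a trigger-and-scope rule: a negation word marks the following tokens as negated until a terminating word or sentence-level punctuation is reached. The trigger list, the terminator list, and the function name are assumptions for illustration only, not a pipeline anyone in this thread has chosen.

```python
import re

# Assumed word lists; a real pipeline would use a curated lexicon.
NEGATION_TRIGGERS = {"no", "not", "without"}
SCOPE_TERMINATORS = {"but", "however", "although"}

def tag_negation(sentence):
    """Return (token, is_negated) pairs for one sentence.

    Commas are dropped by the tokenizer, so a comma-separated list after
    a trigger stays inside the negation scope.
    """
    tokens = re.findall(r"[a-z]+|[.;:]", sentence.lower())
    tagged = []
    in_scope = False
    for token in tokens:
        if token in NEGATION_TRIGGERS:
            in_scope = True
            tagged.append((token, False))
        elif token in SCOPE_TERMINATORS or not token.isalpha():
            # A terminator word or . ; : closes the current scope.
            in_scope = False
            tagged.append((token, False))
        else:
            tagged.append((token, in_scope))
    return tagged

tagged = tag_negation("No pleural effusion, but the right lung shows a nodule.")
negated = [token for token, is_negated in tagged if is_negated]
```

In this toy sentence only "pleural" and "effusion" end up negated, while "nodule" stays affirmed. A rule this simple will miss many real report phrasings.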

Also, what do you think about this:

If we get a radiologist intern (I am working on it), do you think it makes sense to draw an "attention" box for each sentence?

kayhan-batmanghelich commented 6 years ago

@sumedhasingla can you show the word counts for impressions and findings separately?

sumedhasingla commented 6 years ago

Results on 548,369 reports. These are all the reports in folder: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/Reports'. It contains reports from Lung (2013-2017), spine (2016-2017), neck (2016-2017), head (2017), abdomen (2017).

Most frequent words in findings: [image]

Most frequent words in impressions: [image]

kayhan-batmanghelich commented 6 years ago

Thanks @sumedhasingla

sumedhasingla commented 6 years ago

To calculate statistics on the most frequent tags, I ran the NOBLE tool on 6K reports and computed the inverse document frequency. Top 20 concepts and semantic types:

| Concept | idf | Semantic type | idf |
|---|---|---|---|
| impression (attitude) | 0.996964 | Finding | 1 |
| Radiologic Impression | 0.996964 | Qualitative Concept | 0.999831 |
| Unremarkable | 0.718455 | Body Part, Organ, or Organ Component | 0.999325 |
| Mass | 0.647267 | Mental Process | 0.997132 |
| Breast Lump | 0.589575 | Spatial Concept | 0.994433 |
| Focus (area of enhancement) | 0.56444 | Disease or Syndrome | 0.994433 |
| Pericardial effusion | 0.550101 | Quantitative Concept | 0.986336 |
| Mild | 0.538124 | Functional Concept | 0.976889 |
| Lesion | 0.529352 | Pathologic Function | 0.962719 |
| Pericardial effusion body substance | 0.512146 | Body Location or Region | 0.947031 |
| Pleural Effusion | 0.505735 | Intellectual Product | 0.93303 |
| Lymphatic Diseases | 0.486842 | Idea or Concept | 0.898617 |
| Lymphadenopathy | 0.486842 | Temporal Concept | 0.842611 |
| Mild Adverse Event | 0.443826 | Body Substance | 0.763158 |
| Nodule | 0.433536 | Conceptual Entity | 0.708333 |
| centimeter | 0.432692 | Therapeutic or Preventive Procedure | 0.637314 |
| No status change | 0.431174 | Neoplastic Process | 0.615385 |
| Curium | 0.430837 | Acquired Abnormality | 0.61471 |
| Small | 0.430162 | Tissue | 0.612854 |
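A minimal sketch of how per-concept document statistics like these could be computed from tagger output. The exact formula behind the table is not stated in this thread; since all reported values fall between 0 and 1, the sketch returns both the fraction of reports containing a concept and the textbook log(N/df), so either reading can be checked. The input layout (one list of concept names per report) and the function names are assumptions.

```python
import math
from collections import Counter

def document_frequency(docs_concepts):
    """Number of reports in which each concept appears at least once."""
    df = Counter()
    for concepts in docs_concepts:
        df.update(set(concepts))  # count each concept once per report
    return df

def idf_scores(docs_concepts):
    """Map concept -> (df / N, log(N / df)) over all observed concepts."""
    n_docs = len(docs_concepts)
    df = document_frequency(docs_concepts)
    return {
        concept: (count / n_docs, math.log(n_docs / count))
        for concept, count in df.items()
    }

# Toy tagger output, one list of concepts per report (not real data).
reports = [
    ["Mass", "Pleural Effusion", "Mass"],
    ["Pleural Effusion"],
    ["Nodule"],
    ["Pleural Effusion", "Nodule"],
]
scores = idf_scores(reports)
```

Repeated mentions inside one report count once, so "Mass" in the first toy report has a document frequency of 1, not 2.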
kayhan-batmanghelich commented 6 years ago

@sumedhasingla hmmm... many of those terms are too general and hence non-informative. Interesting things to look into:

- Pleural Effusion
- Lymphadenopathy
- Pericardial effusion --- why is this intellectual product?!
- Pericardial effusion body substance

What are those? Would you please take a look?

I also wonder what examples there are of the following concepts: Mental Process, Intellectual Product, Conceptual Entity, Neoplastic Process.

There are semantic tags that we should be able to predict; for example, see below. I think these should have a salient visual representation:

@sumedhasingla could you please eyeball which concepts/semantic types co-occur? Given that we are going to do multi-label prediction, knowing this will be helpful.
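One possible way to go beyond eyeballing is to count, for every pair of labels, how many reports contain both. The sketch below assumes each report has already been reduced to a list of semantic types (or concepts); the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(reports_labels):
    """Count how many reports contain each unordered pair of labels."""
    pair_counts = Counter()
    for labels in reports_labels:
        # Deduplicate and sort so each pair has one canonical key per report.
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Toy label sets, one per report (not real data).
reports = [
    ["Finding", "Disease or Syndrome", "Pathologic Function"],
    ["Finding", "Disease or Syndrome"],
    ["Finding", "Neoplastic Process"],
]
pairs = cooccurrence_counts(reports)
top_pair, top_count = pairs.most_common(1)[0]
```

Dividing each pair count by the number of reports containing either label would give a normalized overlap, which is easier to compare across frequent and rare labels.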

Thanks @sumedhasingla

shyamvis commented 6 years ago

We have been using Noble Coder to identify SNOMED-CT concepts in CT reports of patients with trauma. The goal is to identify sentences that are incidental findings in trauma CT reports and the work is being done by Gaurav Trivedi in ISP. We noticed that the versions of vocabularies that come with Noble are quite old.

You can get the latest UMLS in NLM format, unzip and concatenate the MR*.RRF files (cat MRCONSO.RRF.a > MRCONSO.RRF), and use Noble Coder to import them as RRF. If you need the concept type hierarchy, download the latest NCI Metathesaurus in RRF format. Contact Eugene Tseytlin at eugene.tseytlin@gmail.com if you need more information about Noble Coder.
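If a shell is not handy, the concatenation step could be sketched in Python as follows. The helper name and the idea of passing a glob pattern are assumptions; the actual names of the split parts depend on the UMLS release you download, so the pattern is left to the caller.

```python
import glob
import os
import shutil

def concatenate_parts(part_pattern, output_path):
    """Append every file matching part_pattern, in sorted name order, to output_path.

    part_pattern is a glob supplied by the caller (hypothetical; check the
    file names in your own download before using it).
    """
    output_abs = os.path.abspath(output_path)
    parts = sorted(
        p for p in glob.glob(part_pattern)
        if os.path.abspath(p) != output_abs  # never read the output file itself
    )
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts
```

The function returns the list of parts it used, which makes it easy to confirm that nothing was skipped before importing the result into Noble Coder.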

The UMLS semantic types are at https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. If useful you can selectively annotate only those types that are of interest.

In our experiments so far it seems that bag of words with and without embeddings is not improved by adding UMLS or SNOMED-CT concepts.

kayhan-batmanghelich commented 6 years ago

Thanks @sumedhasingla for the comment.

@sumedhasingla @pyadolla FYI.

sumedhasingla commented 6 years ago

Result of the NOBLE tool on 6k reports. To get a better understanding of the semantic types and concepts, I collected stats on how many times any word is tagged with a given (semantic type, concept) pair. Below are some examples of concepts within a semantic type:

Mental Process: 20 (number of concepts in this semantic type)
Regression - mental defense mechanism Content impression (attitude) Psychological habituation Suspicion Persistence Interpretation Attention Concentration Psychological fixation Body Image

Intellectual Product: 165 (number of concepts in this semantic type)
Uncertain Observation Interpretation - intermediate Group Object Financial Account International Classification of Diseases Participation Type - device attestation Leaflet Severe Extremity Pain Pattern of Bowel Movements

Conceptual Entity: 80 (number of concepts in this semantic type)
Upper Limit of Normal Study Terminated Combination Computer Configuration Fusiform shape Content Background dictated - ParticipationMode Spicule Low Risk

Neoplastic Process: 69 (number of concepts in this semantic type)
Breast Carcinoma Esophageal Neoplasms Thyroid Gland Nodule Metastatic Malignant Neoplasm in the Pleura Cyst Metastatic Malignant Neoplasm in the Lymph Nodes Minimal Residual Disease Mass of thyroid gland Leiomyoma Plasma Cell Myeloma
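The counting described above could be sketched roughly like this. The input layout, one (word, semantic type, concept) tuple per tagged mention, is an assumption about how the tagger output was flattened, and the toy rows are illustrative rather than real NOBLE output.

```python
from collections import Counter, defaultdict

def pair_statistics(annotations):
    """Summarize (word, semantic_type, concept) tuples.

    Returns the number of tagged mentions per (semantic type, concept) pair
    and the set of distinct concepts seen under each semantic type.
    """
    pair_counts = Counter()
    concepts_by_type = defaultdict(set)
    for _word, semantic_type, concept in annotations:
        pair_counts[(semantic_type, concept)] += 1
        concepts_by_type[semantic_type].add(concept)
    return pair_counts, concepts_by_type

# Toy annotations (assumed layout, not real tagger output).
annotations = [
    ("impression", "Mental Process", "impression (attitude)"),
    ("impression", "Mental Process", "impression (attitude)"),
    ("suspicious", "Mental Process", "Suspicion"),
    ("cyst", "Neoplastic Process", "Cyst"),
]
pair_counts, concepts_by_type = pair_statistics(annotations)
```

`len(concepts_by_type[t])` corresponds to the "number of concepts in this semantic type" figure, and `pair_counts` gives the per-pair tag counts.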

sumedhasingla commented 6 years ago

The interesting semantic types are: 'Acquired Abnormality', 'Anatomical Abnormality', 'Congenital Abnormality', 'Sign or Symptom', 'Finding', 'Pathologic Function', 'Disease or Syndrome', 'Disease or Syndrome, Anatomical Abnormality'
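A small sketch of restricting tagger output to these types. It assumes each annotation is available as a (concept, semantic type) pair and that a concept with several types carries them as one comma-joined string, as the last entry in the list suggests; both points are assumptions about the output format.

```python
# Semantic types listed in the comment above.
INTERESTING_TYPES = {
    "Acquired Abnormality",
    "Anatomical Abnormality",
    "Congenital Abnormality",
    "Sign or Symptom",
    "Finding",
    "Pathologic Function",
    "Disease or Syndrome",
    "Disease or Syndrome, Anatomical Abnormality",
}

def keep_interesting(annotations, allowed=INTERESTING_TYPES):
    """Keep (concept, semantic_type) pairs whose type string is in the allowed set."""
    return [(concept, sem_type) for concept, sem_type in annotations if sem_type in allowed]

# Toy annotations (assumed layout, not real tagger output).
annotations = [
    ("Pleural Effusion", "Disease or Syndrome"),
    ("centimeter", "Quantitative Concept"),
    ("Mass", "Finding"),
]
kept = keep_interesting(annotations)
```

Matching on the full type string keeps the check exact; splitting comma-joined types would be unsafe here because some type names, such as "Body Part, Organ, or Organ Component", contain commas themselves.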

sumedhasingla commented 6 years ago

image