[Multi-Modal Image+Text] Preprocess the text reports

sumedhasingla commented 6 years ago

[x] Extract just the finding and impression sections from the reports
[x] Get an idea of the distribution of length (number of words, sentences) of the finding and impression sections in different reports.
[x] Perform the word count. Find the most frequent words.
[ ] Find the most frequent sentences.
    Group the similar sentences
    Fix some metric to define sentence similarity
[x] Train word2vec on the large text-report dataset (500K reports: spine, chest CT, head CT, others)
    Train different dimensions for word embedding, d = 100 to 300
[x] Get statistics about the most frequent tags (tags are obtained from the NOBLE tool, etc.)
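
The first item (extracting the finding and impression sections) could be sketched roughly as follows. This is a minimal sketch, assuming the reports are plain text with upper-case section headers such as `FINDINGS:` and `IMPRESSION:` at the start of a line. The header spellings, the regular expression, and the `extract_sections` helper are hypothetical and are not the code used for the results in this thread.

```python
import re

# Assumed header layout: an upper-case label followed by a colon at line start.
SECTION_HEADER = re.compile(r"^\s*([A-Z][A-Z ]+):", re.MULTILINE)

# Assumed spellings of the two sections of interest; real reports may differ.
WANTED = {
    "FINDING": "findings",
    "FINDINGS": "findings",
    "IMPRESSION": "impression",
    "IMPRESSIONS": "impression",
}


def extract_sections(report: str) -> dict:
    """Return the finding and impression text of one report ('' if absent)."""
    sections = {"findings": "", "impression": ""}
    headers = list(SECTION_HEADER.finditer(report))
    for i, match in enumerate(headers):
        key = WANTED.get(match.group(1).strip())
        if key is None:
            continue  # some other section, e.g. HISTORY or TECHNIQUE
        # A section runs until the next header or the end of the report.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(report)
        text = " ".join(report[match.end():end].split())
        sections[key] = (sections[key] + " " + text).strip()
    return sections
```

Reports whose headers are formatted differently (mixed case, no colon) would come back with empty sections under this sketch, which is one way zero-length sections can show up in the length statistics.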

sumedhasingla commented 6 years ago

Results on 7800 reports (we have matching images for a subset of these reports: RAD-ALL-Findings-Impressions.csv).

Most frequent words:

Current size of vocabulary: 7495 distinct words

The length of FINDINGS varies between 0 and 651 words and between 0 and 66 sentences. The length of IMPRESSIONS varies between 0 and 327 words and between 0 and 30 sentences.
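
For reference, numbers of this kind can be computed along the following lines. This is only a sketch: the tokenizer and the sentence splitter are deliberately naive assumptions (the thread does not say which ones were used), and `length_stats` and `word_counts` are hypothetical helper names.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lower-case alphanumeric tokens; a simple assumed tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text: str) -> list:
    """Naive split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def length_stats(sections: list) -> dict:
    """Min/max length in words and in sentences over a list of section texts."""
    words = [len(tokenize(s)) for s in sections]
    sents = [len(split_sentences(s)) for s in sections]
    return {"words": (min(words), max(words)),
            "sentences": (min(sents), max(sents))}


def word_counts(sections: list) -> Counter:
    """Corpus-wide word frequencies; len(result) is the vocabulary size."""
    counts = Counter()
    for s in sections:
        counts.update(tokenize(s))
    return counts
```

`word_counts(findings).most_common(20)` would then give the most frequent words, and calling it separately on the finding and impression texts gives per-section counts.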

kayhan-batmanghelich commented 6 years ago

FYI @pyadolla

Thanks @sumedhasingla. This word count figure is very informative. The most frequent words are "no", "right" and "left", which shows we have to find a good negation pipeline.

Also, what do you think about this:

If we get a radiologist intern (I am working on it), do you think it makes sense to draw an "attention" box for each sentence?

kayhan-batmanghelich commented 6 years ago

@sumedhasingla can you show the word count for impression and finding separately?

sumedhasingla commented 6 years ago

Results on 548,369 reports. These are all the reports in folder: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/Reports'. It contains reports from Lung (2013-2017), spine (2016-2017), neck (2016-2017), head (2017), abdomen (2017).

The impression and finding sections were extracted from all the reports and saved at location: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/singla/Processed_Reports/Text_To_Train_Glove'

Distribution of words: The length of FINDINGS varies between 1 and 1586 words and between 1 and 127 sentences. The length of IMPRESSIONS varies between 1 and 693 words and between 1 and 67 sentences.

Most frequent words in findings

Most frequent words in impressions

kayhan-batmanghelich commented 6 years ago

Thanks @sumedhasingla

sumedhasingla commented 6 years ago

To calculate the statistics about the most frequent tags, I ran the NOBLE tool on 6K reports and computed the inverse document frequency. Top 20 concepts and semantic types:

| Concept | Concept idf | Semantic type | Semantic type idf |
|---|---|---|---|
| impression (attitude) | 0.996964 | Finding | 1 |
| Radiologic Impression | 0.996964 | Qualitative Concept | 0.999831 |
| Unremarkable | 0.718455 | Body Part, Organ, or Organ Component | 0.999325 |
| Mass | 0.647267 | Mental Process | 0.997132 |
| Breast Lump | 0.589575 | Spatial Concept | 0.994433 |
| Focus (area of enhancement) | 0.56444 | Disease or Syndrome | 0.994433 |
| Pericardial effusion | 0.550101 | Quantitative Concept | 0.986336 |
| Mild | 0.538124 | Functional Concept | 0.976889 |
| Lesion | 0.529352 | Pathologic Function | 0.962719 |
| Pericardial effusion body substance | 0.512146 | Body Location or Region | 0.947031 |
| Pleural Effusion | 0.505735 | Intellectual Product | 0.93303 |
| Lymphatic Diseases | 0.486842 | Idea or Concept | 0.898617 |
| Lymphadenopathy | 0.486842 | Temporal Concept | 0.842611 |
| Mild Adverse Event | 0.443826 | Body Substance | 0.763158 |
| Nodule | 0.433536 | Conceptual Entity | 0.708333 |
| centimeter | 0.432692 | Therapeutic or Preventive Procedure | 0.637314 |
| No status change | 0.431174 | Neoplastic Process | 0.615385 |
| Curium | 0.430837 | Acquired Abnormality | 0.61471 |
| Small | 0.430162 | Tissue | 0.612854 |
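
As a rough illustration of how per-tag statistics like these can be derived, here is a minimal sketch. It assumes the tagger output has already been reduced to one collection of tags per report, and it assumes the common `log(N / df)` definition of inverse document frequency; the thread does not state the exact formula behind the table, so both the input format and the function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(tags_per_report: list) -> dict:
    """Fraction of reports in which each tag occurs at least once."""
    n_reports = len(tags_per_report)
    df = Counter()
    for tags in tags_per_report:
        df.update(set(tags))  # count each tag once per report
    return {tag: count / n_reports for tag, count in df.items()}


def inverse_document_frequencies(tags_per_report: list) -> dict:
    """Assumed definition: idf(tag) = log(N / df(tag))."""
    fractions = document_frequencies(tags_per_report)
    return {tag: math.log(1.0 / frac) for tag, frac in fractions.items()}
```

Under this definition a tag that appears in every report gets an idf of 0, so very general tags sink to the bottom when sorting by idf in descending order.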

kayhan-batmanghelich commented 6 years ago

@sumedhasingla hmmm... many of those terms are too general and hence non-informative. Interesting things to look into:

Pleural Effusion
Lymphadenopathy
Pericardial effusion --- why is this intellectual product?!
Pericardial effusion body substance

What are those? Would you please take a look?

I also wonder what examples there are of the following concepts: Mental Process, Intellectual Product, Conceptual Entity, Neoplastic Process.

There are semantic tags that we should be able to predict; for example, see below. I think these should have a salient visual representation:

Neoplastic Process
Disease or Syndrome
Body Substance

@sumedhasingla could you please eyeball which concepts/semantic types co-occur? Given that we are going to do multi-label prediction, knowing this will be helpful.
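
One simple way to look at co-occurrence is to count, for every pair of tags, the number of reports that contain both. The sketch below assumes one collection of tags (concepts or semantic types) per report; `cooccurrence_counts` is a hypothetical helper, not existing code.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tags_per_report: list) -> Counter:
    """Number of reports in which each unordered pair of tags appears together."""
    pairs = Counter()
    for tags in tags_per_report:
        # Sort so that each pair has a single canonical (a, b) key.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs
```

`cooccurrence_counts(reports).most_common(20)` would list the pairs that appear together most often, which gives a first idea of how correlated the labels are.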

Thanks @sumedhasingla

shyamvis commented 6 years ago

We have been using Noble Coder to identify SNOMED-CT concepts in CT reports of patients with trauma. The goal is to identify sentences that are incidental findings in trauma CT reports and the work is being done by Gaurav Trivedi in ISP. We noticed that the versions of vocabularies that come with Noble are quite old.

You can get the latest UMLS in NLM format, unzip and concatenate MR*.RRF files (cat MRCONSO.RRF.a > MRCONSO.RRF) and use Noble Coder to import as RRF. If you need the concept type hierarchy, download the latest NCI Metathesaurus in RRF format. Contact Eugene Tseytlin at eugene.tseytlin@gmail.com if you need more information about Noble Coder.

The UMLS semantic types are at https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. If useful you can selectively annotate only those types that are of interest.

In our experiments so far it seems that bag of words with and without embeddings is not improved by adding UMLS or SNOMED-CT concepts.

kayhan-batmanghelich commented 6 years ago

Thanks @sumedhasingla for the comment.

@sumedhasingla @pyadolla FYI.

sumedhasingla commented 6 years ago

Result of the NOBLE tool on 6K reports. To get a better understanding of the semantic types and concepts, I collected stats about how many times any word is tagged with a given (semantic type, concept) pair. Below are some examples of concepts within a semantic type:

Mental Process: 20 (number of concepts in this semantic type)
Regression - mental defense mechanism Content impression (attitude) Psychological habituation Suspicion Persistence Interpretation Attention Concentration Psychological fixation Body Image

Intellectual Product: 165 (number of concepts in this semantic type)
Uncertain Observation Interpretation - intermediate Group Object Financial Account International Classification of Diseases Participation Type - device attestation Leaflet Severe Extremity Pain Pattern of Bowel Movements

Conceptual Entity: 80 (number of concepts in this semantic type)
Upper Limit of Normal Study Terminated Combination Computer Configuration Fusiform shape Content Background dictated - ParticipationMode Spicule Low Risk

Neoplastic Process: 69 (number of concepts in this semantic type)
Breast Carcinoma Esophageal Neoplasms Thyroid Gland Nodule Metastatic Malignant Neoplasm in the Pleura Cyst Metastatic Malignant Neoplasm in the Lymph Nodes Minimal Residual Disease Mass of thyroid gland Leiomyoma Plasma Cell Myeloma
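
The bookkeeping described above can be sketched as follows, assuming each tagged mention has been reduced to a `(semantic_type, concept)` tuple. That input format and the `summarize_annotations` helper are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def summarize_annotations(annotations: list):
    """Count (semantic type, concept) pairs and group concepts by semantic type.

    annotations: one (semantic_type, concept) tuple per tagged mention.
    """
    pair_counts = Counter(annotations)
    concepts_by_type = defaultdict(set)
    for semantic_type, concept in pair_counts:
        concepts_by_type[semantic_type].add(concept)
    grouped = {t: sorted(c) for t, c in concepts_by_type.items()}
    return pair_counts, grouped
```

The number of concepts in a semantic type is then `len(grouped[semantic_type])`, and `pair_counts` gives how often each pair was tagged.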

sumedhasingla commented 6 years ago

The interesting semantic types are: 'Acquired Abnormality', 'Anatomical Abnormality', 'Congenital Abnormality', 'Sign or Symptom', 'Finding', 'Pathologic Function', 'Disease or Syndrome', 'Disease or Syndrome, Anatomical Abnormality'
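
For the multi-label prediction discussed earlier in the thread, these types could serve as the label set. The following is only a sketch of turning the semantic types found in one report into a binary target vector; the list is copied from this comment, while the `multilabel_vector` helper and the choice of one label per listed string are assumptions.

```python
# Label set copied from the comment above; the order is arbitrary.
INTERESTING_TYPES = [
    "Acquired Abnormality",
    "Anatomical Abnormality",
    "Congenital Abnormality",
    "Sign or Symptom",
    "Finding",
    "Pathologic Function",
    "Disease or Syndrome",
    "Disease or Syndrome, Anatomical Abnormality",
]


def multilabel_vector(report_types: list) -> list:
    """Binary vector with a 1 for each interesting type present in the report."""
    present = set(report_types)
    return [1 if t in present else 0 for t in INTERESTING_TYPES]
```

Semantic types outside the list (for example `Tissue`) are simply ignored by this sketch.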

sumedhasingla commented 6 years ago

batmanlab / BatmanLabWiki #55