[Multi-Modal Image+Text] Explore available software packages to pre-process the text reports.

batmanlab / BatmanLabWiki

Documents and Wiki of the lab

Apache License 2.0

[Multi-Modal Image+Text] Explore available software packages to pre-process the text reports. #36

Open sumedhasingla opened 6 years ago

sumedhasingla commented 6 years ago

Location: /pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/singla/RAD-ALL.deid

Keyword or concept tagging

[x] NOBLE Coder: a Named Entity Recognition (NER) engine for biomedical text. It is a DBMI tool and can be used with TIES.
[x] TIES built-in annotation tool
Work with Mike to get the annotations that TIES stores for each report.
[ ] Apache cTAKES

Negation identifier

[ ] DEEPEN - https://github.com/jessica-glover/deepen
[x] NegEx - https://github.com/mongoose54/negex/tree/master/negex.python
- https://github.com/chapmanbe/negex
[x] pyConTextNLP - https://github.com/chapmanbe/pyConTextNLP

Pre-processing for VQA

[ ] Converting the report to question

Pre-processing for Image Captioning

[ ] Converting the report to a template

kayhan-batmanghelich commented 6 years ago

@pyadolla the template is explained in this issue (one of the papers there): https://github.com/kayhan-batmanghelich/BatmanLab/issues/35

kayhan-batmanghelich commented 6 years ago

Here is a link to the Stanford software: https://nlp.stanford.edu/software/lex-parser.shtml

It is based on this Stack Overflow question: https://stackoverflow.com/questions/19145948/converting-an-english-statement-into-a-questi0n
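For the VQA pre-processing item, here is a minimal sketch of a much simpler alternative to parser-based conversion. It does not use the Stanford parser; it assumes a concept and its negation status are already available from the tagging step, and it fills a fixed yes/no template. The function name `to_qa` and the template wording are hypothetical.

```python
from typing import Tuple


def to_qa(concept: str, negated: bool) -> Tuple[str, str]:
    """Turn a tagged concept into a yes/no question-answer pair.

    Assumption: `concept` is a noun phrase found in the report and
    `negated` says whether the report denies its presence.
    """
    question = f"Is there {concept}?"
    answer = "no" if negated else "yes"
    return question, answer


if __name__ == "__main__":
    print(to_qa("pleural effusion", True))
    print(to_qa("cardiomegaly", False))
```

A template like this only covers presence/absence questions; free-form questions would still need a parser-based approach such as the one linked above.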

sumedhasingla commented 6 years ago

Relevant MICCAI 2018 paper: TextRay: Mining Clinical Reports to Gain a Broad Understanding of Chest X-rays

kayhan-batmanghelich commented 6 years ago

In this paper, they used a tool from NIH called Medical Text Indexer. Here is what they did:

This might be helpful for tagging. Please take a look.

kayhan-batmanghelich commented 6 years ago

@pyadolla, would you please add the CliNER results here for the record?

sumedhasingla commented 6 years ago

@Sumedha Run TIES, Medical Text Indexer (MTI), and CliNER on the Findings and Impression sections of each report.
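Before running any tagger, the two sections have to be pulled out of each report. A minimal sketch, assuming section headers are upper-case, start a line, and end with a colon (for example `FINDINGS:` and `IMPRESSION:`); the actual layout of the reports in RAD-ALL.deid may differ, and `extract_sections` is a hypothetical helper name.

```python
import re
from typing import Dict

# Assumed header layout: upper-case label at the start of a line, then a colon.
TARGET_HEADER = re.compile(r"^[ \t]*(FINDINGS?|IMPRESSION)[ \t]*:", re.MULTILINE)
# Any upper-case "LABEL:" at the start of a line closes the current section.
ANY_HEADER = re.compile(r"^[ \t]*[A-Z][A-Z /]*:", re.MULTILINE)


def extract_sections(report: str) -> Dict[str, str]:
    """Return the text of the Findings and Impression sections, if present."""
    sections: Dict[str, str] = {}
    for match in TARGET_HEADER.finditer(report):
        name = "FINDINGS" if match.group(1).startswith("FINDING") else "IMPRESSION"
        start = match.end()
        following = ANY_HEADER.search(report, start)
        end = following.start() if following else len(report)
        # Collapse line breaks and repeated spaces into single spaces.
        sections[name] = " ".join(report[start:end].split())
    return sections


if __name__ == "__main__":
    sample = (
        "HISTORY: Cough.\n"
        "FINDINGS: The lungs are clear.\n"
        "No pleural effusion.\n"
        "IMPRESSION:\n"
        "No acute disease.\n"
    )
    print(extract_sections(sample))
```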

kayhan-batmanghelich commented 6 years ago

@sumedhasingla if you have some preliminary results from TIES, please paste an example here.

sumedhasingla commented 5 years ago

NOBLE Tool extensively tags the reports with concepts and their semantic types, using a chosen thesaurus; I am using "NCI_Metathesaurus". The concepts found by NOBLE are then used as input to pyConTextNLP to find the negations. The NOBLE Tool tagging results for about 8k reports are at: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/singla/RAD-ALL-NOBLE-ContextPY-ImageFileName.csv'
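To illustrate the idea of feeding tagged concepts into a negation step, here is a simplified NegEx-style sketch. It is not the pyConTextNLP API and not the pipeline used for the CSV above: the trigger list, the `WINDOW` size, and the function names are all hypothetical, and the input is assumed to be a single sentence so that a trigger in one sentence does not leak into the next.

```python
import re
from typing import List

# Tiny illustrative trigger list; real NegEx lexicons are far larger.
NEGATION_TRIGGERS = ("no", "without", "negative for", "no evidence of")
WINDOW = 5  # assumed scope: number of tokens before the concept to inspect


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_sub(tokens: List[str], sub: List[str]) -> int:
    """Index of the first occurrence of `sub` inside `tokens`, or -1."""
    n = len(sub)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == sub:
            return i
    return -1


def is_negated(sentence: str, concept: str) -> bool:
    """True if a negation trigger precedes `concept` within WINDOW tokens."""
    tokens = tokenize(sentence)
    pos = find_sub(tokens, tokenize(concept))
    if pos < 0:
        raise ValueError(f"concept {concept!r} not found in sentence")
    window = tokens[max(0, pos - WINDOW):pos]
    return any(find_sub(window, tokenize(t)) >= 0 for t in NEGATION_TRIGGERS)


if __name__ == "__main__":
    print(is_negated("No evidence of pleural effusion.", "pleural effusion"))
    print(is_negated("There is a small pleural effusion.", "pleural effusion"))
```

This only handles triggers that come before the concept; post-concept triggers and scope-terminating words are left out to keep the sketch short.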

sumedhasingla commented 5 years ago

The TIES annotation tool cannot process the reports we have in RAD-ALL.deid, because these reports were extracted directly from MARS and there is no way to query or find them through the TIES interface.

To get tags from the TIES tool, we would have to extract the reports again from TIES (500k) and save the annotation information with each report. We ran this process on a small sample of about 5k reports. The results are at: /pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/Reports/test-concepts

The problem with this approach is that TIES can handle these annotations for only 5k files at a time, so the process has to be re-run for every 5k reports. Also, each query built in TIES to extract reports must return at most 5k reports.
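If the report identifiers could be listed ahead of time, the 5k limit could be handled by splitting them into fixed-size batches and running one extraction per batch. This is only a sketch under that assumption; it does not call TIES, and `batches` and `MAX_BATCH` are hypothetical names.

```python
from typing import Iterator, List, Sequence

MAX_BATCH = 5000  # per-run limit reported for TIES annotation


def batches(report_ids: Sequence[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `report_ids`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(report_ids), size):
        yield list(report_ids[start:start + size])


if __name__ == "__main__":
    ids = [f"report_{i}" for i in range(12001)]
    for number, batch in enumerate(batches(ids), start=1):
        print(number, len(batch))
```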

Since TIES uses NOBLE Tool under the hood, maybe we can skip the TIES annotation.

sumedhasingla commented 5 years ago

An analysis of the unique words in these 8k reports. Vocabulary size: 7,495. [Figures not preserved: top-20 semantic types; top-20 concept words]
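A vocabulary count of this kind can be sketched as follows. The tokenization rule (lower-cased runs of letters) is an assumption and is not necessarily what produced the figure of 7,495; `vocabulary` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable


def vocabulary(reports: Iterable[str]) -> Counter:
    """Count lower-cased alphabetic tokens across all reports."""
    counts: Counter = Counter()
    for text in reports:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


if __name__ == "__main__":
    reports = [
        "No pleural effusion.",
        "Small pleural effusion. No pneumothorax.",
    ]
    vocab = vocabulary(reports)
    print("Vocabulary size:", len(vocab))
    print("Top words:", vocab.most_common(3))
```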