[Project] Automatically label medical data from diagnosis reports

xis10z commented 5 months ago

Title: Automatically label medical data from diagnosis reports

Project Lead: Frank Langbein, frank@langbein.org

Description: We wish to automatically label medical diagnosis data (MRI, CT, ultrasound, genomics, histopathology, etc.) using the text in diagnosis/pathology reports. The resulting labelled data may subsequently be used to train and validate classifiers on the diagnosis data, or for semi-supervised or unsupervised segmentation (beyond the scope of this project). Labels of interest include cancer type, sub-type, approximate region specification, etc., but also patient characteristics such as age, gender, history, etc.

We expect to use online or local large language models (LLMs), or both, to process the diagnosis data and generate the tags. To develop this we can use a diagnosis report dataset derived from TCGA (The Cancer Genome Atlas) and the data on TCGA itself. The ultimate target is other cancers and data, but this is a good dataset to start with and explore the approach. Open to other suggestions.

Ideal Participant Characteristics: Generally good programming skills, likely Python (it may not get that complex if the LLMs can do a good job). Experience with LLMs, prompt engineering, fine-tuning. Some understanding of the medical context (specifically TCGA/The Cancer Genome Atlas data; see dataset below).

Resources:

Libraries/Software: Python, LLMs (online models as well as local), machine learning framework (depending on LLM used)
Data:
- TCGA-Reports: A Machine-Readable Pathology Report Resource for Benchmarking Text-Based AI Models
  - https://data.mendeley.com/datasets/hyg5xkznpx/1
  - https://pubmed.ncbi.nlm.nih.gov/38487800/ (read this for more)
- The Cancer Genome Atlas Program (TCGA)
  - https://www.cancer.gov/ccg/research/genome-sequencing/tcga
Hardware: GPUs (Colab, ARCA, Linux lab(?))

[First Task]: Preprocessing. Obtain and understand the diagnosis report dataset (a CSV file with patient labels and the diagnosis report), link it to the TCGA data (via the patient labels), and decide what tags to use.
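
As a rough starting point for the linking step, here is a minimal sketch using only the standard library. The column name `patient_filename` comes from the dataset discussion in this thread; the `text` column name, the sample rows, and the `labels` mapping are made-up placeholders, so check them against the real CSV and the TCGA data before relying on this.

```python
import csv
import io


def patient_id(patient_filename: str) -> str:
    """Return the part of patient_filename before the first '.'."""
    return patient_filename.split(".", 1)[0]


def link_reports(report_rows, labels_by_patient):
    """Attach a label to each report row; rows without a known label get None."""
    linked = []
    for row in report_rows:
        pid = patient_id(row["patient_filename"])
        linked.append(
            {
                "patient_id": pid,
                "text": row["text"],  # assumed column name
                "label": labels_by_patient.get(pid),
            }
        )
    return linked


# Tiny made-up stand-in for the real CSV, for illustration only.
sample_csv = io.StringIO(
    "patient_filename,text\n"
    "TCGA-AA-0001.abc123,Example report text one\n"
    "TCGA-BB-0002.def456,Example report text two\n"
)
rows = list(csv.DictReader(sample_csv))

# Hypothetical label table, e.g. built from metadata looked up on TCGA.
labels = {"TCGA-AA-0001": "TCGA-BRCA"}

linked = link_reports(rows, labels)
```

Rows whose patient ID has no entry in the label table keep `label` as `None`, which makes it easy to count how many reports could not be linked.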

[Second Task]: Prompt Engineering. The LLM likely has to be prompted to create summaries and answer specific questions, with definitions of what is being sought. Explore the LLMs' ability to answer the questions and measure performance.
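
One possible shape for this task, as a sketch only: ask the model for a JSON answer with fixed keys, parse it defensively, and score exact matches against reference labels. The prompt wording, the key names, and `fake_llm` are hypothetical placeholders; a real run would swap `fake_llm` for a call to whichever online or local model is chosen.

```python
import json

# Hypothetical prompt wording; the real definitions of each tag would go here.
PROMPT_TEMPLATE = (
    "You are reading a pathology report.\n"
    "Answer with a JSON object with the keys: {keys}.\n"
    "Use null when the report does not state a value.\n\n"
    "Report:\n{report}\n"
)


def build_prompt(report, keys):
    """Fill the template with the report text and the requested keys."""
    return PROMPT_TEMPLATE.format(keys=", ".join(keys), report=report)


def parse_answer(raw, keys):
    """Parse the model output; fall back to all-None if it is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {k: None for k in keys}
    if not isinstance(data, dict):
        return {k: None for k in keys}
    return {k: data.get(k) for k in keys}


def accuracy(predictions, references, key):
    """Fraction of reports where the predicted value equals the reference value."""
    if not references:
        return 0.0
    hits = sum(p.get(key) == r.get(key) for p, r in zip(predictions, references))
    return hits / len(references)


def fake_llm(prompt):
    """Stand-in for a real model call; always returns the same answer."""
    return '{"cancer_type": "breast", "age": null}'


keys = ["cancer_type", "age"]
prediction = parse_answer(fake_llm(build_prompt("Example report text", keys)), keys)
score = accuracy([prediction], [{"cancer_type": "breast"}], "cancer_type")
```

Exact-match accuracy is only one option; for free-text tags such as region descriptions, a looser comparison would probably be needed.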

[Third Task]: Fine Tuning. Explore LoRA approaches to fine-tune a (local) LLM and measure performance.
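
To give a feel for why LoRA is attractive on limited GPUs: instead of updating a full weight matrix of shape d_out x d_in, LoRA trains two small matrices of rank r whose product is added to the frozen weights. The sketch below only counts parameters for one such matrix; the 4096 x 4096 size and rank 8 are illustrative assumptions, not values from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for a rank-r LoRA update B (d_out x r) times A (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not taken from a specific LLM).
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(f"full: {full}, LoRA: {lora}, fraction trainable: {fraction:.4%}")
```

The actual saving depends on which layers are adapted and on the rank, both of which would be part of what this task explores.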

[Fourth Task]: Document Results

ThomasGreatrix commented 5 months ago

Tried to download the "TCGA-Reports" data. It contained a CSV file, a large folder of .p files, and a corrupted folder ("imgs_for_aws.zip").

I've tried redownloading it a few times, and the image folder is always corrupted. Am I opening it with the wrong program, or is it just corrupt?

Also, what are the .p files? I've never come across this file type before.

xis10z commented 5 months ago

The .p files look like they contain the AWS Textract results. I don't know the details, but as far as I can tell this was used to convert the PDFs to the text records in the csv.
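
If you want to peek inside one, the following is a minimal sketch that assumes the .p files are Python pickles (a common use of that extension, but not confirmed here). Only unpickle files from a source you trust, since loading a pickle can run arbitrary code. `describe_pickle` is a hypothetical helper name.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load one .p file and summarise its top-level structure."""
    with Path(path).open("rb") as fh:
        obj = pickle.load(fh)  # assumption: the file is a pickle from a trusted source
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        summary["keys"] = sorted(str(k) for k in obj)
    elif isinstance(obj, (list, tuple)):
        summary["length"] = len(obj)
    return summary
```

Calling `describe_pickle` on one file from the folder should be enough to see whether the contents are worth using or whether the csv is sufficient.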

The zip file is broken (it can be fixed with zip -FF to recover some of the data), but it only contains the images of the original reports, so it is not needed. You can also get the PDFs from TCGA (via patient_filename).
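
To confirm that the archive itself is damaged rather than the program used to open it, a quick check with Python's standard `zipfile` module can help. This is a sketch; `check_zip` is a hypothetical helper, and it only reports problems, it does not repair anything.

```python
import zipfile


def check_zip(path):
    """Return 'ok' or a short description of the first problem found."""
    if not zipfile.is_zipfile(path):
        return "not a readable zip archive"
    try:
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # name of the first member that fails its check, or None
    except zipfile.BadZipFile as exc:
        return f"bad zip file: {exc}"
    return "ok" if bad is None else f"corrupt member: {bad}"
```

For example, `check_zip("imgs_for_aws.zip")` returning anything other than `"ok"` would point to the download rather than the unzip tool.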

The actual OCR results are in the csv file, which is where I'd start from. It needs to be linked to the cancer types (or any other tags we wish to generate) on the TCGA site itself. As far as I can tell the tags are not included in the reports directly. To find them, take the first part (up to the .) of the patient_filename from the CSV file and search for it on https://portal.gdc.cancer.gov/
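
Searching the portal by hand will not scale to the whole CSV, so one option is to query the GDC API instead of the portal. The sketch below only builds the request URL and does not send it. The endpoint, the `submitter_id` filter, and the field names are my assumptions about the GDC API and should be checked against its documentation; I am also assuming that the first part of patient_filename is the case submitter ID.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; verify against the GDC API documentation.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def case_id_from_filename(patient_filename):
    """First part of patient_filename, up to the first '.'."""
    return patient_filename.split(".", 1)[0]


def build_cases_query(
    patient_filenames,
    fields=("submitter_id", "project.project_id", "primary_site", "disease_type"),
):
    """Build a GDC cases query URL for the given patient_filename values."""
    ids = sorted({case_id_from_filename(name) for name in patient_filenames})
    filters = {"op": "in", "content": {"field": "submitter_id", "value": ids}}
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),  # assumed field names
        "format": "JSON",
        "size": str(len(ids)),
    }
    return f"{GDC_CASES_ENDPOINT}?{urlencode(params)}"
```

The resulting URL can be fetched in batches with any HTTP client, and the returned project or disease fields used as candidate tags.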

