baeseongsu / ehrxqa

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images, NeurIPS 2023 D&B
MIT License

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

A multi-modal question answering dataset that combines structured Electronic Health Records (EHRs) and chest X-ray images, designed to facilitate joint reasoning across imaging and table modalities in EHR Question Answering (QA) systems.

Overview

Electronic Health Records (EHRs) contain patients' medical histories in various multi-modal formats, yet the potential for joint reasoning across imaging and table modalities remains underexplored in current EHR Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop our dataset, we first construct two uni-modal resources: 1) the MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we construct a multi-modal EHR QA dataset that necessitates both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. This pioneering endeavor enhances engagement with multi-modal EHR sources, and we believe that our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research.

Updates

Features

Installation

For Linux:

Ensure that you have Python 3.8.5 or higher installed on your machine. Set up the environment and install the required packages using the commands below:

# Set up the environment
conda create --name ehrxqa python=3.8.5

# Activate the environment
conda activate ehrxqa

# Install required packages
pip install pandas==1.1.3 tqdm==4.65.0 scikit-learn==0.23.2
pip install dask==2022.12.1
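To confirm that the active environment matches the pinned versions above, a small check along the following lines may help. This is a sketch and not part of the repository: the `PINNED` mapping simply mirrors the `pip install` commands, and `find_mismatches` is a hypothetical helper name.

```python
from importlib import metadata

# Pins copied from the pip commands above.
PINNED = {
    "pandas": "1.1.3",
    "tqdm": "4.65.0",
    "scikit-learn": "0.23.2",
    "dask": "2022.12.1",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for packages that differ from the pins.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from the pin.
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

Running the file inside the `ehrxqa` environment should print nothing when every pin is satisfied.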

Setup

Clone this repository and navigate into it:

git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa

Usage

Privacy

We take data privacy very seriously. All of the data you access through this repository has been carefully prepared to prevent any privacy breaches or data leakage. You can use this data with confidence, knowing that all necessary precautions have been taken.

Access Requirements

The EHRXQA dataset is constructed from MIMIC-CXR-JPG (v2.0.0), Chest ImaGenome (v1.0.0), and MIMIC-IV (v2.2). All of these source datasets require a credentialed PhysioNet license. Due to these requirements and in adherence to the Data Use Agreement (DUA), only credentialed users can access the MIMIC-CXR-VQA dataset files (see Access Policy). To access the source datasets, you must fulfill all of the following requirements:

  1. Be a credentialed user
    • If you do not have a PhysioNet account, register for one here.
    • Follow these instructions for credentialing on PhysioNet.
    • Complete the "CITI Data or Specimens Only Research" training course.
  2. Sign the data use agreement (DUA) for each project

Accessing the EHRXQA Dataset

While the complete EHRXQA dataset is being prepared for publication on the PhysioNet platform, we provide partial access to the dataset via this repository for credentialed users.

To access the EHRXQA dataset, run the main script provided in this repository (it requires your PhysioNet credentials) as follows:

bash build_dataset.sh

During script execution, enter your PhysioNet credentials when prompted.

This script performs three actions: it 1) downloads the source datasets from PhysioNet, 2) preprocesses these datasets, and 3) generates the complete EHRXQA dataset by creating ground-truth answer information.

Keep your credentials secure. If you encounter any issues, check that you have the necessary permissions, a stable internet connection, and all prerequisite tools installed.

Downloading MIMIC-CXR-JPG Images

Dataset Structure

The dataset is structured as follows:

ehrxqa
└── dataset
    ├── _train.json
    ├── _valid.json
    ├── _test.json
    ├── train.json (available post-script execution)
    ├── valid.json (available post-script execution)
    └── test.json  (available post-script execution)
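Because `train.json`, `valid.json`, and `test.json` only appear after the build script has run, it can be useful to check which splits are present before loading them. The following is a minimal sketch, assuming the `dataset` directory layout shown above; `split_status` and `load_split` are hypothetical helper names, not functions shipped with the repository.

```python
import json
from pathlib import Path

SPLITS = ("train", "valid", "test")


def split_status(dataset_dir):
    """Map each split name to True if its answer-complete file exists."""
    root = Path(dataset_dir)
    return {split: (root / f"{split}.json").is_file() for split in SPLITS}


def load_split(dataset_dir, split):
    """Load one split file and return its content (a list of sample dicts)."""
    with open(Path(dataset_dir) / f"{split}.json", encoding="utf-8") as f:
        return json.load(f)
```

For example, `split_status("dataset")` called from the repository root reports which of the three files exist, and `load_split("dataset", "train")` reads the training split once it does.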

Dataset Description

The QA samples in the EHRXQA dataset are stored in individual .json files. Each file contains a list of Python dictionaries, one per QA sample; the keys are illustrated by the example instance below.

After validating PhysioNet credentials, the create_answer.py script generates the ground-truth answer information for each sample.

For example, a single instance looks like this:

{
    'db_id': 'mimic_iv_cxr', 
    'split': 'train',
    'id': 0, 
    'question': 'how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?', 
    'template': 'how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?', 
    'query': 'select 1 * ( strftime(\'%J\',current_time) - strftime(\'%J\',t1.studydatetime) ) from ( select tb_cxr.study_id, tb_cxr.studydatetime from tb_cxr where tb_cxr.study_id in ( select distinct tb_cxr.study_id from tb_cxr where tb_cxr.subject_id = 18679317 and strftime(\'%Y\',tb_cxr.studydatetime) = \'2105\' ) ) as t1 where func_vqa("is the chest x-ray depicting any anatomical findings?", t1.study_id) = true', 
    'value': {'patient_id': 18679317}, 
    'q_tag': 'how many [unit_count] have passed since the [time_filter_exact1] time patient {patient_id} had a chest x-ray study indicating any ${category} [time_filter_global1]?', 
    't_tag': ['abs-year-in', '', '', 'exact-last', ''], 
    'o_tag': {'unit_count': {'nlq': 'days', 'sql': '1 * ', 'type': 'days', 'sql_pattern': '[unit_count]'}}, 
    'v_tag': {'object': [], 'category': ['anatomicalfinding'], 'attribute': []}, 
    'tag': 'how many [unit_count:days] have passed since the [time_filter_exact1:exact-last] time patient {patient_id} had a chest x-ray study indicating any anatomicalfinding [time_filter_global1:abs-year-in]?',
    'para_type': 'machine', 
    'is_impossible': False, 
    'answer': 'Will be generated by dataset_builder/generate_answer.py'
}
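The `query` field above mixes ordinary SQL with a `func_vqa(question, study_id)` call that delegates a yes/no question about an image to a VQA model. One way to picture how such a query could be evaluated is to register a Python function under that name in SQLite. The sketch below is illustrative only and is not the authors' implementation: the study IDs and dates are toy data, the VQA call is replaced by a hard-coded stub, and the query is a simplified variant of the one above that uses single-quoted strings and compares against `1`.

```python
import sqlite3

# Hypothetical stand-in for a VQA model: study IDs for which the answer is "yes".
# A real system would run a VQA model on the image(s) of each study instead.
STUDIES_WITH_FINDINGS = {1001}


def func_vqa(question, study_id):
    """Stub VQA call: returns 1 (true) or 0 (false) for a yes/no question."""
    return int(study_id in STUDIES_WITH_FINDINGS)


def run_demo():
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under the name used in the example query.
    conn.create_function("func_vqa", 2, func_vqa)
    conn.execute(
        "create table tb_cxr (subject_id integer, study_id integer, studydatetime text)"
    )
    # Toy rows: two studies in 2105 and one in 2104 for the same patient.
    conn.executemany(
        "insert into tb_cxr values (?, ?, ?)",
        [
            (18679317, 1001, "2105-03-01 10:00:00"),
            (18679317, 1002, "2105-06-15 09:30:00"),
            (18679317, 1003, "2104-12-31 23:00:00"),
        ],
    )
    query = """
        select t1.study_id
        from (
            select tb_cxr.study_id, tb_cxr.studydatetime
            from tb_cxr
            where tb_cxr.subject_id = 18679317
              and strftime('%Y', tb_cxr.studydatetime) = '2105'
        ) as t1
        where func_vqa('is the chest x-ray depicting any anatomical findings?',
                       t1.study_id) = 1
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return [study_id for (study_id,) in rows]
```

In this toy setup the SQL part narrows the candidates to the patient's studies from 2105, and the stubbed `func_vqa` then filters them by image content, which mirrors the split between table-based and image-based reasoning in the example query.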

Versioning

We employ semantic versioning for our dataset, with the current version being v1.0.0. Generally, we will maintain and provide updates only for the latest version of the dataset. However, in cases where significant updates occur or when older versions are required for validating previous research, we may exceptionally retain previous dataset versions for a period of up to one year. For a detailed list of changes made in each version, check out our CHANGELOG.

Contributing

Contributions that enhance the usability and functionality of this dataset are always welcome. If you're interested in contributing, feel free to fork this repository, make your changes, and then submit a pull request. For significant changes, please first open an issue to discuss the proposed alterations.

Contact

For any questions or concerns regarding this dataset, please feel free to reach out to us (seongsu@kaist.ac.kr or kyungdaeun@kaist.ac.kr). We appreciate your interest and are eager to assist.

Acknowledgements

More details will be provided soon.

Citation

When you use the EHRXQA dataset, we would appreciate it if you cite the following:

@article{bae2023ehrxqa,
  title={EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
  journal={arXiv preprint arXiv:2310.18652},
  year={2023}
}

License

The code in this repository is provided under the terms of the MIT License. The final output of the dataset created using this code, the EHRXQA dataset, is subject to the terms and conditions of the original datasets from PhysioNet: MIMIC-CXR-JPG License, Chest ImaGenome License, and MIMIC-IV License.