VectorInstitute / odyssey

A toolkit for developing foundation models using Electronic Health Record (EHR) data.
https://vectorinstitute.github.io/EHRMamba
Apache License 2.0

Add NDJSON loading script and update database connection settings #92

Open zzadxz opened 1 month ago

zzadxz commented 1 month ago

PR Type

[Feature, Fix, Documentation]

Short Description

PR Summary:

Tests Added

No unit tests were added for this script. It was tested manually with sample NDJSON files, which imported successfully into PostgreSQL.

Issue Reference

Closes #84 – resolves the need for loading MIMIC-IV FHIR NDJSON files into the PostgreSQL database for use with collect.py.

Detailed Description

1. Added Script:
The new script load_ndjson_to_postgres.py reads each NDJSON file in a specified directory, flattens nested JSON data where necessary, and loads the data into a specified PostgreSQL database. This streamlines the process of loading large FHIR datasets for analysis.

2. Updates to collect.py:
collect.py now reads its database credentials from environment variables.

3. .gitignore Update:
Excluded the physionet.org directory, in case users download the dataset inside the main repository.
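
For readers unfamiliar with the flattening step mentioned in item 1, here is a minimal sketch of the idea. It is not the code from load_ndjson_to_postgres.py; the function names `flatten` and `read_ndjson`, the underscore separator, and the choice to store lists as JSON strings are all assumptions made for illustration. The PostgreSQL insert step is left out.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into a single level (hypothetical helper).

    Nested keys are joined with `sep`; lists are kept as JSON strings so
    each value fits in one column.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        elif isinstance(value, list):
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


def read_ndjson(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one flattened record per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield flatten(json.loads(line))


if __name__ == "__main__":
    # A made-up FHIR-like line, not taken from the MIMIC-IV dataset.
    sample = ['{"resourceType": "Patient", "id": "p1", "meta": {"versionId": "1"}}']
    for row in read_ndjson(sample):
        print(row)
```

Because `read_ndjson` accepts any iterable of strings, an open file handle can be passed directly, so a large file is processed line by line rather than loaded into memory at once.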

Environment Variable Setup:
To use the new scripts, please set up environment variables:


export DB_HOST=localhost
export DB_PORT=5432
export DB_NAME=mimiciv_fhir
export DB_USER=your_username
export DB_PASSWORD=your_password
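
As a rough illustration of how a script might pick up the variables above, the sketch below collects them into a settings dictionary. The helper name `db_settings`, the fallback defaults for DB_HOST and DB_PORT, and the output key names are assumptions, not the actual code in collect.py.

```python
import os
from typing import Dict, Mapping


def db_settings(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Build connection settings from environment variables (hypothetical helper)."""
    required = ("DB_NAME", "DB_USER", "DB_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        # The defaults for host and port are assumptions for this sketch.
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }


if __name__ == "__main__":
    try:
        settings = db_settings()
    except KeyError as err:
        print(err)
    else:
        # Avoid printing the password.
        print({k: v for k, v in settings.items() if k != "password"})
```

Failing early with the list of missing variables gives a clearer error than a connection failure later on.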

amrit110 commented 1 month ago

@zzadxz thanks for this PR! Great to see you think of ideas to improve the repo.

zzadxz commented 1 month ago

@amrit110 Of course! More than happy to help :)