MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License
2.43k stars 1.5k forks source link

DuckDB MIMIC-III concepts implementation #1529

Open SphtKr opened 1 year ago

SphtKr commented 1 year ago

These changes expand on #1489 / #1239 by adding a Python version that ingests the data and creates the concepts views in the DuckDB database. The python version requires only the duckdb python package and not the command-line executable. Also requires sqlglot.

Run python import-duckdb.py --help for usage, it's pretty simple.

Tested on the MIMIC-III demo and the full version (1.4)... it seems to work, though I'm not sure if there's a better way to test (i.e. compare actual values... I have the Postgres version running in a container if so).

Resolves #1495

alistairewj commented 1 year ago

Seems like there are a few valid changes, feel free to update in your own time and just ping me with a re-review when you think it's ready

SphtKr commented 1 year ago

I have a very interesting problem.

I noticed that the DuckDB version here is producing integer results for DATETIME_DIFF-type operations, e.g. icustay_detail.los_icu, but the PostgreSQL version (implemented in postgres-functions.sql) returns fractional values, e.g. 3.14 days vs. 3 days (the DuckDB version is in fact a floor of the Postgres version, it appears). So I was going to fix that...

Then, since I am actually converting the queries from the BigQuery versions, I consulted the BigQuery documentation for DATETIME_DIFF which says "Returns the whole number of specified part intervals..." so it looks like the BigQuery version of the concepts may differ from the Postgres versions in this area--though I don't have easy access to the BigQuery version to test.

It looks like I can do either in the DuckDB version by changing how the SQL is translated... I may even put in an option to do either... but which one should it be, if there is a "should"? If nothing else I wanted to bring this up if it's an issue to have the discrepancy between the two current versions.

alistairewj commented 1 year ago

Definitely an inconsistency, good spot. I can confirm BigQuery is giving integers after calculating a floor. Immediately I think the easiest fix is to add a FLOOR() call to each of the postgresql functions which emulate BigQuery behaviour (which can be done in a separate PR). I'll raise an issue for that. I think there are a few queries where this would obviously impact the output as it's generated from a DATETIME_DIFF call (age, icustay_detail), and some queries where it's more subtle (e.g. urine_output_rate).

For duckdb I think return whole integers is fine.