tompollard opened 1 year ago
OK, so this is really interesting.
We need some kind of standardized file structure that defines how to map the data files to a relational DB. Does such a standard exist at present? I think I've seen scripts in mimic-code that basically amount to "find every file with a .csv.gz suffix, dump them into pandas to magically guess column types, and create tables according to the file names." Which is great if it works, but maybe we can be a little more rigorous? :)
Definitely I think this is something we should have, though I also strongly feel it should not be tied to any particular proprietary services. The data format description, schema, whatever you call it, it needs to be something that can be consumed by any cloud service or that people can use on their own local DB server.
> We need some kind of standardized file structure that defines how to map the data files to a relational DB. Does such a standard exist at present? ... I also strongly feel it should not be tied to any particular proprietary services.
I think the Google GCP JSON schema isn't far off what we need. There's nothing explicitly proprietary about it. We could use our own "non-proprietary" version and map it to external database systems like GCP.
Here is an example GCP schema file that I created for the `apacheapsvar` table in the eICU dataset that @jraffa is working on.
```json
[
  {"name": "apacheapsvarid", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "patientunitstayid", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "intubated", "type": "BOOLEAN", "mode": "NULLABLE"},
  {"name": "vent", "type": "BOOLEAN", "mode": "NULLABLE"},
  {"name": "dialysis", "type": "BOOLEAN", "mode": "NULLABLE"},
  {"name": "eyes", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "motor", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "verbal", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "meds", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "urine", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "wbc", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "temperature", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "respiratoryrate", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "sodium", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "heartrate", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "meanbp", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "ph", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "hematocrit", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "creatinine", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "albumin", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "pao2", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "pco2", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "bun", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "glucose", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "bilirubin", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "fio2", "type": "FLOAT", "mode": "NULLABLE"}
]
```
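A file in this format is easy to sanity-check before loading. A minimal sketch (the allowed type list below is a subset of BigQuery's types, chosen for illustration, and the function name is hypothetical):

```python
# Allowed values: a subset of BigQuery's schema types/modes, for illustration.
VALID_TYPES = {"STRING", "INTEGER", "FLOAT", "BOOLEAN", "TIMESTAMP", "DATE"}
VALID_MODES = {"REQUIRED", "NULLABLE", "REPEATED"}

def validate_schema(fields):
    """Return a list of problems found in a BigQuery-style JSON schema list."""
    errors = []
    for i, field in enumerate(fields):
        if not field.get("name"):
            errors.append(f"field {i}: missing name")
        if field.get("type") not in VALID_TYPES:
            errors.append(f"field {i}: unknown type {field.get('type')!r}")
        if field.get("mode", "NULLABLE") not in VALID_MODES:
            errors.append(f"field {i}: unknown mode {field.get('mode')!r}")
    return errors

# In practice the list would come from json.load() on the schema file.
example = [{"name": "apacheapsvarid", "type": "INTEGER", "mode": "REQUIRED"}]
print(validate_schema(example))  # []
```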
If we had something similar to this that:
1) specified all tables in a single file
2) specified the database name
3) specified the table names
...then I think this would meet our need?
Sure, that's similar to what I'm thinking. I think we probably would want to explicitly list the input files ("hosp/admissions.csv.gz" corresponds to table "admissions" or whatever).
What about foreign keys or other indexed columns? Arguably indexing is a matter of optimization and kind of out-of-scope, but in practice it's something users may want to know? Foreign-key relationships are useful as documentation and for validating the data.
> What about foreign keys or other indexed columns? Arguably indexing is a matter of optimization and kind of out-of-scope, but in practice it's something users may want to know?
This would be a nice-to-have, but perhaps as a feature we add further down the line? I think the basic build functionality is something we could implement pretty easily, as long as we keep things simple.
As an aside, I did create a YAML file for structuring the MIMIC schema. I explored doing something like this a while back. AFAIK we would just have to invent something that works for us, since I didn't find an existing solution. The general approach I had was (1) a fairly specific initial config, and (2) dialect-specific adapters. For example, you specify columns as `VARCHAR(25)`, and then the BigQuery adapter converts instances of `VARCHAR` to `TEXT` (since BigQuery doesn't support `VARCHAR` directly).
Indexes and constraints get hard since they aren't generic; e.g. they don't exist for BigQuery.
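A dialect adapter along those lines could be a simple lookup on the base type name. A minimal sketch (the mapping table and function name are hypothetical; here `VARCHAR` is mapped to BigQuery's `STRING` schema type, and lengths like `VARCHAR(25)` are dropped since BigQuery types don't carry one):

```python
import re

# Hypothetical generic-SQL -> BigQuery type mapping.
BIGQUERY_TYPE_MAP = {
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "TEXT": "STRING",
    "SMALLINT": "INTEGER",
    "INTEGER": "INTEGER",
    "FLOAT": "FLOAT",
    "TIMESTAMP": "TIMESTAMP",
}

def to_bigquery_type(generic_type):
    """Translate a generic SQL type such as 'VARCHAR(25)' to a BigQuery type."""
    base = re.match(r"[A-Za-z]+", generic_type).group(0).upper()
    return BIGQUERY_TYPE_MAP[base]

print(to_bigquery_type("VARCHAR(25)"))  # STRING
```

A Postgres adapter would be a second mapping table against the same generic config.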
Cool. Could you share that YAML file? Do you think it would be easy to write similar files for the other databases on PhysioNet?
@alistairewj are you doing anything similar with the HDN research environment (is there an automated process for setting up databases/servers in the research environment or is it all done manually)? Seems like there is a lot of potential overlap.
The other thought I had was that we could also generate something like https://mimic.mit.edu/ automatically - online browsable documentation of every table and column.
Yeah for sure. Here is the schema I created (converted to .txt so I can upload it in the issue):
Snippet for readability:
```yaml
- table_name: d_hcpcs
  module: hosp
  columns:
    - name: code
      type: CHAR(5)
      nullable: false
    - name: category
      type: SMALLINT
      nullable: true
    - name: long_description
      type: TEXT
      nullable: true
    - name: short_description
      type: VARCHAR(180)
      nullable: true
- table_name: diagnoses_icd
  module: hosp
  columns:
    - name: subject_id
      type: INTEGER
      nullable: false
...
```
It would be fairly easy to write this for the other databases.
Right now we don't have an automated process for building datasets on BigQuery. We do it manually, but it could certainly be done automatically. For generating docs, I did make a few SchemaSpy pages, using a tool called SchemaSpy that runs automatically against a Postgres DB to produce some useful docs:
https://mit-lcp.github.io/mimic-schema-spy/
... but these require constraints to be useful (as they define the relations), so we are back to square one :)
There's probably another tool that exists and does something simpler like generate static markdown docs for a DB schema.
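A static generator over the YAML schema above could be quite small. A sketch, assuming the YAML has already been parsed into Python dicts (the helper name is hypothetical):

```python
def table_to_markdown(table):
    """Render one table definition as a markdown section with a column table."""
    lines = [
        f"## {table['table_name']} ({table['module']})",
        "",
        "| Column | Type | Nullable |",
        "|---|---|---|",
    ]
    for col in table["columns"]:
        nullable = "yes" if col["nullable"] else "no"
        lines.append(f"| {col['name']} | {col['type']} | {nullable} |")
    return "\n".join(lines)

# One table from the schema snippet above, as parsed YAML would appear.
d_hcpcs = {
    "table_name": "d_hcpcs",
    "module": "hosp",
    "columns": [{"name": "code", "type": "CHAR(5)", "nullable": False}],
}
print(table_to_markdown(d_hcpcs))
```

Looping that over every table and writing the result to a docs directory would give browsable pages like mimic.mit.edu with no extra tooling.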
Just adding a note to highlight this package for generating the BigQuery schema: https://pypi.org/project/bigquery-schema-generator/
It generates a separate JSON metadata file for each table in BigQuery syntax.
If I understand correctly, the separate MIMIC-IV "modules" (`core` and `hosp`) are currently published as two separate "datasets" within the top-level "project" `physionet-data`. I understand the usefulness of dividing it into logical pieces, but this seems at odds with how we manage projects on the PhysioNet level. Do we have a clear idea of how the namespace will be managed?
We also need to be thinking about versioning. We shouldn't be silently replacing version 2.1 with version 2.2 while somebody is in the middle of running an analysis. At the same time, does anybody want to have version numbers scattered all throughout their SQL code? Yuck.
I was previously thinking that we should try to explicitly list the input `.csv.gz` files, but maybe it'd be better to mandate a fixed directory structure that maps directly to the schema file. Would that cause a problem for any existing published databases?
> If I understand correctly, the separate MIMIC-IV "modules" (`core` and `hosp`) are currently published as two separate "datasets" within the top-level "project" `physionet-data`. I understand the usefulness of dividing it into logical pieces, but this seems at odds with how we manage projects on the PhysioNet level. Do we have a clear idea of how the namespace will be managed?
>
> We also need to be thinking about versioning. We shouldn't be silently replacing version 2.1 with version 2.2 while somebody is in the middle of running an analysis. At the same time, does anybody want to have version numbers scattered all throughout their SQL code? Yuck.
I tried once upon a time to email users of BigQuery that v2.0 was being replaced with v2.2. I couldn't get any of the Google Groups to actually e-mail out because they are PhysioNet gsuite groups. I gave up. We have `mimic_hosp` with MIMIC-IV v2.0, and `mimiciv_hosp` with MIMIC-IV v2.2, and this system is terrible. Maybe a separate issue though.
The current version of MIMIC-IV v2.2 has `mimiciv_hosp` and `mimiciv_icu`, so your point stands.
> I was previously thinking that we should try to explicitly list the input `.csv.gz` files, but maybe it'd be better to mandate a fixed directory structure that maps directly to the schema file. Would that cause a problem for any existing published databases?
The only problem I can think of is the BigQuery dataset rules are specific and conflict with project names (e.g. no hyphens).
> We also need to be thinking about versioning. We shouldn't be silently replacing version 2.1 with version 2.2 while somebody is in the middle of running an analysis. At the same time, does anybody want to have version numbers scattered all throughout their SQL code? Yuck.
Yeah, this is a problem right now, and one of the benefits of moving to an approach managed through PhysioNet. Right now, we have no established naming convention.
@bemoody @alistairewj Based on a quick chat with Benjamin, my suggestion is that we require users to include `db-schema.yml` in the root of their project folder. This needs to point to (1) table schema files and (2) table data files.
```yaml
# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Database build schema for MIMIC-IV"

# Databases
databases:
  - name: "bigquery"
    version: "latest"
    config:
      dataset_id: "dataset_id"
    tables:
      - table_name: "table1"
        table_schema: "schemas/bigquery/table1_schema.sql"
        table_data: "data/bigquery/table1_data.csv"
      - table_name: "table2"
        table_schema: "schemas/bigquery/table2_schema.sql"
        table_data: "data/bigquery/table2_data.csv"
```
The idea is that the entire file would be specified by the user, except I think we want to automatically populate the `dataset_id` (which needs to be unique within the project namespace).
The dataset_id would probably be a combination of slug, version, and subfolder. e.g. For MIMIC-IV, there would be 2 bigquery entries, one for hosp and one for icu:
```yaml
# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Database build schema for MIMIC-IV"

# Databases
databases:
  - name: "bigquery"
    version: "latest"
    config:
      dataset_id: "dataset_id_icu"
    tables:
      - table_name: "table1"
        table_schema: "schemas/bigquery/table1_schema.sql"
        table_data: "data/bigquery/table1_data.csv"
      - table_name: "table2"
        table_schema: "schemas/bigquery/table2_schema.sql"
        table_data: "data/bigquery/table2_data.csv"
  - name: "bigquery"
    version: "latest"
    config:
      dataset_id: "dataset_id_hosp"
    tables:
      - table_name: "table1"
        table_schema: "schemas/bigquery/table1_schema.sql"
        table_data: "data/bigquery/table1_data.csv"
      - table_name: "table2"
        table_schema: "schemas/bigquery/table2_schema.sql"
        table_data: "data/bigquery/table2_data.csv"
```
Actually, it's not good that `name: "bigquery"` is repeated, so ideally we'd restructure slightly to avoid that.
Maybe:
```yaml
# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Database build schema for MIMIC-IV"

# Databases
database_services:
  - service: "bigquery"
    version: "latest"
    databases:
      - config:
          dataset_id: "dataset_id_icu"
        tables:
          - table_name: "table1"
            table_schema: "schemas/bigquery/table1_schema.sql"
            table_data: "data/bigquery/table1_data.csv"
          - table_name: "table2"
            table_schema: "schemas/bigquery/table2_schema.sql"
            table_data: "data/bigquery/table2_data.csv"
      - config:
          dataset_id: "dataset_id_hosp"
        tables:
          - table_name: "table1"
            table_schema: "schemas/bigquery/table1_schema.sql"
            table_data: "data/bigquery/table1_data.csv"
          - table_name: "table2"
            table_schema: "schemas/bigquery/table2_schema.sql"
            table_data: "data/bigquery/table2_data.csv"
```
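Whichever layout wins, a `db-schema.yml` of this shape could be validated locally before publication, for instance by checking that every referenced schema and data file exists. A sketch, assuming the YAML has already been parsed into a Python dict (the function name is hypothetical):

```python
from pathlib import Path

def missing_files(config, project_root):
    """Return referenced paths that do not exist under project_root."""
    root = Path(project_root)
    missing = []
    for service in config.get("database_services", []):
        for database in service.get("databases", []):
            for table in database.get("tables", []):
                for key in ("table_schema", "table_data"):
                    path = table.get(key)
                    if path and not (root / path).is_file():
                        missing.append(path)
    return missing
```

Running this as part of project submission would catch broken references long before anything touches BigQuery.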
> The only problem I can think of is the BigQuery dataset rules are specific and conflict with project names (e.g. no hyphens).
I think this is the relevant doc? https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#Dataset
> datasetId: string
> Required. A unique ID for this dataset, without the project name. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
I also notice this:
> isCaseInsensitive: boolean
> Optional. TRUE if the dataset and its table names are case-insensitive, otherwise FALSE. By default, this is FALSE, which means the dataset and its table names are case-sensitive. This field does not affect routine references.
Does that mean that "mimiciv_hosp" and "MimicIV_Hosp" would be two different datasets?
I can see a few options for how to name things. Let's consider the `mimic-iv-demo` project. We could map this to:

- `mimic_iv_demo_hosp_v2_2` and `mimic_iv_demo_icu_v2_2`: potentially ambiguous (since somebody else could create a project called `mimic-iv-demo-hosp`).
- `mimicivdemo_hosp_v2_2` and `mimicivdemo_icu_v2_2`: less readable but less ambiguous. We could add a rule that project slugs must be unique after stripping hyphens.
- `mimic_iv_demo__hosp__v2_2` and `mimic_iv_demo__icu__v2_2`
- `MimicIvDemo_hosp_v2_2` and `MimicIvDemo_icu_v2_2`
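Whichever option is chosen, building the dataset ID can be mechanical: substitute underscores for anything BigQuery disallows and enforce the length limit. A sketch (function name and naming pattern are illustrative only):

```python
import re

MAX_DATASET_ID_LEN = 1024  # per the BigQuery Dataset reference

def to_dataset_id(slug, module, version):
    """Build a candidate dataset ID like 'mimic_iv_demo_hosp_v2_2'."""
    raw = f"{slug}_{module}_v{version}"
    # BigQuery dataset IDs allow only letters, digits, and underscores.
    dataset_id = re.sub(r"[^A-Za-z0-9_]", "_", raw)
    if len(dataset_id) > MAX_DATASET_ID_LEN:
        raise ValueError("dataset ID too long")
    return dataset_id

print(to_dataset_id("mimic-iv-demo", "hosp", "2.2"))  # mimic_iv_demo_hosp_v2_2
```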
> `table_schema: "schemas/bigquery/table1_schema.sql"`
Hmm. I didn't realize that BQ actually allowed defining a schema using SQL.
I'd actually prefer to have the schema defined declaratively, using a format like Alistair's example above - which could be transformed into SQL for different database engines if desired.
> I'd actually prefer to have the schema defined declaratively, using a format like Alistair's example above - which could be transformed into SQL for different database engines if desired.
I'd be okay with this, but I think it is more work because there is so much inconsistency between database systems.
Taking the format Alistair posted above and converting that into a JSON file that can be passed to `bq` sounds pretty easy to me.
I'm not suggesting to implement transformations for other database engines. I'm suggesting to leave room for people to do that in the future.
I also would want to have something that we could validate locally before the project is published.
There are a lot of reasons for wanting a declarative format and I don't see any real downsides. Are there features of BigQuery SQL that wouldn't be captured in Alistair's YAML format?
> Taking https://github.com/MIT-LCP/physionet-build/issues/2058#issuecomment-1688345842 and converting that into a JSON file that can be passed to bq sounds pretty easy to me.
No problem, we can go with the declarative approach. I'm just saying it's a little more work to implement (for me at least). I can dump all of our existing schema from BigQuery and then we're basically done for existing databases. Coming up with a format that outputs the existing schema will take a little more thought.
@bemoody How about this?:
```yaml
# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Declarative database schema definition for MIMIC-IV"
  # schema_prefix will be appended with the module name
  # and version, e.g. `mimic_iv_hosp_v2_2`
  schema_prefix: "mimic-iv-"

# Modules
modules:
  - name: hosp
    tables:
      - table_name: d_hcpcs
        data_file: "data/hosp/d_hcpcs.csv"
        columns:
          - name: code
            type: CHAR(5)
            nullable: false
          - name: category
            type: SMALLINT
            nullable: true
          - name: long_description
            type: TEXT
            nullable: true
          - name: short_description
            type: VARCHAR(180)
            nullable: true
      - table_name: diagnoses_icd
        data_file: "data/hosp/diagnoses_icd.csv"
        columns:
          - name: subject_id
            type: INTEGER
            nullable: false
          - name: icd_code
            type: VARCHAR(10)
            nullable: false
          - name: icd_version
            type: SMALLINT
            nullable: true
          - name: seq_num
            type: INTEGER
            nullable: true
  - name: icu
    tables:
      - table_name: icustays
        data_file: "data/icu/icustays.csv"
        columns:
          - name: stay_id
            type: INTEGER
            nullable: false
          - name: intime
            type: TIMESTAMP
            nullable: false
          - name: outtime
            type: TIMESTAMP
            nullable: true
          - name: length_of_stay
            type: FLOAT
            nullable: true

# Configurations for specific database services (optional)
database_services:
  - service: bigquery
    version: "latest"
    type_mappings:
      CHAR: "STRING"
      SMALLINT: "INTEGER"
      TEXT: "STRING"
      VARCHAR: "STRING"
```
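For the BigQuery service, the declarative columns plus the `type_mappings` block are enough to emit the JSON schema that `bq` expects. A sketch, assuming the YAML has been parsed into Python structures (the converter name is hypothetical):

```python
import json
import re

def to_bigquery_schema(table, type_mappings):
    """Convert declarative column defs into a BigQuery JSON schema list."""
    fields = []
    for col in table["columns"]:
        base = re.match(r"[A-Za-z]+", col["type"]).group(0)  # CHAR(5) -> CHAR
        fields.append({
            "name": col["name"],
            "type": type_mappings.get(base, base),
            "mode": "NULLABLE" if col["nullable"] else "REQUIRED",
        })
    return fields

# Values lifted from the proposal above.
mappings = {"CHAR": "STRING", "SMALLINT": "INTEGER", "TEXT": "STRING", "VARCHAR": "STRING"}
table = {"table_name": "d_hcpcs", "columns": [
    {"name": "code", "type": "CHAR(5)", "nullable": False},
    {"name": "category", "type": "SMALLINT", "nullable": True},
]}
print(json.dumps(to_bigquery_schema(table, mappings), indent=2))
```

Other engines would get their own mapping tables while the declarative file stays untouched.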
> Does that mean that "mimiciv_hosp" and "MimicIV_Hosp" would be two different datasets?
Possibly, but I don't think there is any good reason to support casing. I'd like to convert everything to lower case for simplicity.
> I can see a few options for how to name things. Let's consider the mimic-iv-demo project. We could map this to:
>
> - mimic_iv_demo_hosp_v2_2 and mimic_iv_demo_icu_v2_2: potentially ambiguous (since somebody else could create a project called mimic-iv-demo-hosp.)
> - mimicivdemo_hosp_v2_2 and mimicivdemo_icu_v2_2: less readable but less ambiguous. We could add a rule that project slugs must be unique after stripping hyphens.
> - mimic_iv_demo__hosp__v2_2 and mimic_iv_demo__icu__v2_2
> - MimicIvDemo_hosp_v2_2 and MimicIvDemo_icu_v2_2
My preference would be to update our slug validation rules to ensure uniqueness after converting all special characters to `_` (or something like this).
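That validation rule is straightforward to enforce: normalize the candidate slug and all existing slugs the same way, then reject on collision. A sketch (normalizing all special characters to `_` follows the suggestion above; the function names are hypothetical):

```python
import re

def normalize_slug(slug):
    """Collapse every non-alphanumeric character to '_' and lower-case."""
    return re.sub(r"[^a-z0-9]", "_", slug.lower())

def slug_conflicts(candidate, existing_slugs):
    """True if the candidate collides with an existing slug after normalization."""
    normalized = normalize_slug(candidate)
    return any(normalize_slug(s) == normalized for s in existing_slugs)

print(slug_conflicts("mimic-iv-demo", ["mimic_iv_demo"]))  # True
```

This would live in the project-creation form validation, so collisions are rejected at submission time rather than at BigQuery build time.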
> How about this?
I think there's room to simplify! "Perfection is achieved when there is nothing left to take away."
On the other hand, I'd also say that each of the modules, tables, and columns should allow an optional "description".
"schema_prefix" isn't necessary. The dataset name should be determined by projct slug, module name, and version, nothing else.
I don't think that data type mappings for BigQuery belong here either. The data types should describe the data using a short list of supported types, and the BigQuery importer defines how to translate those to BigQuery.
We should add a "Send dataset to GCP" button to the management page (https://physionet.org/console/published-projects/SLUG/VERSION/) for published projects. Clicking the button would trigger a function that:
1) builds the database on the `physionet-data` project on BigQuery
2) sets the appropriate permissions on the database
Somewhere there would need to be instructions for the build. A BigQuery JSON schema file may be enough?
See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json for sample code for uploading data to BigQuery.