MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

Add "Send dataset to BigQuery" button to published project management page #2058

Open · tompollard opened this issue 1 year ago

tompollard commented 1 year ago

We should add a "Send dataset to BigQuery" button to the management page (https://physionet.org/console/published-projects/SLUG/VERSION/) for published projects.

Clicking the button would trigger a function that:

1) builds the database on the physionet-data project on BigQuery
2) sets the appropriate permissions on the database

Somewhere there would need to be instructions for the build. A BigQuery JSON schema file may be enough?

See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json for sample code for uploading data to BigQuery.
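For illustration, here is a minimal sketch of what the build step could look like with the google-cloud-bigquery Python client; the bucket path, table ID, and schema file name are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="physionet-data")

# Load a gzipped CSV from Cloud Storage into a BigQuery table, using a
# JSON schema file like the one discussed below. Names are illustrative.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=client.schema_from_json("apacheapsvar_schema.json"),
)
job = client.load_table_from_uri(
    "gs://example-bucket/eicu/apacheapsvar.csv.gz",
    "physionet-data.eicu.apacheapsvar",
    job_config=job_config,
)
job.result()  # blocks until the load job completes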

bemoody commented 1 year ago

OK, so this is really interesting.

We need some kind of standardized file structure that defines how to map the data files to a relational DB. Does such a standard exist at present? I think I've seen scripts in mimic-code that basically amount to "find every file with a .csv.gz suffix, dump them into pandas to magically guess column types, and create tables according to the file names." Which is great if it works, but maybe we can be a little more rigorous? :)

Definitely I think this is something we should have, though I also strongly feel it should not be tied to any particular proprietary service. The data format description, schema, whatever you call it, needs to be something that can be consumed by any cloud service, or that people can use on their own local DB server.

tompollard commented 1 year ago

We need some kind of standardized file structure that defines how to map the data files to a relational DB. Does such a standard exist at present? ... I also strongly feel it should not be tied to any particular proprietary services.

I think the Google BigQuery JSON schema format isn't far off what we need. There's nothing explicitly proprietary about it. We could use our own "non-proprietary" version and map it to external database systems like BigQuery.

Here is an example BigQuery schema file that I created for the apacheapsvar table in the eICU dataset that @jraffa is working on.

[
    {
        "name": "apacheapsvarid",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "patientunitstayid",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "intubated",
        "type": "BOOLEAN",
        "mode": "NULLABLE"
    },
    {
        "name": "vent",
        "type": "BOOLEAN",
        "mode": "NULLABLE"
    },
    {
        "name": "dialysis",
        "type": "BOOLEAN",
        "mode": "NULLABLE"
    },
    {
        "name": "eyes",
        "type": "INTEGER",
        "mode": "NULLABLE"
    },
    {
        "name": "motor",
        "type": "INTEGER",
        "mode": "NULLABLE"
    },
    {
        "name": "verbal",
        "type": "INTEGER",
        "mode": "NULLABLE"
    },
    {
        "name": "meds",
        "type": "INTEGER",
        "mode": "NULLABLE"
    },
    {
        "name": "urine",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "wbc",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "temperature",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "respiratoryrate",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "sodium",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "heartrate",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "meanbp",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "ph",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "hematocrit",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "creatinine",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "albumin",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "pao2",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "pco2",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "bun",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "glucose",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "bilirubin",
        "type": "FLOAT",
        "mode": "NULLABLE"
    },
    {
        "name": "fio2",
        "type": "FLOAT",
        "mode": "NULLABLE"
    }
]

If we had something similar to this that:

1) specified all tables in a single file
2) specified the database name
3) specified the table names

...then I think this would meet our needs?

bemoody commented 1 year ago

Sure, that's similar to what I'm thinking. I think we probably would want to explicitly list the input files ("hosp/admissions.csv.gz" corresponds to table "admissions" or whatever).

What about foreign keys or other indexed columns? Arguably indexing is a matter of optimization and kind of out-of-scope, but in practice it's something users may want to know? Foreign-key relationships are useful as documentation and for validating the data.

tompollard commented 1 year ago

What about foreign keys or other indexed columns? Arguably indexing is a matter of optimization and kind of out-of-scope, but in practice it's something users may want to know?

This would be a nice-to-have, but perhaps as a feature we add further down the line? I think the basic build functionality is something we could implement pretty easily, as long as we keep things simple.

alistairewj commented 1 year ago

As an aside, I did create a YAML file for structuring the MIMIC schema. I explored doing something like this a while back; AFAIK we would just have to invent something that works for us, since I didn't find an existing solution. The general approach I had was (1) a fairly specific initial config, and (2) dialect-specific adapters. For example, you specify columns as VARCHAR(25), and then the BigQuery adapter converts instances of VARCHAR to TEXT (since BigQuery doesn't support VARCHAR directly).

Indexes and constraints get hard since they aren't generic, e.g. they don't exist for BigQuery.
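As a rough illustration of the adapter idea (the type names and mapping choices below are assumptions, not a settled design):

import re

# Generic-to-BigQuery type mapping; the exact choices are illustrative.
BIGQUERY_TYPES = {
    "CHAR": "STRING",
    "VARCHAR": "STRING",
    "TEXT": "STRING",
    "SMALLINT": "INT64",
    "INTEGER": "INT64",
    "FLOAT": "FLOAT64",
    "TIMESTAMP": "TIMESTAMP",
}

def to_bigquery_type(generic_type):
    # Strip any length parameter, e.g. "VARCHAR(25)" -> "VARCHAR".
    base = re.match(r"[A-Za-z]+", generic_type).group(0).upper()
    return BIGQUERY_TYPES[base]

assert to_bigquery_type("VARCHAR(25)") == "STRING"
assert to_bigquery_type("SMALLINT") == "INT64"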

bemoody commented 1 year ago

Cool. Could you share that YAML file? Do you think it would be easy to write similar files for the other databases on PhysioNet?

bemoody commented 1 year ago

@alistairewj are you doing anything similar with the HDN research environment (is there an automated process for setting up databases/servers in the research environment or is it all done manually)? Seems like there is a lot of potential overlap.

The other thought I had was that we could also generate something like https://mimic.mit.edu/ automatically - online browsable documentation of every table and column.

alistairewj commented 1 year ago

Yeah for sure. Here is the schema I created (converted to .txt so I can upload it in the issue):

mimic_iv_schema.yaml.txt

Snippet for readability:

- table_name: d_hcpcs
  module: hosp
  columns:
    - name: code
      type: CHAR(5)
      nullable: false
    - name: category
      type: SMALLINT
      nullable: true
    - name: long_description
      type: TEXT
      nullable: true
    - name: short_description
      type: VARCHAR(180)
      nullable: true
- table_name: diagnoses_icd
  module: hosp
  columns:
    - name: subject_id
      type: INTEGER
      nullable: false
...

It would be fairly easy to write this for the other databases.

Right now we don't have an automated process for building datasets on BigQuery. We do it manually, but it could certainly be done automatically. For generating docs, I did make a few schema spy pages using a tool called SchemaSpy, which is run automatically against a Postgres DB to produce some useful docs:

https://mit-lcp.github.io/mimic-schema-spy/

... but these require constraints to be useful (as they define the relations), so we are back to square one :)

There's probably another tool that exists and does something simpler like generate static markdown docs for a DB schema.
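For example, generating static markdown docs from a YAML schema like the one above only takes a few lines; a sketch, assuming PyYAML and the field names from the snippet:

import yaml

def schema_to_markdown(path):
    # Render one markdown table per schema table. A sketch only.
    with open(path) as f:
        tables = yaml.safe_load(f)
    lines = []
    for t in tables:
        lines.append("## %s.%s" % (t["module"], t["table_name"]))
        lines.append("")
        lines.append("| Column | Type | Nullable |")
        lines.append("| --- | --- | --- |")
        for c in t["columns"]:
            lines.append("| %s | %s | %s |" % (c["name"], c["type"], c["nullable"]))
        lines.append("")
    return "\n".join(lines)

print(schema_to_markdown("mimic_iv_schema.yaml"))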

tompollard commented 1 year ago

Just adding a note to highlight this package for generating BigQuery schemas: https://pypi.org/project/bigquery-schema-generator/

It generates a separate JSON metadata file, in BigQuery syntax, for each table.
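For programmatic use, a sketch assuming the SchemaGenerator API described in the package docs (the input file name is hypothetical):

from bigquery_schema_generator.generate_schema import SchemaGenerator

# Infer a BigQuery JSON schema from the contents of a CSV file.
generator = SchemaGenerator(input_format="csv", infer_mode=True)
with open("apacheapsvar.csv") as f:
    schema_map, errors = generator.deduce_schema(f)
schema = generator.flatten_schema(schema_map)  # list of BigQuery field dicts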

bemoody commented 1 year ago

If I understand correctly, the separate MIMIC-IV "modules" (core and hosp) are currently published as two separate "datasets" within the top-level "project" physionet-data. I understand the usefulness of dividing it into logical pieces, but this seems at odds with how we manage projects on the PhysioNet level. Do we have a clear idea of how the namespace will be managed?

We also need to be thinking about versioning. We shouldn't be silently replacing version 2.1 with version 2.2 while somebody is in the middle of running an analysis. At the same time, does anybody want to have version numbers scattered all throughout their SQL code? Yuck.

I was previously thinking that we should try to explicitly list the input .csv.gz files, but maybe it'd be better to mandate a fixed directory structure that maps directly to the schema file. Would that cause a problem for any existing published databases?

alistairewj commented 1 year ago

If I understand correctly, the separate MIMIC-IV "modules" (core and hosp) are currently published as two separate "datasets" within the top-level "project" physionet-data. I understand the usefulness of dividing it into logical pieces, but this seems at odds with how we manage projects on the PhysioNet level. Do we have a clear idea of how the namespace will be managed?

We also need to be thinking about versioning. We shouldn't be silently replacing version 2.1 with version 2.2 while somebody is in the middle of running an analysis. At the same time, does anybody want to have version numbers scattered all throughout their SQL code? Yuck.

I tried once upon a time to email users of BigQuery that v2.0 was being replaced with v2.2. I couldn't get any of the Google Groups to actually send mail, because they are PhysioNet GSuite groups, so I gave up. We have mimic_hosp with MIMIC-IV v2.0 and mimiciv_hosp with MIMIC-IV v2.2, and this system is terrible. Maybe a separate issue though.

The current version, MIMIC-IV v2.2, has mimiciv_hosp and mimiciv_icu, so your point stands.

I was previously thinking that we should try to explicitly list the input .csv.gz files, but maybe it'd be better to mandate a fixed directory structure that maps directly to the schema file. Would that cause a problem for any existing published databases?

The only problem I can think of is that the BigQuery dataset naming rules are strict and conflict with our project slugs (e.g. hyphens are not allowed).

tompollard commented 1 year ago

We also need to be thinking about versioning. We shouldn't be silently replacing version 2.1 with version 2.2 while somebody is in the middle of running an analysis. At the same time, does anybody want to have version numbers scattered all throughout their SQL code? Yuck.

Yeah, this is a problem right now, and fixing it is one of the benefits of moving to an approach managed through PhysioNet. At the moment we have no established naming convention.

tompollard commented 1 week ago

@bemoody @alistairewj Based on a quick chat with Benjamin, my suggestion is that we require users to include a db-schema.yml file in the root of their project folder. This needs to point to (1) table schema files and (2) table data files.

# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Database build schema for MIMIC-IV"

# Databases
databases:
  - name: "bigquery"
    version: "latest"
    config:
      dataset_id: "dataset_id"
    tables:
      - table_name: "table1"
        table_schema: "schemas/bigquery/table1_schema.sql"
        table_data: "data/bigquery/table1_data.csv"
      - table_name: "table2"
        table_schema: "schemas/bigquery/table2_schema.sql"
        table_data: "data/bigquery/table2_data.csv"

The idea is that the entire file would be specified by the user, except I think we want to automatically populate the dataset_id (which needs to be unique within the project namespace).

The dataset_id would probably be a combination of slug, version, and subfolder. For MIMIC-IV, for example, there would be two bigquery entries, one for hosp and one for icu:

# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Database build schema for MIMIC-IV"

# Databases
databases:
  - name: "bigquery"
    version: "latest"
    config:
      dataset_id: "dataset_id_icu"
    tables:
      - table_name: "table1"
        table_schema: "schemas/bigquery/table1_schema.sql"
        table_data: "data/bigquery/table1_data.csv"
      - table_name: "table2"
        table_schema: "schemas/bigquery/table2_schema.sql"
        table_data: "data/bigquery/table2_data.csv"
  - name: "bigquery"
    version: "latest"
    config:
      dataset_id: "dataset_id_hosp"
    tables:
      - table_name: "table1"
        table_schema: "schemas/bigquery/table1_schema.sql"
        table_data: "data/bigquery/table1_data.csv"
      - table_name: "table2"
        table_schema: "schemas/bigquery/table2_schema.sql"
        table_data: "data/bigquery/table2_data.csv"

tompollard commented 1 week ago

Actually, it's not good that name: "bigquery" is repeated, so ideally we'd restructure slightly to avoid that.

tompollard commented 1 week ago

Maybe:

# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Database build schema for MIMIC-IV"

# Databases
database_services:
  - service: "bigquery"
    version: "latest"
    databases:
      - config:
          dataset_id: "dataset_id_icu"
        tables:
          - table_name: "table1"
            table_schema: "schemas/bigquery/table1_schema.sql"
            table_data: "data/bigquery/table1_data.csv"
          - table_name: "table2"
            table_schema: "schemas/bigquery/table2_schema.sql"
            table_data: "data/bigquery/table2_data.csv"
      - config:
          dataset_id: "dataset_id_hosp"
        tables:
          - table_name: "table1"
            table_schema: "schemas/bigquery/table1_schema.sql"
            table_data: "data/bigquery/table1_data.csv"
          - table_name: "table2"
            table_schema: "schemas/bigquery/table2_schema.sql"
            table_data: "data/bigquery/table2_data.csv"

bemoody commented 1 week ago

The only problem I can think of is the BigQuery dataset rules are specific and conflict with project names (e.g. no hyphens).

I think this is the relevant doc? https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#Dataset

datasetId: string Required. A unique ID for this dataset, without the project name. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.

I also notice this:

isCaseInsensitive: boolean Optional. TRUE if the dataset and its table names are case-insensitive, otherwise FALSE. By default, this is FALSE, which means the dataset and its table names are case-sensitive. This field does not affect routine references.

Does that mean that "mimiciv_hosp" and "MimicIV_Hosp" would be two different datasets?

I can see a few options for how to name things. Let's consider the mimic-iv-demo project. We could map this to:

mimic_iv_demo_hosp_v2_2 and mimic_iv_demo_icu_v2_2: potentially ambiguous (since somebody else could create a project called mimic-iv-demo-hosp.)

mimicivdemo_hosp_v2_2 and mimicivdemo_icu_v2_2: less readable but less ambiguous. We could add a rule that project slugs must be unique after stripping hyphens.

mimic_iv_demohospv2_2 and mimic_iv_demoicuv2_2

MimicIvDemo_hosp_v2_2 and MimicIvDemo_icu_v2_2

bemoody commented 1 week ago

table_schema: "schemas/bigquery/table1_schema.sql"

Hmm. I didn't realize that BQ actually allowed defining a schema using SQL.

I'd actually prefer to have the schema defined declaratively, using a format like Alistair's example above - which could be transformed into SQL for different database engines if desired.

tompollard commented 1 week ago

I'd actually prefer to have the schema defined declaratively, using a format like Alistair's example above - which could be transformed into SQL for different database engines if desired.

I'd be okay with this, but I think it is more work because there is so much inconsistency between database systems.

bemoody commented 1 week ago

Taking the format Alistair posted above and converting that into a JSON file that can be passed to bq sounds pretty easy to me.

I'm not suggesting to implement transformations for other database engines. I'm suggesting to leave room for people to do that in the future.

I also would want to have something that we could validate locally before the project is published.

There are a lot of reasons for wanting a declarative format and I don't see any real downsides. Are there features of BigQuery SQL that wouldn't be captured in Alistair's YAML format?
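As an example of the kind of local validation this would enable, here is a sketch that checks a declarative schema against the project's CSV headers (the data_file field and file layout are assumptions, following the examples in this thread):

import csv
import yaml

def validate_schema(schema_path):
    # Check that each table's data file exists and that its CSV header
    # matches the declared columns. A sketch; error handling is minimal.
    with open(schema_path) as f:
        tables = yaml.safe_load(f)
    problems = []
    for t in tables:
        try:
            with open(t["data_file"]) as f:
                header = next(csv.reader(f))
        except OSError as e:
            problems.append(str(e))
            continue
        declared = [c["name"] for c in t["columns"]]
        if header != declared:
            problems.append("%s: header %r != schema %r"
                            % (t["table_name"], header, declared))
    return problems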

tompollard commented 1 week ago

Taking https://github.com/MIT-LCP/physionet-build/issues/2058#issuecomment-1688345842 and converting that into a JSON file that can be passed to bq sounds pretty easy to me.

No problem, we can go with the declarative approach. I'm just saying it's a little more work to implement (for me at least). I can dump all of our existing schemas from BigQuery and then we're basically done for existing databases. Coming up with a format that outputs the existing schema will take a little more thought.
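For existing tables, the schema dump is easy with the Python client; a sketch (the table ID is just an example):

from google.cloud import bigquery

client = bigquery.Client(project="physionet-data")
table = client.get_table("physionet-data.mimiciv_hosp.admissions")
# Write the schema out in the standard BigQuery JSON schema format.
client.schema_to_json(table.schema, "admissions_schema.json")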

tompollard commented 1 week ago

@bemoody How about this?

# db-schema.yml

# Metadata
build_info:
  project_name: "MIMIC-IV"
  description: "Declarative database schema definition for MIMIC-IV"
  # schema_prefix is combined with the module name and version
  # to form the dataset name, e.g. `mimic_iv_hosp_v2_2`
  schema_prefix: "mimic_iv"

# Modules
modules:
  - name: hosp
    tables:
      - table_name: d_hcpcs
        data_file: "data/hosp/d_hcpcs.csv"
        columns:
          - name: code
            type: CHAR(5)
            nullable: false
          - name: category
            type: SMALLINT
            nullable: true
          - name: long_description
            type: TEXT
            nullable: true
          - name: short_description
            type: VARCHAR(180)
            nullable: true
      - table_name: diagnoses_icd
        data_file: "data/hosp/diagnoses_icd.csv"
        columns:
          - name: subject_id
            type: INTEGER
            nullable: false
          - name: icd_code
            type: VARCHAR(10)
            nullable: false
          - name: icd_version
            type: SMALLINT
            nullable: true
          - name: seq_num
            type: INTEGER
            nullable: true
  - name: icu
    tables:
      - table_name: icustays
        data_file: "data/icu/icustays.csv"
        columns:
          - name: stay_id
            type: INTEGER
            nullable: false
          - name: intime
            type: TIMESTAMP
            nullable: false
          - name: outtime
            type: TIMESTAMP
            nullable: true
          - name: length_of_stay
            type: FLOAT
            nullable: true

# Configurations for specific database services (optional)
database_services:
  - service: bigquery
    version: "latest"
    type_mappings:
      CHAR: "STRING"
      SMALLINT: "INTEGER"
      TEXT: "STRING"
      VARCHAR: "STRING"

tompollard commented 1 week ago

Does that mean that "mimiciv_hosp" and "MimicIV_Hosp" would be two different datasets?

Possibly, but I don't think there is any good reason to support mixed case. I'd like to convert everything to lower case for simplicity.

I can see a few options for how to name things. Let's consider the mimic-iv-demo project. We could map this to:

mimic_iv_demo_hosp_v2_2 and mimic_iv_demo_icu_v2_2: potentially ambiguous (since somebody else could create a project called mimic-iv-demo-hosp.)

mimicivdemo_hosp_v2_2 and mimicivdemo_icu_v2_2: less readable but less ambiguous. We could add a rule that project slugs must be unique after stripping hyphens.

mimic_iv_demohospv2_2 and mimic_iv_demoicuv2_2

MimicIvDemo_hosp_v2_2 and MimicIvDemo_icu_v2_2

My preference would be to update our slug validation rules to ensure uniqueness after converting all special characters to _ (or something like this).
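A sketch of the kind of normalization I have in mind (the exact rules are up for discussion):

import re

def dataset_id(slug, module, version):
    # Compose a BigQuery dataset ID from the project slug, module name,
    # and version. BigQuery allows only letters, digits, and underscores.
    raw = "%s_%s_v%s" % (slug, module, version)
    return re.sub(r"[^A-Za-z0-9_]", "_", raw).lower()

assert dataset_id("mimic-iv-demo", "hosp", "2.2") == "mimic_iv_demo_hosp_v2_2"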

bemoody commented 5 days ago

How about this?

I think there's room to simplify! "Perfection is achieved when there is nothing left to take away."

On the other hand, I'd also say that each of the modules, tables, and columns should allow an optional "description".

"schema_prefix" isn't necessary. The dataset name should be determined by projct slug, module name, and version, nothing else.

I don't think that data type mappings for BigQuery belong here either. The data types should describe the data using a short list of supported types, and the BigQuery importer should define how to translate those to BigQuery types.
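To make that concrete, something like this is what I'd have in mind; the field names are illustrative, not settled:

# db-schema.yml (sketch)
modules:
  - name: hosp
    description: "Hospital-wide EHR data"
    tables:
      - name: d_hcpcs
        description: "Dimension table for HCPCS codes"
        data_file: "hosp/d_hcpcs.csv.gz"
        columns:
          - name: code
            type: CHAR(5)
            nullable: false
            description: "Five-character HCPCS code"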