OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
903 stars 453 forks source link

Instructions for using sqlite DDLs for non-R users #693

Open barabo opened 6 months ago

barabo commented 6 months ago

Hi there!

I've been playing around trying to get the DDLs loaded for sqlite3, version 5.4 using the sqlite3 command line and I've been having a pretty rough go of it using the provided definitions.

I didn't find any documentation here or elsewhere about the @cdmDatabaseSchema notation in the ddl files, so I went ahead and removed them. I assume they are something used by the sqlrender utility - but I'm not an R user - I'm just trying to create a working OMOPCDM database in sqlite.

I also discovered that sqlite does not allow you to apply primary or foreign keys to tables using the ALTER TABLE commands provided. So, I wrote a brittle little script to apply the PK and FK definitions to the main _ddl.sql file.

This allows me to use the patched _ddl.sql file to define tables with PKs and FKs. The _indices.sql file seems fine as-is.

I did notice that the COHORT table is missing a PRIMARY KEY, so the foreign key check fails in a clean install for just that table. It almost seems like COHORT should have cohort_id integer PRIMARY KEY and COHORT_DEFINITION should have cohort_definition_id integer PRIMARY KEY - and then the COHORT table should have a FK on COHORT_DEFINITION.cohort_definition_id (instead, it seems backwards). But as it is now, if I make COHORT.cohort_definition_id the PK of the COHORT table - it gets me past the failing foreign key checks. Anyway, I expect some of that will change in future versions - my script includes a hack for this.

For posterity, I'm providing my script here. However, are there instructions for people who don't want to use R but just want to create a working DB out there somewhere?

Click to see my script ```sh #!/bin/bash # # File: update-ddl.sh # # Converts the OMOP-provided DDL database definition scripts into a format # that can be loaded by sqlite3. # # This has been tested against the provided sqlite 5.4 DDL files. # # To run this, place copies of the provided .sql files into this directory # and run the script. It will modify the sql files and create an empty # database called cdm.db # set -e set -o pipefail set -u # There should be only one DDL file in the local directory. DDL=$( ls OMOPCDM*_ddl.sql ) [ ! -e ${DDL} ] && echo "Failed to find an OMOPCDM*_ddl.sql schema!" && exit 1 # Awk scriptlet to add a PRIMARY KEY declaration to a table.column in a table # definition. # # Assumptions: # 'table' is padded with spaces # 'column' begins with a tab and ends with a space # AWK_PK=' /^CREATE TABLE / { if ($0 ~ table) { target=1 } else { target=0 } } { if (target && $0 ~ column && (!($0 ~ / PRIMARY KEY,/))) { sub(/,/, " PRIMARY KEY,"); } print $0 } ' # Awk scriptlet to add a FOREIGN KEY declaration to a table.column in a table # definition. # # Assumptions: # 'table' is padded with spaces # 'column' begins with a tab and ends with a space # 'ref' is multiple words long # AWK_FK=' /^CREATE TABLE / { if ($0 ~ table) { target=1 } else { target=0 } } { if (target && $0 ~ column && (!($0 ~ / REFERENCES /))) { if ($0 ~ /,$/) { sub(/,$/, " REFERENCES " ref ","); } else { sub(/ );$/, " REFERENCES " ref ");"); } } print $0 } ' ## # Gets the table names and PK columns from constraints. # function get_pks() { grep '^ALTER TABLE ' OMOPCDM*_primary_keys.sql \ | sed -e 's:.* \(.*\) ADD CONSTRAINT .* PRIMARY KEY .\(.*\).;:\1 \2:' } ## # Decorate the PK columns with PRIMARY KEY if not already decorated. # function mark_pk() { # Apply the change to the ddl file content, saving into a temp file. cat ${DDL} | awk -v table=" ${1} " -v column=" ${2} " "${AWK_PK}" \ > ${DDL}.tmp # diff the files to assert diff count is 1 changed line only. (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1 # Apply the temp file to the DDL file. mv ${DDL}.tmp ${DDL} } function get_fks() { grep '^ALTER TABLE ' OMOPCDM*_constraints.sql \ | sed -e 's:ALTER TABLE \(.*\) ADD CONSTRAINT .* FOREIGN KEY .\([^ ]*\). REFERENCES:\1 \2:' \ | sed -e 's:;::' } ## # Decorate the PK columns with PRIMARY KEY if not already decorated. # function mark_fk() { # Apply the change to the ddl file content, saving into a temp file. cat ${DDL} \ | awk -v table=" ${1} " -v column=" ${2} " -v ref="${3}" "${AWK_FK}" \ > ${DDL}.tmp # diff the files to assert diff count is 1 changed line only. (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1 # Apply the temp file to the DDL file. mv ${DDL}.tmp ${DDL} } # Remove the template schema reference, which isn't supported in sqlite. sed -i -e 's:@cdmDatabaseSchema.::g' OMOPCDM*.sql && rm -f OMOPCDM*.sql-e # Update the table definitions to insert the PRIMARY KEY declarations. echo "Adding PRIMARY KEYS to DDL" get_pks | while read table column; do mark_pk ${table} ${column} done # Insert the FK constraints into the table. echo "Adding FOREIGN KEYS to DDL" get_fks | while read table column reference; do mark_fk ${table} ${column} "${reference}" done # HACK: no PK in the cohort table, so this is needed to satisfy FK checks. mark_pk cohort cohort_definition_id # Populate the DB tables and indices. echo "Creating empty cdm.db database" rm -f cdm.db touch cdm.db sqlite3 cdm.db < ${DDL} sqlite3 cdm.db < OMOPCDM*_indices.sql sqlite3 cdm.db "pragma foreign_key_check" # Success! echo "DONE!" ``` This updates the `_ddl.sql` file to look like this: ```sql CREATE TABLE person ( person_id integer NOT NULL PRIMARY KEY, gender_concept_id integer NOT NULL REFERENCES CONCEPT (CONCEPT_ID), year_of_birth integer NOT NULL, month_of_birth integer NULL, day_of_birth integer NULL, birth_datetime REAL NULL, race_concept_id integer NOT NULL REFERENCES CONCEPT (CONCEPT_ID), ethnicity_concept_id integer NOT NULL REFERENCES CONCEPT (CONCEPT_ID), location_id integer NULL REFERENCES LOCATION (LOCATION_ID), provider_id integer NULL REFERENCES PROVIDER (PROVIDER_ID), care_site_id integer NULL REFERENCES CARE_SITE (CARE_SITE_ID), person_source_value TEXT NULL, gender_source_value TEXT NULL, gender_source_concept_id integer NULL REFERENCES CONCEPT (CONCEPT_ID), race_source_value TEXT NULL, race_source_concept_id integer NULL REFERENCES CONCEPT (CONCEPT_ID), ethnicity_source_value TEXT NULL, ethnicity_source_concept_id integer NULL REFERENCES CONCEPT (CONCEPT_ID)); ``` When the script runs correctly, you see output like this: ``` (base) anderson.carl3@home test % ./update-ddl.sh Adding PRIMARY KEYS to DDL Adding FOREIGN KEYS to DDL Creating empty cdm.db database DONE! ```
thisismexp commented 6 months ago

Thanks for your script, saved me a lot of effort. I extended your script to replace tabs with spaces:

```shell #!/bin/bash # # File: update-ddl.sh # # Converts the OMOP-provided DDL database definition scripts into a format # that can be loaded by sqlite3. # # This has been tested against the provided sqlite 5.4 DDL files. # # To run this, place copies of the provided .sql files into this directory # and run the script. It will modify the sql files and create an empty # database called cdm.db # set -e set -o pipefail set -u # @thisismexp: replace \t with spaces d="$(mktemp -d)" for filename in *.sql; do expand -t 4 "$filename" > "$d"/"$filename" done mv "$d"/* . rmdir "$d" # There should be only one DDL file in the local directory. DDL=$( ls OMOPCDM*_ddl.sql ) [ ! -e ${DDL} ] && echo "Failed to find an OMOPCDM*_ddl.sql schema!" && exit 1 # Awk scriptlet to add a PRIMARY KEY declaration to a table.column in a table # definition. # # Assumptions: # 'table' is padded with spaces # 'column' begins with a tab and ends with a space # AWK_PK=' /^CREATE TABLE / { if ($0 ~ table) { target=1 } else { target=0 } } { if (target && $0 ~ column && (!($0 ~ / PRIMARY KEY,/))) { sub(/,/, " PRIMARY KEY,"); } print $0 } ' # Awk scriptlet to add a FOREIGN KEY declaration to a table.column in a table # definition. # # Assumptions: # 'table' is padded with spaces # 'column' begins with a tab and ends with a space # 'ref' is multiple words long # AWK_FK=' /^CREATE TABLE / { if ($0 ~ table) { target=1 } else { target=0 } } { if (target && $0 ~ column && (!($0 ~ / REFERENCES /))) { if ($0 ~ /,$/) { sub(/,$/, " REFERENCES " ref ","); } else { sub(/ );$/, " REFERENCES " ref ");"); } } print $0 } ' ## # Gets the table names and PK columns from constraints. # function get_pks() { grep '^ALTER TABLE ' OMOPCDM*_primary_keys.sql \ | sed -e 's:.* \(.*\) ADD CONSTRAINT .* PRIMARY KEY .\(.*\).;:\1 \2:' } ## # Decorate the PK columns with PRIMARY KEY if not already decorated. # function mark_pk() { # Apply the change to the ddl file content, saving into a temp file. cat ${DDL} | awk -v table=" ${1} " -v column=" ${2} " "${AWK_PK}" \ > ${DDL}.tmp # diff the files to assert diff count is 1 changed line only. (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1 # Apply the temp file to the DDL file. mv ${DDL}.tmp ${DDL} } function get_fks() { grep '^ALTER TABLE ' OMOPCDM*_constraints.sql \ | sed -e 's:ALTER TABLE \(.*\) ADD CONSTRAINT .* FOREIGN KEY .\([^ ]*\). REFERENCES:\1 \2:' \ | sed -e 's:;::' } ## # Decorate the PK columns with PRIMARY KEY if not already decorated. # function mark_fk() { # Apply the change to the ddl file content, saving into a temp file. cat ${DDL} \ | awk -v table=" ${1} " -v column=" ${2} " -v ref="${3}" "${AWK_FK}" \ > ${DDL}.tmp # diff the files to assert diff count is 1 changed line only. (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1 # Apply the temp file to the DDL file. mv ${DDL}.tmp ${DDL} } # Remove the template schema reference, which isn't supported in sqlite. sed -i -e 's:@cdmDatabaseSchema.::g' OMOPCDM*.sql && rm -f OMOPCDM*.sql-e # Update the table definitions to insert the PRIMARY KEY declarations. echo "Adding PRIMARY KEYS to DDL" get_pks | while read table column; do mark_pk ${table} ${column} done # Insert the FK constraints into the table. echo "Adding FOREIGN KEYS to DDL" get_fks | while read table column reference; do mark_fk ${table} ${column} "${reference}" done # HACK: no PK in the cohort table, so this is needed to satisfy FK checks. mark_pk cohort cohort_definition_id # Populate the DB tables and indices. echo "Creating empty cdm.db database" rm -f cdm.db touch cdm.db sqlite3 cdm.db < ${DDL} sqlite3 cdm.db < OMOPCDM*_indices.sql sqlite3 cdm.db "pragma foreign_key_check" # Success! echo "DONE!"```
wardle commented 5 months ago

This is at an early stage of development- but would be grateful for any feedback - but this might help efforts with SQLite. See https://github.com/wardle/iort - it can now dynamically generate and execute DDL and handle the vocab import and is working for SQLite and PostgreSQL. I need to add customisation to choose different databases in a similar way to schema in PostgreSQL for example, but might help?

barabo commented 5 months ago

For those on this thread, or anyone who lands here. I'm working with a sqlite cdm database for a demo, which uses an experimental new jq module library to quickly map FHIR resources into cdm records.

Demo: https://github.com/barabo/fhir-to-omop-demo fhir-jq: https://github.com/barabo/fhir-jq

Both are still WIP, but this is what motivated me to update the DDL like this.