Build DNA methylation database schema and tables

Build DNA methylation database schema and tables using results generated from https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165. According to @ewafula, the results in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165 are still under preparation.

Following is a brief introduction on how to build the DNA methylation database schema and tables. The implementation of these procedures may require further discussions and modifications, to coordinate with the development of DNA methylation API plot and table endpoints.

Add a new DNA methylation database schema to store all DNA methylation tables, to avoid table name collisions with existing tables in bulk_expression schema. The DNA methylation schema can have any appropriate name, and the following description uses dna_methylation as an example schema name. Following are the steps to add in a new dna_methylation schema:

Add DNA_METHYLATION_SCHEMA=dna_methylation in ../OpenPedCan-api-secrets/common_db.env.
Add ENV DNA_METHYLATION_SCHEMA="dna_methylation" in Dockerfile.

Add the following code in db/init_db.sh:

CREATE SCHEMA ${DNA_METHYLATION_SCHEMA};
GRANT SELECT ON ALL TABLES IN SCHEMA ${DNA_METHYLATION_SCHEMA} TO ${DB_USERNAME};
ALTER DEFAULT PRIVILEGES IN SCHEMA ${DNA_METHYLATION_SCHEMA} GRANT SELECT ON TABLES TO ${DB_USERNAME};

Add DNA_METHYLATION_SCHEMA = "DNA_METHYLATION_SCHEMA" in db/r_interfaces/db_env_vars.R.

Add the following code in db/load_db.sh:

GRANT USAGE ON SCHEMA ${DNA_METHYLATION_SCHEMA} TO ${DB_USERNAME};
GRANT SELECT ON ALL TABLES IN SCHEMA ${DNA_METHYLATION_SCHEMA} TO ${DB_USERNAME};
ALTER DEFAULT PRIVILEGES IN SCHEMA ${DNA_METHYLATION_SCHEMA} GRANT SELECT ON TABLES TO ${DB_USERNAME};

Add --schema="$DNA_METHYLATION_SCHEMA" in db/build_tools/build_db_docker_cmd.sh to include DNA_METHYLATION_SCHEMA in the database dump.

Add new DNA methylation tables to store DNA methylation results that are required to generate plots and tables for MTP. The DNA methylation tables can have any appropriate names, and the following description uses dna_methylation_summary as an example table name. Following are the steps to add a new dna_methylation_summary table:

Add DNA_METHYLATION_SUMMARY_TBL=dna_methylation_summary in ../OpenPedCan-api-secrets/common_db.env.
Add ENV DNA_METHYLATION_SUMMARY_TBL="dna_methylation_summary" in Dockerfile.
Add DNA_METHYLATION_SUMMARY_TBL = "DNA_METHYLATION_SUMMARY_TBL" in db/r_interfaces/db_env_vars.R.

Create an empty table in db/build_tools/build_db.R:

# dna_methylation_summary_tbl is assumed to be the table for loading into the database.
#
# dna_methylation_summary_tbl does not have to be the full table, and it can only have zero or a few lines.
db_write_table(
dplyr::slice(dna_methylation_summary_tbl, 0), conn,
tolower(db_env_vars$DNA_METHYLATION_SCHEMA),
tolower(db_env_vars$DNA_METHYLATION_SUMMARY_TBL))

Prepare a CSV file for loading into the database. The CSV files can either be prepared in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165 or db/build_tools/build_db.R.
- If prepared in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/165:
- Upload the prepared CSV file to a S3 bucket as new OpenPedCan-analysis data release.
- The prepared CSV file will be downloaded to OpenPedCan-analysis/data/ by db/build_db.sh.
- If prepared in db/build_tools/build_db.R:
- Upload a result file to a S3 bucket as new OpenPedCan-analysis data release. The result file can be in any format, which needs further preparation before loading into the database.
- The result file will be downloaded to OpenPedCan-analysis/data/ by db/build_db.sh.
- Load the result file in db/build_tools/build_db.R.
- Prepare the table for database loading, such as filtering, renaming columns, and reordering columns.
- Output the prepared table as CSV file like the procedure in db/build_tools/build_db.R.

Add the following code in db/build_tools/build_db_docker_cmd.sh to load the prepared CSV file to the database:

# Assuming the prepared CSV file is at ${BUILD_OUTPUT_DIR_PATH}/${DNA_METHYLATION_SCHEMA}_${DNA_METHYLATION_SUMMARY_TBL}.csv
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$DB_NAME" <<EOSQL
COPY ${DNA_METHYLATION_SCHEMA}.${DNA_METHYLATION_SUMMARY_TBL}
FROM '${BUILD_OUTPUT_DIR_PATH}/${DNA_METHYLATION_SCHEMA}_${DNA_METHYLATION_SUMMARY_TBL}.csv'
WITH (FORMAT csv, HEADER);
EOSQL

Add the following code in db/build_tools/build_db_docker_cmd.sh to add a table index on the column that will be queried:

echo "CREATE INDEX ensg_id_idx ON ${DNA_METHYLATION_SCHEMA}.${DNA_METHYLATION_SUMMARY_TBL} (\"Gene_Ensembl_ID\");" \
| gzip --no-name -c >> "$db_dump_out_path"

The updated database building procedure can be tested using ./db/build_db.sh.

cc @afarrel @chinwallaa

PediatricOpenTargets / OpenPedCan-api

Build DNA methylation database schema and tables #69