Build bulk differential gene expression database tables

Build bulk differential gene expression (DGE) database tables to provide data for the DGE heatmap and table endpoints described in #37.

The output of the OpenPedCan-analysis tumor-normal-differential-expression module, DESeq_Results_V9_v2.RDS, is used to build the bulk DGE database tables. The module is currently under review at https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/114. The module will be updated to use OpenPedCan-analysis v10 data and to add relapse tumor samples, as described in https://github.com/PediatricOpenTargets/ticket-tracker/issues/260.

The following two bulk DGE database tables may need to be built:

- All-gene table. This table includes DESeq2 results for all genes and all EFO IDs. It is used to query one gene ENSG ID across all EFO IDs. The query results can be used to generate DGE heatmaps with cancer_groups as rows and GTEx tissues as columns.
- Top-gene table. This table includes DESeq2 results for the top 50 up-regulated and top 50 down-regulated genes of each cancer_group. It is used to query the top genes of one EFO ID. The query results can be used to generate DGE heatmaps with top genes as rows and GTEx tissues as columns. The top genes will be selected based on @afarrel's procedure for ranking genes.

The top-gene table may be necessary because querying all genes for one EFO ID is too slow: it takes about 2 minutes, because the query result has several million rows.

To speed up top-gene queries, another option, suggested by @afarrel, is to add a gene ranking column to the all-gene table and a conditional clause to the SQL query to select top genes. The query speed of this option is hard to predict, because all genes in the table still have to be scanned. However, querying genes with log2 fold change >= 10 for one EFO ID is very fast, taking about 2 seconds. Therefore, I will try implementing this option first.
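The ranking-column option can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the table name `diff_exp`, the column names, the toy rows, and the composite index are all assumptions made for the example, and SQLite is used only because it ships with Python (the thread does not state which database engine the API uses).

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical all-gene table with a precomputed rank column (1 = top gene).
conn.execute(
    """
    CREATE TABLE diff_exp (
        ensg_id     TEXT NOT NULL,
        efo_id      TEXT NOT NULL,
        gtex_tissue TEXT NOT NULL,
        log2_fc     REAL NOT NULL,
        gene_rank   INTEGER NOT NULL
    )
    """
)

# Assumed optimization: a composite index so the rank filter for one EFO ID
# does not require scanning every gene row.
conn.execute("CREATE INDEX idx_efo_rank ON diff_exp (efo_id, gene_rank)")

# Toy rows with placeholder IDs; real data would come from the DESeq2 results.
toy_rows = [
    ("ENSG_A", "EFO_1", "Brain", 12.0, 1),
    ("ENSG_A", "EFO_1", "Liver", 11.0, 1),
    ("ENSG_B", "EFO_1", "Brain", 5.0, 2),
    ("ENSG_B", "EFO_1", "Liver", 4.0, 2),
    ("ENSG_C", "EFO_1", "Brain", 0.5, 3),
    ("ENSG_C", "EFO_1", "Liver", 0.2, 3),
    ("ENSG_A", "EFO_2", "Brain", 1.0, 1),
]
conn.executemany("INSERT INTO diff_exp VALUES (?, ?, ?, ?, ?)", toy_rows)


def query_top_genes(connection, efo_id, top_n):
    """Return rows for genes ranked within the top `top_n` for one EFO ID."""
    cursor = connection.execute(
        "SELECT ensg_id, gtex_tissue, log2_fc, gene_rank "
        "FROM diff_exp "
        "WHERE efo_id = ? AND gene_rank <= ? "
        "ORDER BY gene_rank, gtex_tissue",
        (efo_id, top_n),
    )
    return cursor.fetchall()


top_rows = query_top_genes(conn, "EFO_1", 2)
```

Whether an index like this removes the need for a separate top-gene table would have to be measured on the real table; the sketch only shows the shape of the conditional clause.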

cc @taylordm @chinwallaa @afarrel

I have implemented one database table for querying bulk differential gene expression (DGE) results.

I have also implemented two R functions to query the following bulk DGE results:

- One-ENSG-all-EFO bulk DGE results are queried using get_one_ensg_all_efo_diff_exp_tbl. Each query takes about 0.1 seconds. The query results are used to generate heatmaps with cancer_groups as rows and GTEx tissues as columns.
- One-EFO-top-ENSG bulk DGE results are queried using get_one_efo_top_ensg_diff_exp_tbl. Each query takes about 0.2-0.6 seconds, depending mainly on the number of cancer_groups and cohorts of the queried EFO ID. The query time probably cannot be reduced further without extensively refactoring the bulk DGE data model. The query results are used to generate heatmaps with genes as rows and GTEx tissues as columns.
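To illustrate how long-format query results map onto a heatmap with cancer_groups as rows and GTEx tissues as columns, here is a small sketch. The record layout `(cancer_group, gtex_tissue, log2_fc)` and the function name `pivot_heatmap` are hypothetical; the actual columns returned by the R query functions are not listed in this thread.

```python
def pivot_heatmap(records):
    """Pivot (row_label, col_label, value) records into a dense matrix.

    `records` must be a list or tuple, because it is traversed more than once.
    Missing (row, column) combinations are filled with None.
    """
    row_labels = sorted({record[0] for record in records})
    col_labels = sorted({record[1] for record in records})
    lookup = {(record[0], record[1]): record[2] for record in records}
    matrix = [
        [lookup.get((row, col)) for col in col_labels]
        for row in row_labels
    ]
    return row_labels, col_labels, matrix


# Toy one-gene result with placeholder labels.
one_gene_records = [
    ("cancer_group_1", "Brain", 2.5),
    ("cancer_group_1", "Liver", -1.0),
    ("cancer_group_2", "Brain", 0.3),
]
rows, cols, matrix = pivot_heatmap(one_gene_records)
```

The same pivot applies to the one-EFO-top-ENSG results by passing gene IDs as the row label instead of cancer_groups.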

In one-EFO-top-ENSG bulk DGE query results, genes are ranked by the mean log2 fold change of a (cancer_group, cohort) tuple compared to each GTEx tissue. get_one_efo_top_ensg_diff_exp_tbl can query the top up-regulated, down-regulated, or up-and-down-regulated genes, among either all genes or only PMTL genes. The ranking procedure is implemented with the following code:

https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/c0855eb1691969d63a8a60a154f7ee0bae3d94f4/db/build_tools/build_db.R#L450-L496
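As a rough illustration of the ranking idea described above (and not a transcription of the linked R code), the sketch below averages log2 fold change over GTEx tissues for each gene within a (cancer_group, cohort) tuple and keeps the top genes. The record layout, the function name, and the `direction` parameter are assumptions for the example, and only the up-regulated and down-regulated cases are covered.

```python
from collections import defaultdict
from statistics import mean


def rank_genes_by_mean_log2_fc(records, top_n, direction="up"):
    """Rank genes per (cancer_group, cohort) by mean log2 fold change.

    `records` holds (ensg_id, cancer_group, cohort, gtex_tissue, log2_fc).
    `direction` is "up" (largest mean first) or "down" (smallest mean first).
    Returns {(cancer_group, cohort): [(ensg_id, mean_log2_fc), ...]}.
    """
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")

    # Collect log2 fold changes across GTEx tissues for each gene and tuple.
    values_by_gene = defaultdict(list)
    for ensg_id, cancer_group, cohort, _tissue, log2_fc in records:
        values_by_gene[(cancer_group, cohort, ensg_id)].append(log2_fc)

    # Average over tissues, grouped by (cancer_group, cohort).
    means_by_tuple = defaultdict(list)
    for (cancer_group, cohort, ensg_id), values in values_by_gene.items():
        means_by_tuple[(cancer_group, cohort)].append((ensg_id, mean(values)))

    # Sort within each tuple and keep the top_n genes.
    ranked = {}
    for key, gene_means in means_by_tuple.items():
        ordered = sorted(
            gene_means,
            key=lambda gene_mean: gene_mean[1],
            reverse=(direction == "up"),
        )
        ranked[key] = ordered[:top_n]
    return ranked


# Toy records with placeholder IDs.
toy_records = [
    ("ENSG_A", "cancer_group_1", "cohort_x", "Brain", 4.0),
    ("ENSG_A", "cancer_group_1", "cohort_x", "Liver", 2.0),
    ("ENSG_B", "cancer_group_1", "cohort_x", "Brain", -3.0),
    ("ENSG_B", "cancer_group_1", "cohort_x", "Liver", -1.0),
    ("ENSG_C", "cancer_group_1", "cohort_x", "Brain", 1.0),
    ("ENSG_C", "cancer_group_1", "cohort_x", "Liver", 0.0),
]
top_up = rank_genes_by_mean_log2_fc(toy_records, top_n=1, direction="up")
top_down = rank_genes_by_mean_log2_fc(toy_records, top_n=1, direction="down")
```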

The implemented gene ranking procedure is a short-term solution for quickly developing the general framework of the DGE heatmap and table endpoints. It differs from @afarrel's gene ranking procedure, because @afarrel's procedure needs extensive optimization via HPC parallel computation to rank all genes within a feasible amount of time. Without optimization, it takes more than 1 day to rank all genes for one (cancer_group, cohort) tuple.

In future releases, a best-practice gene ranking procedure should be implemented in the OpenPedCan-analysis repository, together with a best-practice batch effect correction procedure.

cc @taylordm @chinwallaa @afarrel
