PediatricOpenTargets / OpenPedCan-api

Build bulk differential gene expression database tables #59

Closed. logstar closed this issue 1 year ago

logstar commented 2 years ago

Build bulk differential gene expression (DGE) database tables to provide data for the DGE heatmap and table endpoints described in #37.

OpenPedCan-analysis tumor-normal-differential-expression module output DESeq_Results_V9_v2.RDS is used to build the bulk DGE database tables. The module is currently under review at https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/114. The module will be updated to use OpenPedCan-analysis v10 data and add relapse tumor samples, as described in https://github.com/PediatricOpenTargets/ticket-tracker/issues/260.

The following two bulk DGE database tables may need to be built:

The top-gene table may be necessary because querying all genes for one EFO ID is too slow: it takes about 2 minutes, because the query result has several million rows.

Another option for speeding up top-gene queries, suggested by @afarrel, is to add a gene ranking column to the all-gene table and a conditional clause to the SQL query that selects the top genes. The query speed of this option is hard to predict, because all genes in the table still have to be scanned. However, querying genes with log2 fold change >= 10 for one EFO ID is very fast, taking about 2 seconds. Therefore, I will try implementing this option first.
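A minimal sketch of the ranking-column option, written in Python against an in-memory SQLite database rather than the project's actual database. The table name `bulk_dge_all_gene`, its columns, the helper `query_top_genes`, and the toy IDs are all hypothetical and only illustrate the shape of the query. The index is an extra assumption of mine, not part of the suggestion, and whether it helps at the scale of the real table is untested.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bulk_dge_all_gene (
        efo_id TEXT NOT NULL,
        ensg_id TEXT NOT NULL,
        log2_fold_change REAL NOT NULL,
        gene_rank INTEGER NOT NULL  -- 1 = top-ranked gene within the EFO ID
    )
    """
)

# Toy rows with made-up IDs and values.
rows = [
    ("EFO_A", "ENSG_1", 12.0, 1),
    ("EFO_A", "ENSG_2", 8.5, 2),
    ("EFO_A", "ENSG_3", 0.3, 3),
    ("EFO_B", "ENSG_1", 1.1, 1),
]
conn.executemany("INSERT INTO bulk_dge_all_gene VALUES (?, ?, ?, ?)", rows)

# Assumption: an index on (efo_id, gene_rank) may let the database skip
# rows that cannot match, instead of scanning every gene.
conn.execute(
    "CREATE INDEX idx_efo_rank ON bulk_dge_all_gene (efo_id, gene_rank)"
)


def query_top_genes(conn, efo_id, n_top):
    """Return the n_top best-ranked genes for one EFO ID."""
    cur = conn.execute(
        """
        SELECT ensg_id, log2_fold_change, gene_rank
        FROM bulk_dge_all_gene
        WHERE efo_id = ? AND gene_rank <= ?
        ORDER BY gene_rank
        """,
        (efo_id, n_top),
    )
    return cur.fetchall()


top_genes = query_top_genes(conn, "EFO_A", 2)
```

The conditional clause `gene_rank <= ?` is what keeps the result small: only the requested number of rows is returned, rather than every gene for the EFO ID.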

cc @taylordm @chinwallaa @afarrel

logstar commented 2 years ago

I have implemented one database table for querying bulk differential gene expression (DGE) results.

I have also implemented two R functions to query the following bulk DGE results:

In one-EFO-top-ENSG bulk DGE query results, genes are ranked by the mean log2 fold change of a (cancer_group, cohort) tuple compared with each GTEx tissue. get_one_efo_top_ensg_diff_exp_tbl can query the top up-regulated, down-regulated, or up-and-down-regulated genes, among either all genes or only PMTL genes. The ranking procedure is implemented with the following code:

https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/c0855eb1691969d63a8a60a154f7ee0bae3d94f4/db/build_tools/build_db.R#L450-L496
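As a rough illustration of the ranking described above, and not a transcription of the linked R code, the sketch below averages the log2 fold changes of each gene across GTEx tissues for a single (cancer_group, cohort) tuple and derives up- and down-regulation ranks from that mean. The function names, the record layout, and the toy values are hypothetical, and tie handling is ignored.

```python
from collections import defaultdict
from statistics import mean

# Toy records for one (cancer_group, cohort) tuple:
# (ensg_id, gtex_tissue, log2_fold_change). All values are made up.
records = [
    ("ENSG_1", "tissue_a", 4.0),
    ("ENSG_1", "tissue_b", 2.0),
    ("ENSG_2", "tissue_a", -3.0),
    ("ENSG_2", "tissue_b", -5.0),
    ("ENSG_3", "tissue_a", 1.0),
    ("ENSG_3", "tissue_b", -1.0),
]


def rank_genes_by_mean_log2fc(records):
    """Rank genes by mean log2 fold change across GTEx tissues."""
    by_gene = defaultdict(list)
    for ensg_id, _tissue, log2fc in records:
        by_gene[ensg_id].append(log2fc)
    mean_lfc = {gene: mean(values) for gene, values in by_gene.items()}

    # Rank 1 is the most up-regulated (or most down-regulated) gene.
    up_order = sorted(mean_lfc, key=lambda g: mean_lfc[g], reverse=True)
    down_order = sorted(mean_lfc, key=lambda g: mean_lfc[g])
    return {
        gene: {
            "mean_log2fc": mean_lfc[gene],
            "up_rank": up_order.index(gene) + 1,
            "down_rank": down_order.index(gene) + 1,
        }
        for gene in mean_lfc
    }


def top_genes(ranks, n_top, direction):
    """Return gene IDs with rank <= n_top in the given direction."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    key = f"{direction}_rank"
    selected = [g for g in ranks if ranks[g][key] <= n_top]
    return sorted(selected, key=lambda g: ranks[g][key])


ranks = rank_genes_by_mean_log2fc(records)
top_up = top_genes(ranks, 1, "up")
top_down = top_genes(ranks, 1, "down")
```

Storing both ranks per gene means an up-and-down-regulated query can be answered by combining the two selections, without recomputing the means.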

The implemented gene ranking procedure is a short-term solution for quickly developing the general framework of the DGE heatmap and table endpoints. It differs from @afarrel's gene ranking procedure, which needs extensive optimization via HPC parallel computation to rank all genes within a feasible amount of time. Without optimization, it takes > 1 day to rank all genes for one (cancer_group, cohort) tuple.

In future releases, a best-practice gene ranking procedure should be implemented in the OpenPedCan-analysis repository, together with a best-practice batch effect correction procedure.

cc @taylordm @chinwallaa @afarrel