Update differential gene expression database table using v10 results

logstar commented 2 years ago

The current differential gene expression (DGE) database table was built using v9 DESeq results, which are incompatible with the TPM database table, which was built using v10 TPM data.

The DGE database table needs to be updated using v10 DESeq results to be compatible with the v10 TPM database table, because the TPM boxplots on the sides of the DGE heatmaps are generated using the TPM database table.

cc @taylordm @chinwallaa @afarrel

logstar commented 2 years ago

Following is an update on this issue:

I have updated the database using v10 differential gene expression (DGE) data shared by @sangeetashukla.
I have also fixed some issues in the DGE heatmap plotting functions:
- I fixed a database error caused by the R DBI package in commit https://github.com/PediatricOpenTargets/OpenPedCan-api/pull/63/commits/0cddead6e7e1afd4eba7248b8190fb4e304ff130. The query time was also reduced from about 8 seconds to about 4 seconds.
- I subset the top genes so that they fit in the plot when the queried EFO ID is mapped to two or more cancer groups, in commit https://github.com/PediatricOpenTargets/OpenPedCan-api/pull/63/commits/bc0853db0111d7eb062bf55acce121435bc21c5b.

I am currently optimizing the v10 DGE database table. Some top DGE heatmaps take about 20 seconds to generate. I am looking into the time-consuming step and trying new database indexes. Progress is tracked in the v10-dge branch.

Let me know if you have any questions or suggestions.

cc @taylordm @chinwallaa @afarrel

logstar commented 2 years ago

The response times for top DGE heatmaps have been reduced to about 6 seconds, after I removed an EFO index in the TPM database table in commit https://github.com/PediatricOpenTargets/OpenPedCan-api/commit/f3de1d2a3d514dc4a7f621eaf4cc83742978b098.

[Image: v10 database request response times]

It is counter-intuitive that the extra EFO index significantly slowed the TPM queries for generating top DGE heatmaps, as EFO ID is also used in the following query:

https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/299265d52db0d94dc8f69be0ecfa55d6146c030b/src/get_one_efo_top_ensg_diff_exp_heatmap.R#L186-L195

Therefore, in future database development, query performance should be tested before any extra indexes are added.
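A minimal sketch of what such a before/after index test could look like. It uses Python's built-in `sqlite3` module with a made-up in-memory table, not the API's PostgreSQL database; the table name `tpm`, its columns, the index name `tpm_efo_idx`, and the helper functions are all hypothetical. On PostgreSQL itself, running the real query under `EXPLAIN ANALYZE` before and after creating the index would be the more direct check.

```python
import sqlite3
import time


def time_query(conn, sql, params, repeats=5):
    """Run one query several times; return (rows, best wall-clock seconds)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return rows, best


def compare_index(n_rows=20000):
    """Time the same query without and then with an index on efo_id."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for a TPM table; not the real schema.
    conn.execute("CREATE TABLE tpm (efo_id TEXT, ensg_id TEXT, tpm REAL)")
    conn.executemany(
        "INSERT INTO tpm VALUES (?, ?, ?)",
        (
            (f"EFO_{i % 50}", f"ENSG{i % 1000}", float(i % 97))
            for i in range(n_rows)
        ),
    )
    sql = "SELECT ensg_id, tpm FROM tpm WHERE efo_id = ? ORDER BY ensg_id, tpm"
    params = ("EFO_7",)

    # Baseline timing without the candidate index.
    rows_before, secs_before = time_query(conn, sql, params)

    # Add the candidate index, then time the identical query again.
    conn.execute("CREATE INDEX tpm_efo_idx ON tpm (efo_id)")
    rows_after, secs_after = time_query(conn, sql, params)

    conn.close()
    return rows_before, rows_after, secs_before, secs_after


if __name__ == "__main__":
    before, after, t_before, t_after = compare_index()
    # The index must not change the result, only (possibly) the timing.
    print("same rows:", before == after)
    print(f"without index: {t_before:.6f} s, with index: {t_after:.6f} s")
```

The index is only worth keeping if the timing improves on the queries the API actually runs; timings on a toy table like this one say nothing about the production tables.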

I will be working on the following items:

- Update README.md and changelog.md for v10 DGE database.
- Open a pull request.
- Update remote database servers using v10 DGE database.

Let me know if you have questions or suggestions.

cc @taylordm @chinwallaa @afarrel

chinwallaa commented 2 years ago

@logstar That's great. Also, I'm not sure why removing a sorted index would reduce query time? Determining which fields to index, how to handle null values, and how many indexes to create is a whole area of DB management/optimization, especially at the scale we are dealing with. :) Shipping may have some DB optimization experience, so we could also check with him if needed. One advantage of using BigQuery as a backend DB vs. PostgreSQL is that it supposedly removes the need for indexing? We could also check with the FNL team to see if they have DB SMEs who can help with optimization.

logstar commented 2 years ago

> @logstar That's great. Also, I'm not sure why removing a sorted index would reduce query time? Determining which fields to index, how to handle null values, and how many indexes to create is a whole area of DB management/optimization, especially at the scale we are dealing with. :) Shipping may have some DB optimization experience, so we could also check with him if needed. One advantage of using BigQuery as a backend DB vs. PostgreSQL is that it supposedly removes the need for indexing? We could also check with the FNL team to see if they have DB SMEs who can help with optimization.

@chinwallaa Thank you for the suggestions. I agree that the optimization requires specialized knowledge and experience.

I think we are currently good with the PostgreSQL database performance. Each plot takes < 10 seconds to generate, for all plotting endpoints. Therefore, I think we do not currently need Shipping to work on optimizing the API database.

The API is also a stop-gap for a more optimal solution that uses BigQuery as the database backend and JavaScript as the plotting frontend.

When the API reaches its performance limit, we can discuss other options and coordinate with the FNL team.

cc @taylordm @afarrel

PediatricOpenTargets / OpenPedCan-api

Update differential gene expression database table using v10 results #64