dondi / GRNsight

Web app and service for modeling and visualizing gene regulatory networks.
http://dondi.github.io/GRNsight
BSD 3-Clause "New" or "Revised" License

Implement workbook export with user-chosen expression datasets #938

Closed dondi closed 2 years ago

dondi commented 2 years ago

Implement the user interface and functionality for choosing expression datasets to include in an exported workbook.

This task aligns with #935.

For maximum flexibility, the user will be allowed to choose from any available dataset, regardless of whether they have used or selected it.

However, selections should be restricted to one source at a time, because expression data from multiple sources may not coexist well, especially with GRNmap.
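The one-source-at-a-time rule can be illustrated with a small check. This is only a sketch in Python, not GRNsight code; the function name `single_source` and the representation of a selection as `(source, dataset)` pairs are assumptions made for the example.

```python
from typing import Iterable, Tuple


def single_source(selected: Iterable[Tuple[str, str]]) -> bool:
    """Return True if every selected dataset comes from the same source.

    `selected` holds hypothetical (source, dataset) pairs. An empty
    selection is treated as valid here; that choice is an assumption.
    """
    # Collect the distinct sources; more than one means a mixed selection.
    sources = {source for source, _dataset in selected}
    return len(sources) <= 1
```

A dialog could run a check like this before enabling the export button, or avoid the situation altogether by letting the user pick the source first and then listing only that source's datasets.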

Onariaginosa commented 2 years ago

New Database Schema: IMG_20210930_165010138.jpg, IMG_20210930_165002214.jpg, IMG_20210930_164958650.jpg

Onariaginosa commented 2 years ago

I created the initial functionality of the Export Modal. It looks like this so far: (Screenshot from 2021-10-07 11-03-41)

dondi commented 2 years ago

@Onariaginosa’s transcribed schema was reviewed and some corrections were made; running copy is stored in GRNsight-archive at https://github.com/dondi/GRNsight-archive/blob/main/documents/SDF/CMSI_4071/fall_2021/Onariaginosa%20Igbinedion/Updated%20Expression%20Database%20Schema.pdf

Work on the UI and on the database will continue in parallel.

dondi commented 2 years ago

@Onariaginosa has the export dialog largely done and can move on to querying the database to export selected expression data. To do that, @Onariaginosa will test via beta in order to have access to the AWS managed instance.

Onariaginosa commented 2 years ago

I was able to export workbooks with the specified expression sheets as expected, but then I ran into a new problem. Initially, the exported expression sheets were not recognized on re-import because their names were exported without the suffixes. (Screenshot from 2021-10-21 14-40-21)

I fixed this, but then re-importing the sheets produced an error saying that the expression sheet data did not have the same genes as those in the expression sheet. (Screenshot from 2021-10-21 14-35-22)

I changed it back so that it wouldn't error out, but this still needs to be fixed.

dondi commented 2 years ago

@Onariaginosa and @dondi looked at this and traced the root cause to a missing id header row on the problematic expression data sheet. However, the sheet also didn't have any data; this issue may be related. @Onariaginosa will investigate.

dondi commented 2 years ago

@Onariaginosa fixed the issue: it was a straightforward typo, and fixing it seems to resolve the problem. The typo appears to have resulted in an empty query result, which raises the question of how to make sure that empty query results, which might happen one day for legitimate reasons, still avoid exporting an incompatible workbook.

Onariaginosa commented 2 years ago

I created an invalid database option to query to see if we get a bad response. I then was able to check each response and make sure that they were valid (had both gene data and time point data). I created the Onariaginosa_temporary branch because beta is currently on pug and attempting to switch to the Onariaginosa branch (when ssh-ed into beta) results in errors. I pulled from beta in the temporary branch in order to use pug, and developed on that branch. These changes can be seen in pull request #940.
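The validity check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code in the pull request; the response shape (a mapping with `genes` and `timePoints` keys) and the function names are assumptions made for the example.

```python
from typing import Any, Dict, Mapping


def is_exportable(response: Mapping[str, Any]) -> bool:
    """Return True only if the response has both gene data and time point data.

    The keys "genes" and "timePoints" are hypothetical placeholders for
    whatever the real query response contains.
    """
    genes = response.get("genes")
    time_points = response.get("timePoints")
    # A missing or empty collection counts as invalid, so an empty query
    # result never turns into an expression sheet.
    return bool(genes) and bool(time_points)


def exportable_sheets(
    responses: Mapping[str, Mapping[str, Any]]
) -> Dict[str, Mapping[str, Any]]:
    """Keep only the dataset responses that pass the validity check."""
    return {name: r for name, r in responses.items() if is_exportable(r)}
```

Filtering before the workbook is assembled means a legitimately empty result is simply left out instead of producing a sheet that fails on re-import.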

Onariaginosa commented 2 years ago

As I was doing the preprocessing of the files for the new database schema, I noticed that the degradation rates and the production rates don't have a pubmed_id associated with them, but in our new schema, we add a pubmed_id foreign key and I don't know which foreign key these rates are associated with. Additionally, the Dahlquist data is not in the expression_metadata table, or the refs table because we don't have a pubmed_id associated with it.

kdahlquist commented 2 years ago

Ah. I have not published the Dahlquist Lab data yet, so it won't have a PubMed ID (sigh). It does have a GEO ID. We may need to use the GEO ID instead and allow the PubMed ID to be null in the case of the data from my lab. I'm pretty sure I can pull GEO IDs for the other datasets as well. The degradation rates have a PubMed ID, I just need to find it. The production rates do not because they were calculated using the degradation rates. I'm not sure what to do about that.

Onariaginosa commented 2 years ago

Kk, I'll hold off on uploading the production rates and degradation rates to the database, and I'll change the primary key of refs (and all associated foreign keys) from pubmed_id to ncbi_geo_id, and post an updated schema for confirmation.

kdahlquist commented 2 years ago

Here is the citation for the degradation rates:

PMID: 25161313

Neymotin, B., Athanasiadou R., and Gresham D. (2014). Determination of in vivo RNA kinetics using RATE-seq. RNA, 20, 1645-1652. doi: 10.1261/rna.045104.114

Unfortunately, there is not a GEO ID for this paper/dataset. Let's talk about it at the meeting.

kdahlquist commented 2 years ago

Barreto GEO ID: GSE24712
Kitagawa GEO ID: GSE9336
Thorsen GEO ID: GSE6068

kdahlquist commented 2 years ago

Dahlquist lab data: GEO ID: GSE83656

Onariaginosa commented 2 years ago

20211028_162128.jpg

dondi commented 2 years ago

@Onariaginosa will apply the latest revisions to the schema on her branch and when ready we will explore ways to apply these changes to our live AWS database server.

Onariaginosa commented 2 years ago

I added the fall2021 database to the AWS database. On the Onariaginosa_temporary branch, I changed the query to match our current internal database structure, but I had trouble accessing the results of that query, or even checking what the query returns.

dondi commented 2 years ago

@Onariaginosa will continue to learn some Sequelize to work out how to access query results and data; she will also provide a sample of the expression metadata table so we can populate that table with appropriate values for the Dahlquist data.

Onariaginosa commented 2 years ago

Needed for the expression_metadata table: replicate_index | ncbi_geo_id | pubmed_id | control_yeast_strain | treatment_yeast_strain | control | treatment | concentration_value | concentration_unit | time_value | time_unit | number_of_replicates | expression_table

Onariaginosa commented 2 years ago

 replicate_index | ncbi_geo_id | pubmed_id | control_yeast_strain | treatment_yeast_strain | control      | treatment | concentration_value | concentration_unit | time_value | time_unit | number_of_replicates | expression_table
-----------------+-------------+-----------+----------------------+------------------------+--------------+-----------+---------------------+--------------------+------------+-----------+----------------------+---------------------------------
               1 | GSE9336     | 12269742  | S288C                | S288C                  | No Thiuram   | Thiuram   |                  75 | uM                 |         15 | m         |                    3 | Kitagawa_2002_log2_expression
               2 | GSE9336     | 12269742  | S288C                | S288C                  | No Thiuram   | Thiuram   |                  75 | uM                 |         30 | m         |                    3 | Kitagawa_2002_log2_expression
               3 | GSE9336     | 12269742  | S288C                | S288C                  | No Thiuram   | Thiuram   |                  75 | uM                 |        120 | m         |                    3 | Kitagawa_2002_log2_expression
               1 | GSE6129     | 17327492  | W303-1A              | W303-1A                | No Arsenite  | Arsenite  |                   1 | mM                 |         60 | m         |                    6 | Thorsen_2007_log2_expression
               2 | GSE6129     | 17327492  | W303-1A              | W303-1A                | No Arsenite  | Arsenite  |                   1 | mM                 |         15 | m         |                    3 | Thorsen_2007_log2_expression
               3 | GSE6129     | 17327492  | W303-1A              | W303-1A                | No Arsenite  | Arsenite  |                   1 | mM                 |         30 | m         |                    3 | Thorsen_2007_log2_expression
               4 | GSE6129     | 17327492  | W303-1A              | W303-1A                | No Arsenite  | Arsenite  |                   1 | mM                 |       1080 | m         |                    3 | Thorsen_2007_log2_expression
               1 | GSE24712    | 23039231  | BY4741               | BY4741                 | No Potassium | Potassium |                  50 | mM                 |         10 | m         |                    2 | Barreto_2012_wt_log2_expression
               2 | GSE24712    | 23039231  | BY4741               | BY4741                 | No Potassium | Potassium |                  50 | mM                 |         20 | m         |                    4 | Barreto_2012_wt_log2_expression
               3 | GSE24712    | 23039231  | BY4741               | BY4741                 | No Potassium | Potassium |                  50 | mM                 |         40 | m         |                    4 | Barreto_2012_wt_log2_expression
               4 | GSE24712    | 23039231  | BY4741               | BY4741                 | No Potassium | Potassium |                  50 | mM                 |         60 | m         |                    4 | Barreto_2012_wt_log2_expression
               5 | GSE24712    | 23039231  | BY4741               | BY4741                 | No Potassium | Potassium |                  50 | mM                 |        120 | m         |                    4 | Barreto_2012_wt_log2_expression

Onariaginosa commented 2 years ago

Everything works now. We query the fall2021 database and the results are the same as when we query in the master branch. Exported results are as expected. I initially found a bug where the gene mappings were incorrect, but when I looked at the source data, the Thorsen_2007 data had incorrect gene names. To work around this, I collected the gene data from all of the other sources first and the Thorsen data last, so that I would get the correct mappings from the earlier sources. This worked properly, and now we have the same database functionality that we did in the spring 2020 Expression Database. The pull request for the export updated automatically, so it is now valid once more.
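The ordering trick described above amounts to a merge in which the first source to supply a mapping for a gene wins. The following Python sketch illustrates that idea only; it is not the actual preprocessing script, and the function name and the `(gene_id, display_name)` pair format are assumptions.

```python
from typing import Dict, Iterable, Tuple

GenePairs = Iterable[Tuple[str, str]]


def merge_gene_mappings(sources: Iterable[GenePairs]) -> Dict[str, str]:
    """Merge (gene_id, display_name) pairs; earlier sources take precedence."""
    merged: Dict[str, str] = {}
    for pairs in sources:
        for gene_id, display_name in pairs:
            # setdefault keeps an existing entry, so a later source can add
            # new genes but cannot overwrite a mapping that is already set.
            merged.setdefault(gene_id, display_name)
    return merged


# Placeholder data: the less trusted source is passed last.
trusted = [("ID_1", "NAME_A")]
less_trusted = [("ID_1", "WRONG_NAME"), ("ID_2", "NAME_B")]
print(merge_gene_mappings([trusted, less_trusted]))
# {'ID_1': 'NAME_A', 'ID_2': 'NAME_B'}
```

This only masks incorrect names for genes that also appear in an earlier source; genes that occur only in the last source keep whatever name that source gives them.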

Onariaginosa commented 2 years ago

To ensure that the expression data in the database is correct, I will check the original Thorsen data from the BioDB class data to track down where the data glitch came from.

Onariaginosa commented 2 years ago

The glitch appeared in the original Thorsen data. (Screenshot from 2021-12-02 09-41-25)
YJR139C -> HOM6
YNL015W -> PBI2

dondi commented 2 years ago

“That’s very done” — @Onariaginosa