Apweiler data needs to be uploaded to backend expression database

dondi / GRNsight

Web app and service for modeling and visualizing gene regulatory networks.

http://dondi.github.io/GRNsight

BSD 3-Clause "New" or "Revised" License


Apweiler data needs to be uploaded to backend expression database #971

Closed kdahlquist closed 1 year ago

kdahlquist commented 1 year ago

I thought that we had put the Apweiler data that @ahmad00m worked on in our backend database. However, I'm not seeing it on beta.

kdahlquist commented 1 year ago

@ahmad00m gave the data to @Onariaginosa, but it has not been uploaded to the database yet.

dondi commented 1 year ago

This now works as a helpful follow-up to #988: the Apweiler data can be loaded into a local copy first in order to test/debug; then, once validated, the data can be loaded into the production server.

ahmad00m commented 1 year ago

I'm still working on this issue, but I have a logistics question about loading the data into the database.

I was wondering what I should put for the taxon_id, the sample_id, and possibly the dataset.

kdahlquist commented 1 year ago

We don't need to include snoRNA genes in the expression database.
The taxon ID is for the species, so use the same one, "559292", which can be used for both S288c and BY4741.
The taxon ID for the W303 strain of yeast is 580240; W303 is 15% different from S288C, so it warrants its own ID.

kdahlquist commented 1 year ago

Following up, we won't change the taxon IDs for yeast right now because it would require changing the database schema. That part is now referenced in issue #994.

dondi commented 1 year ago

@ahmad00m reported a loading issue, which turned out to be a COPY format divergence.

Upon re-running, a genuine missing gene ID was then found; @ahmad00m will look at it and consult with @kdahlquist as needed.

ahmad00m commented 1 year ago

I investigated the missing gene IDs and found that, although some are genuine mitochondrial genes, not all are. Some of them are involved in the regulation of phospholipid metabolism and other processes. I deleted more than two dozen of these genes, but that doesn't seem to fix the issue, so I'd like to work out what my next step should be in loading this data. Perhaps I could use the genes in our database as a template and remove any ID that doesn't match what we have in the database, but we could discuss this more during our meeting this week.
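The "use the database genes as a template" idea above could be sketched roughly as follows. This is only a minimal illustration, not the script that was used: the function name `split_by_known_genes`, the `gene_id` and `expression` column names, the tab-delimited layout, and the sample values are all assumptions made for the example.

```python
import csv
import io


def split_by_known_genes(expression_rows, known_gene_ids, id_column="gene_id"):
    """Partition rows into those whose gene ID is in the reference set
    and those whose gene ID is not (hypothetical helper)."""
    # Normalize case and whitespace so "ybr020w" matches "YBR020W".
    known = {gene_id.strip().upper() for gene_id in known_gene_ids}
    loadable, skipped = [], []
    for row in expression_rows:
        gene_id = row[id_column].strip().upper()
        if gene_id in known:
            loadable.append(row)
        else:
            skipped.append(row)
    return loadable, skipped


# Illustrative stand-ins for the gene table export and the expression file.
db_genes = ["YAL001C", "YBR020W"]
expression_tsv = (
    "gene_id\texpression\n"
    "YAL001C\t0.12\n"
    "Q0045\t-0.40\n"
    "ybr020w\t1.05\n"
)

rows = list(csv.DictReader(io.StringIO(expression_tsv), delimiter="\t"))
loadable, skipped = split_by_known_genes(rows, db_genes)

print("rows that can be loaded:", [r["gene_id"] for r in loadable])
print("rows set aside for review:", [r["gene_id"] for r in skipped])
```

Keeping the skipped rows in their own list, rather than silently dropping them, leaves a record of which genes still need to be looked at before anything is deleted.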

kdahlquist commented 1 year ago

Can @ahmad00m give examples of the genes he removed?

ahmad00m commented 1 year ago

I exported the gene data from the fall2021 schema (which is used on GRNsight) and used that file as a reference to check whether the genes in the Apweiler data were already in our database. I found that 129 genes in the Apweiler data are not present in our database. However, I need a way to find the standard IDs for these genes so that I can add them to the gene table. I'm thinking I can use the reference gene ID file that I got from YEASTRACT, look up the respective standard IDs there, and create a CSV file that I can later load into the database.
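The comparison and lookup described above might look something like the sketch below. It assumes the gene table export and the reference file have already been read into a list and a dictionary; the function names, the `gene_id` and `standard_id` column names, and the sample values are hypothetical and are not taken from the actual schema or from the YEASTRACT file.

```python
import csv
import io


def find_missing_genes(dataset_ids, database_ids):
    """Return dataset gene IDs absent from the database, in first-seen order."""
    in_database = set(database_ids)
    seen = set()
    missing = []
    for gene_id in dataset_ids:
        if gene_id not in in_database and gene_id not in seen:
            seen.add(gene_id)
            missing.append(gene_id)
    return missing


def resolve_standard_ids(missing_ids, reference):
    """Look up each missing ID in a reference mapping (gene ID -> standard ID).

    IDs with no entry are returned separately so they can be checked by hand,
    for example against SGD.
    """
    resolved, unresolved = [], []
    for gene_id in missing_ids:
        standard_id = reference.get(gene_id)
        if standard_id:
            resolved.append({"gene_id": gene_id, "standard_id": standard_id})
        else:
            unresolved.append(gene_id)
    return resolved, unresolved


def rows_to_csv(rows):
    """Serialize resolved rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["gene_id", "standard_id"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only; real inputs would come from the exported files.
dataset_ids = ["YAL001C", "Q0045", "YXX999W", "Q0045"]
database_ids = ["YAL001C"]
reference = {"YAL001C": "TFC3", "Q0045": "COX1"}

missing = find_missing_genes(dataset_ids, database_ids)
resolved, unresolved = resolve_standard_ids(missing, reference)

print(rows_to_csv(resolved))
print("still need a manual lookup:", unresolved)
```

Reporting the unresolved IDs separately makes it clear which genes the reference file could not cover and still need a manual lookup.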

dondi commented 1 year ago

@Onariaginosa suggests looking up standard IDs in SGD as well

dondi commented 1 year ago

☝🏽@kdahlquist agrees

ahmad00m commented 1 year ago

I finished writing the scripts for the genes that were not in our database and uploaded them to the GRNsight-archive repository. I also finished the documentation for loading data, which can be found HERE. I still have to go over the naming conventions and clean up the code a little later this week or next week.

dondi commented 1 year ago

This just needs a top-down review now, and then finally uploading to our AWS server.

ahmad00m commented 1 year ago

The data is finally on the AWS server. Thanks for all the help from @Onariaginosa!

kdahlquist commented 1 year ago

This is complete.