dondi / GRNsight

Web app and service for modeling and visualizing gene regulatory networks.
http://dondi.github.io/GRNsight
BSD 3-Clause "New" or "Revised" License
17 stars 8 forks source link

Database setup #836

Closed afiller1 closed 4 years ago

afiller1 commented 4 years ago

FINISH BY 2/6/2020

FINISH BY 2/10/2020

GENERAL

afiller1 commented 4 years ago
afiller1 commented 4 years ago

The Amazon RDS Free Tier is available to you for 12 months. Each calendar month, the free tier will allow you to use the Amazon RDS resources listed below for free: 750 hrs of Amazon RDS in a Single-AZ db.t2.micro Instance. 20 GB of General Purpose Storage (SSD). 20 GB for automated backup storage and any user-initiated DB Snapshots.

afiller1 commented 4 years ago
kdahlquist commented 4 years ago

This wiki page has instructions for building a GRNmap workbook: https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_10 so @afiller1 can see the queries performed by the students using GRNmap.

afiller1 commented 4 years ago

Notes from my meeting with Dr. Dahlquist in an attempt to draw out how the database will look: database_layout Contains fields, how the tables correspond to one another, attempts at normalizing, highlighting spots that may require more work to normalize, and ways that we may want to move the data around in the actual database. I'm hoping to stop by @dondi 's office hours on Tuesday to draw out a finalized ERD for how we want this to work, so that I can get to uploading the data to our RDS instance by the middle of the week.

kdahlquist commented 4 years ago

Copied over from #826.

I posted on Box the Microsoft Access database produced by the students in the Biological Databases course last semester, along with the ReadMe's written by the three groups. The ReadMe's may not be so helpful since there weren't clear guidelines for them, but they are there. You should have received an invite to edit to your lion e-mail address.

The idea is that there will be several type of tables:

the latter two are for GRNmap.

There are at least five tasks, as I see it (@dondi will help scope this for your 402)

afiller1 commented 4 years ago

I'm trying to upload data to RDS using a CSV format, because you can copy data into the database using psql using a CSV and I wanted to try it out. A small problem: the RDS documentation says "You can run the \copy command from the psql prompt to import data into a table on a PostgreSQL DB instance. The table must already exist on the DB instance. " The table exists on the DB instance (the skeleton of the table, anyway), but it seems as though I need to find some way to upload the CSV to the instance? I'm not sure how to do that, and I can't find the answer to that exact question... I'm running the command:

\copy refs from 'References.csv' with delimiter ',';

According to the documentation, this is the correct syntax. I've tried several different ways of stating my link to the References.csv file, but I'm not sure how to make it connected to my instance...since I'm running the command while ssh'd into my db instance. Just need a little help with the troubleshooting @dondi ! (Documentation link is here.)

This week, I have also cleaned up my idea for the database design (will attach an image). In the coming week, I will finish uploading a the data to AWS to have a rough draft of the database, and I will start writing APIs so that we can connect the database to GRNsight.

afiller1 commented 4 years ago

database_layout_2

afiller1 commented 4 years ago

database_layout_3

afiller1 commented 4 years ago

I have transposed the data as we outlined last week to produce a rough draft of our expression data table. I am hoping to go over what I have done in respect to this at the next meeting. I will upload the Excel workbooks / csv's that contain the data that I have reconfigured via Box. I have also figured out how to upload data to the RDS instance by uploading csv files to the EC2 instance and then reading it in from there. So, all in all, I did a ton of data manipulation on our expression data to compile it into one sheet, and I have uploaded all of the rough drafts of this data to the RDS instance. In the coming week, I will clean up that data as we want and start working on APIs with the hope of getting at least one API working by next week (with functionality so that the database data is accessible via GRNsight).

kdahlquist commented 4 years ago

What @afiller1 needs from me:

Note that @afiller1 did a lot of manipulation of the data by hand. @kdahlquist still needs to check the data quality and is concerned that this might have to be done again if the data changes. Also, in the future we would add new datasets. Can some type of helper function be written to format the data?

She has questions about datatypes for @dondi.

She wants to go over the APIs she needs to write; what functionality and what methods.

The basic use case would be if a user uploads an unweighted network and wants to get the production rates, degradation rates, and expression data for exporting to a GRNmap excel workbook. The list of genes will be in the json object and then queries will be sent to the database to grab the data for those genes.

Currently there is only one set of degradation rates, and the initial guesses for the production rates, but there could be more in the future. Currently there are 4 expression datasets to choose from (and the Dahlquist set has multiple strains in it).

So we need to ask the user, which expression data they want (drop-down menu?)

kdahlquist commented 4 years ago

There are some yeast standard names with special characters, commas and apostrophes. Currently @afiller1 turned the commas into periods for the csv. Might need to switch to tab-delimited, though.

afiller1 commented 4 years ago

I reformatted the data in the database a bit so that it's more useful in making the queries that will actually be made. Specifically, I added a column with simplified dataset names corresponding to which dataset the data comes from, and I added a time points column that contains actual numbers (15, 30, 60, etc) instead of the long label that came from the initial data column headers (so instead of BY4741_potassium_50mM_LogFC_t40m-4, I just have the timepoints column say 40, which corresponds to the time point of this entry).

dondi commented 4 years ago

Now that the database code is in beta, we can shift to a permanent server that can keep running permanently. @afiller1 will coordinate with Masao to get a new instance up on GRNsight’s final account then point the beta server to that instance.

afiller1 commented 4 years ago

The new RDS instance is connected to beta and running. It contains the most up-to-date data, which I will upload to the GRNsight shared box. I propose closing this issue and opening separate, specific issues if any problems occur with this aspect of my project. @dondi @kdahlquist

kdahlquist commented 4 years ago

Closing!