Got AWS Educate account going. I have $75 in credits, and no instances running yet.
Looks like anything we’re dealing with will only be about $0.10 an hour, at a high estimate.
What engine do we want to use?
Amazon Aurora is a MySQL and PostgreSQL compatible enterprise-class database, starting at <$1/day.
Up to 5 times the throughput of MySQL and 3 times the throughput of PostgreSQL
Up to 64TB of auto-scaling SSD storage
6-way replication across three Availability Zones
Up to 15 Read Replicas with sub-10ms replica lag
Automatic monitoring and failover in less than 30 seconds
MySQL is the most popular open source database in the world. MySQL on RDS offers the rich features of the MySQL community edition with the flexibility to easily scale compute resources or storage capacity for your database.
Supports database size up to 64 TiB.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
Supports automated backup and point-in-time recovery.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.
MariaDB Community Edition is a MySQL-compatible database with strong support from the open source community, plus extra features and performance optimizations.
Supports database size up to 64 TiB.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
Supports automated backup and point-in-time recovery.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.
Supports global transaction ID (GTID) and thread pooling.
Developed and supported by the MariaDB open source community.
PostgreSQL is a powerful, open-source object-relational database system with a strong reputation for reliability, stability, and correctness.
High reliability and stability in a variety of workloads.
Advanced features to perform in high-volume environments.
Vibrant open-source community that releases new features multiple times per year.
Supports multiple extensions that add even more functionality to the database.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
The most Oracle-compatible open-source database.
Oracle® Database is a relational database management system developed by Oracle. Amazon RDS makes it easy to set up, operate, and scale Oracle Database deployments in the cloud. With Amazon RDS, you can deploy multiple editions of Oracle Database in minutes with cost-efficient and re-sizable hardware capacity.
SQL Server is a relational database management system developed by Microsoft. Amazon RDS for SQL Server makes it easy to set up, operate, and scale SQL Server deployments in the cloud. With Amazon RDS, you can deploy multiple editions of SQL Server (2012, 2014, 2016, and 2017) including Express, Web, Standard and Enterprise, in minutes with cost-efficient and re-sizable compute capacity.
Further parameters will depend on what we select
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SettingUp.html
RDS will be more useful than DynamoDB in our scenario
https://www.sumologic.com/blog/migrating-aws-database-services/
The Amazon RDS Free Tier is available to you for 12 months. Each calendar month, the free tier will allow you to use the Amazon RDS resources listed below for free:
750 hrs of Amazon RDS in a Single-AZ db.t2.micro Instance.
20 GB of General Purpose Storage (SSD).
20 GB for automated backup storage and any user-initiated DB Snapshots.
This wiki page has instructions for building a GRNmap workbook: https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_10 so @afiller1 can see the queries performed by the students using GRNmap.
Notes from my meeting with Dr. Dahlquist, in an attempt to draw out how the database will look: the fields it contains, how the tables correspond to one another, attempts at normalizing, spots that may require more work to normalize, and ways we may want to move the data around in the actual database. I'm hoping to stop by @dondi's office hours on Tuesday to draw out a finalized ERD for how we want this to work, so that I can get to uploading the data to our RDS instance by the middle of the week.
Copied over from #826.
I posted on Box the Microsoft Access database produced by the students in the Biological Databases course last semester, along with the ReadMes written by the three groups. The ReadMes may not be so helpful since there weren't clear guidelines for them, but they are there. You should have received an invite to edit at your lion e-mail address.
The idea is that there will be several types of tables:
the latter two are for GRNmap.
There are at least five tasks, as I see it (@dondi will help scope this for your 402)
I'm trying to upload data to RDS in CSV format, because psql can copy CSV data into the database and I wanted to try it out. A small problem: the RDS documentation says "You can run the \copy command from the psql prompt to import data into a table on a PostgreSQL DB instance. The table must already exist on the DB instance." The table exists on the DB instance (the skeleton of the table, anyway), but it seems as though I need to find some way to upload the CSV to the instance? I'm not sure how to do that, and I can't find the answer to that exact question... I'm running the command:
\copy refs from 'References.csv' with delimiter ',';
According to the documentation, this is the correct syntax. I've tried several different ways of specifying the path to the References.csv file, but I'm not sure how to make the file reachable, since I'm running the command while ssh'd into the EC2 instance that connects to the DB instance. Just need a little help with the troubleshooting, @dondi! (Documentation link is here.)
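For reference: \copy reads the file from whatever machine psql itself is running on, so the CSV has to exist on that machine (e.g., scp'd to the EC2 host first). Also, adding csv to the \copy options makes Postgres parse quoted fields, which would handle embedded commas without rewriting them. An alternative that avoids staging the file at all is to stream it over the connection with COPY FROM STDIN; here is a rough sketch using psycopg2, where the endpoint, database name, and credentials are placeholders, not our real settings:

```python
# Rough sketch: stream a local CSV into the existing "refs" table on RDS
# without first copying the file to the EC2 host. Endpoint, database,
# user, and password below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="our-db.abc123xyz.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="grnsight_db",   # placeholder
    user="postgres",        # placeholder
    password="...",         # placeholder
)

with conn, conn.cursor() as cur, open("References.csv") as f:
    # COPY ... FROM STDIN reads rows from the client over the connection,
    # so References.csv only needs to exist where this script runs.
    cur.copy_expert("COPY refs FROM STDIN WITH (FORMAT csv)", f)
```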
This week, I have also cleaned up my idea for the database design (will attach an image). In the coming week, I will finish uploading the data to AWS to have a rough draft of the database, and I will start writing APIs so that we can connect the database to GRNsight.
I have transposed the data as we outlined last week to produce a rough draft of our expression data table. I am hoping to go over what I have done with respect to this at the next meeting. I will upload the Excel workbooks / CSVs that contain the data I have reconfigured via Box. I have also figured out how to upload data to the RDS instance: by uploading CSV files to the EC2 instance and then reading them in from there. So, all in all, I did a ton of data manipulation on our expression data to compile it into one sheet, and I have uploaded all of the rough drafts of this data to the RDS instance. In the coming week, I will clean up that data and start working on APIs, with the hope of getting at least one working by next week (so that the database data is accessible via GRNsight).
What @afiller1 needs from me:
Note that @afiller1 did a lot of manipulation of the data by hand. @kdahlquist still needs to check the data quality and is concerned that this might have to be done again if the data changes. Also, in the future we would add new datasets. Can some type of helper function be written to format the data? (A rough sketch follows after this list.)
She has questions about datatypes for @dondi.
She wants to go over the APIs she needs to write; what functionality and what methods.
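As flagged above, here is a rough sketch of what such a formatting helper could look like, assuming pandas and a wide sheet with one gene column plus one column per dataset/timepoint label; the actual workbook layout may differ:

```python
# Hypothetical helper for formatting future datasets: reshape a wide
# expression sheet (one "gene" column plus one column per long
# dataset/timepoint label) into long (gene, label, value) rows.
# Column names and the input file are assumptions, not the real workbooks.
import pandas as pd

def reshape_expression_sheet(path: str) -> pd.DataFrame:
    wide = pd.read_csv(path)
    # Treat every column except the gene identifier as a measurement.
    return wide.melt(id_vars=["gene"], var_name="label", value_name="log_fc")

# Usage: long_rows = reshape_expression_sheet("expression.csv")
```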
The basic use case would be a user uploading an unweighted network and wanting to get the production rates, degradation rates, and expression data for export to a GRNmap Excel workbook. The list of genes will be in the JSON object, and queries will then be sent to the database to grab the data for those genes.
Currently there is only one set of degradation rates and one set of initial guesses for the production rates, but there could be more in the future. Currently there are 4 expression datasets to choose from (and the Dahlquist set has multiple strains in it).
So we need to ask the user which expression dataset they want (drop-down menu?).
There are some yeast standard names with special characters: commas and apostrophes. Currently @afiller1 has turned the commas into periods for the CSV. We might need to switch to tab-delimited, though.
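A rough sketch of the query behind that use case; the table and column names are guesses at our schema, and conn is assumed to be an open psycopg2 connection:

```python
# Rough sketch of the use-case query: given the gene list from the
# uploaded network's JSON and the dataset the user picked, fetch the
# matching expression rows. Table and column names are assumptions.
def fetch_expression(conn, genes, dataset):
    with conn.cursor() as cur:
        # psycopg2 expands a tuple parameter into a parenthesized IN list.
        cur.execute(
            "SELECT gene, timepoint, log_fc FROM expression "
            "WHERE dataset = %s AND gene IN %s",
            (dataset, tuple(genes)),
        )
        return cur.fetchall()
```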
I reformatted the data in the database a bit so that it's more useful for the queries that will actually be made. Specifically, I added a column with simplified dataset names indicating which dataset each row comes from, and I added a time points column that contains actual numbers (15, 30, 60, etc.) instead of the long label that came from the initial data column headers (so instead of BY4741_potassium_50mM_LogFC_t40m-4, the time points column just says 40, the time point of that entry).
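That timepoint extraction could be automated for future datasets; here is a sketch where the _t<minutes>m pattern is inferred from the one label quoted above:

```python
# Sketch: pull the numeric timepoint (minutes) out of a long column label
# like "BY4741_potassium_50mM_LogFC_t40m-4". The "_t<minutes>m" pattern
# is inferred from this one example and may not hold for every dataset.
import re

def timepoint_from_label(label: str) -> int:
    match = re.search(r"_t(\d+)m", label)
    if match is None:
        raise ValueError(f"no timepoint found in label: {label}")
    return int(match.group(1))

assert timepoint_from_label("BY4741_potassium_50mM_LogFC_t40m-4") == 40
```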
Now that the database code is in beta, we can shift to a server that will keep running permanently. @afiller1 will coordinate with Masao to get a new instance up on GRNsight’s final account, then point the beta server to that instance.
The new RDS instance is connected to beta and running. It contains the most up-to-date data, which I will upload to the GRNsight shared Box. I propose closing this issue and opening separate, specific issues if any problems occur with this aspect of my project. @dondi @kdahlquist
Closing!