Got AWS Educate account going. I have $75 in credits, and no instances running yet.
Looks like anything we’re dealing with will only be about $0.10 an hour, at a high estimate.
What engine do we want to use?
Amazon Aurora is a MySQL and PostgreSQL compatible enterprise-class database, starting at <$1/day.
Up to 5 times the throughput of MySQL and 3 times the throughput of PostgreSQL
Up to 64TB of auto-scaling SSD storage
6-way replication across three Availability Zones
Up to 15 Read Replicas with sub-10ms replica lag
Automatic monitoring and failover in less than 30 seconds
MySQL is the most popular open source database in the world. MySQL on RDS offers the rich features of the MySQL community edition with the flexibility to easily scale compute resources or storage capacity for your database.
Supports database size up to 64 TiB.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
Supports automated backup and point-in-time recovery.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.
MariaDB Community Edition is a MySQL-compatible database with strong support from the open source community, plus extra features and performance optimizations.
Supports database size up to 64 TiB.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
Supports automated backup and point-in-time recovery.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.
Supports global transaction ID (GTID) and thread pooling.
Developed and supported by the MariaDB open source community.
PostgreSQL is a powerful, open-source object-relational database system with a strong reputation for reliability, stability, and correctness.
High reliability and stability in a variety of workloads.
Advanced features to perform in high-volume environments.
Vibrant open-source community that releases new features multiple times per year.
Supports multiple extensions that add even more functionality to the database.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
The most Oracle-compatible open-source database.
Oracle® Database is a relational database management system developed by Oracle. Amazon RDS makes it easy to set up, operate, and scale Oracle Database deployments in the cloud. With Amazon RDS, you can deploy multiple editions of Oracle Database in minutes with cost-efficient and re-sizable hardware capacity.
SQL Server is a relational database management system developed by Microsoft. Amazon RDS for SQL Server makes it easy to set up, operate, and scale SQL Server deployments in the cloud. With Amazon RDS, you can deploy multiple editions of SQL Server (2012, 2014, 2016, and 2017) including Express, Web, Standard and Enterprise, in minutes with cost-efficient and re-sizable compute capacity.
Further parameters will depend on what we select
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SettingUp.html
RDS will be more useful than DynamoDB in our scenario
https://www.sumologic.com/blog/migrating-aws-database-services/
The Amazon RDS Free Tier is available to you for 12 months. Each calendar month, the free tier will allow you to use the Amazon RDS resources listed below for free:
750 hrs of Amazon RDS in a Single-AZ db.t2.micro Instance.
20 GB of General Purpose Storage (SSD).
20 GB for automated backup storage and any user-initiated DB Snapshots.
This wiki page has instructions for building a GRNmap workbook: https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_10 so @afiller1 can see the queries performed by the students using GRNmap.
Notes from my meeting with Dr. Dahlquist, in an attempt to draw out how the database will look: the fields it contains, how the tables correspond to one another, attempts at normalizing, spots that may require more work to normalize, and ways we may want to move the data around in the actual database. I'm hoping to stop by @dondi's office hours on Tuesday to draw out a finalized ERD for how we want this to work, so that I can get to uploading the data to our RDS instance by the middle of the week.
Copied over from #826.
I posted on Box the Microsoft Access database produced by the students in the Biological Databases course last semester, along with the ReadMes written by the three groups. The ReadMes may not be so helpful since there weren't clear guidelines for them, but they are there. You should have received an invite to edit at your lion e-mail address.
The idea is that there will be several types of tables:
the latter two are for GRNmap.
There are at least five tasks, as I see it (@dondi will help scope this for your 402)
I'm trying to upload data to RDS in CSV format, because psql can copy CSV data into the database and I wanted to try it out. A small problem: the RDS documentation says "You can run the \copy command from the psql prompt to import data into a table on a PostgreSQL DB instance. The table must already exist on the DB instance." The table exists on the DB instance (the skeleton of the table, anyway), but it seems as though I need to find some way to upload the CSV to the instance? I'm not sure how to do that, and I can't find the answer to that exact question... I'm running the command:
\copy refs from 'References.csv' with delimiter ',';
According to the documentation, this is the correct syntax. I've tried several different ways of specifying the path to the References.csv file, but I'm not sure how to make the file reachable, since I'm running the command while ssh'd into the EC2 instance that connects to the DB instance. Just need a little help with the troubleshooting, @dondi! (Documentation link is here.)
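For reference: \copy reads the file from whatever machine psql itself is running on, so the CSV has to exist on that machine (e.g., scp'd to the EC2 host first). Also, adding csv to the \copy options makes Postgres parse quoted fields, which would handle embedded commas without rewriting them. An alternative that avoids staging the file at all is to stream it over the connection with COPY FROM STDIN; here is a rough sketch using psycopg2, where the endpoint, database name, and credentials are placeholders, not our real settings:

```python
# Rough sketch: stream a local CSV into the existing "refs" table on RDS
# without first copying the file to the EC2 host. Endpoint, database,
# user, and password below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="our-db.abc123xyz.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="grnsight_db",   # placeholder
    user="postgres",        # placeholder
    password="...",         # placeholder
)

with conn, conn.cursor() as cur, open("References.csv") as f:
    # COPY ... FROM STDIN reads rows from the client over the connection,
    # so References.csv only needs to exist where this script runs.
    cur.copy_expert("COPY refs FROM STDIN WITH (FORMAT csv)", f)
```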
This week, I have also cleaned up my idea for the database design (will attach an image). In the coming week, I will finish uploading the data to AWS to have a rough draft of the database, and I will start writing APIs so that we can connect the database to GRNsight.
I have transposed the data as we outlined last week to produce a rough draft of our expression data table. I am hoping to go over what I have done with respect to this at the next meeting. I will upload the Excel workbooks / CSVs that contain the data I have reconfigured via Box. I have also figured out how to upload data to the RDS instance: by uploading CSV files to the EC2 instance and then reading them in from there. So, all in all, I did a ton of data manipulation on our expression data to compile it into one sheet, and I have uploaded all of the rough drafts of this data to the RDS instance. In the coming week, I will clean up that data and start working on APIs, with the hope of getting at least one working by next week (so that the database data is accessible via GRNsight).
What @afiller1 needs from me:
Note that @afiller1 did a lot of manipulation of the data by hand. @kdahlquist still needs to check the data quality and is concerned that this might have to be done again if the data changes. Also, in the future we would add new datasets. Can some type of helper function be written to format the data? (A rough sketch follows after this list.)
She has questions about datatypes for @dondi.
She wants to go over the APIs she needs to write; what functionality and what methods.
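As flagged above, here is a rough sketch of what such a formatting helper could look like, assuming pandas and a wide sheet with one gene column plus one column per dataset/timepoint label; the actual workbook layout may differ:

```python
# Hypothetical helper for formatting future datasets: reshape a wide
# expression sheet (one "gene" column plus one column per long
# dataset/timepoint label) into long (gene, label, value) rows.
# Column names and the input file are assumptions, not the real workbooks.
import pandas as pd

def reshape_expression_sheet(path: str) -> pd.DataFrame:
    wide = pd.read_csv(path)
    # Treat every column except the gene identifier as a measurement.
    return wide.melt(id_vars=["gene"], var_name="label", value_name="log_fc")

# Usage: long_rows = reshape_expression_sheet("expression.csv")
```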
The basic use case would be a user uploading an unweighted network and wanting to get the production rates, degradation rates, and expression data for export to a GRNmap Excel workbook. The list of genes will be in the JSON object, and queries will then be sent to the database to grab the data for those genes.
Currently there is only one set of degradation rates and one set of initial guesses for the production rates, but there could be more in the future. Currently there are 4 expression datasets to choose from (and the Dahlquist set has multiple strains in it).
So we need to ask the user which expression dataset they want (drop-down menu?).
There are some yeast standard names with special characters: commas and apostrophes. Currently @afiller1 has turned the commas into periods for the CSV. We might need to switch to tab-delimited, though.
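A rough sketch of the query behind that use case; the table and column names are guesses at our schema, and conn is assumed to be an open psycopg2 connection:

```python
# Rough sketch of the use-case query: given the gene list from the
# uploaded network's JSON and the dataset the user picked, fetch the
# matching expression rows. Table and column names are assumptions.
def fetch_expression(conn, genes, dataset):
    with conn.cursor() as cur:
        # psycopg2 expands a tuple parameter into a parenthesized IN list.
        cur.execute(
            "SELECT gene, timepoint, log_fc FROM expression "
            "WHERE dataset = %s AND gene IN %s",
            (dataset, tuple(genes)),
        )
        return cur.fetchall()
```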
I reformatted the data in the database a bit so that it's more useful for the queries that will actually be made. Specifically, I added a column with simplified dataset names indicating which dataset each row comes from, and I added a time points column that contains actual numbers (15, 30, 60, etc.) instead of the long label that came from the initial data column headers (so instead of BY4741_potassium_50mM_LogFC_t40m-4, the time points column just says 40, the time point of that entry).
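That timepoint extraction could be automated for future datasets; here is a sketch where the _t<minutes>m pattern is inferred from the one label quoted above:

```python
# Sketch: pull the numeric timepoint (minutes) out of a long column label
# like "BY4741_potassium_50mM_LogFC_t40m-4". The "_t<minutes>m" pattern
# is inferred from this one example and may not hold for every dataset.
import re

def timepoint_from_label(label: str) -> int:
    match = re.search(r"_t(\d+)m", label)
    if match is None:
        raise ValueError(f"no timepoint found in label: {label}")
    return int(match.group(1))

assert timepoint_from_label("BY4741_potassium_50mM_LogFC_t40m-4") == 40
```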
Now that the database code is in beta, we can shift to a server that will keep running permanently. @afiller1 will coordinate with Masao to get a new instance up on GRNsight’s final account, then point the beta server to that instance.
The new RDS instance is connected to beta and running. It contains the most up-to-date data, which I will upload to the GRNsight shared Box. I propose closing this issue and opening separate, specific issues if any problems occur with this aspect of my project. @dondi @kdahlquist
Closing!