RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Have we considered moving our databases to an EBS volume? #2197

Open edeutsch opened 11 months ago

edeutsch commented 11 months ago

It seems like 95% of our problems deploying to ITRB are the "databases", i.e. massive files that need to be copied to our servers every time we deploy something and the wind is blowing from the west.

Might we solve most of these unpleasant problems by just migrating all of our databases (the ones listed in https://github.com/RTXteam/RTX/blob/master/code/config_dbs.json) to a central MySQL server?
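To make the idea concrete, here is a minimal sketch of what a lookup might look like after such a migration, assuming the data currently lives in local database files and would instead be queried from a central server. The hostname, credentials, schema, table, and function name below are all hypothetical placeholders, not anything that exists in the repo today:

```python
# Minimal sketch: query a hypothetical central MySQL server instead of
# opening a large local database file that had to be downloaded at deploy time.
import pymysql

def get_curie_synonyms(curie: str) -> list[str]:
    connection = pymysql.connect(
        host="central-mysql.example.org",  # placeholder host
        user="arax_readonly",              # placeholder read-only user
        password="********",
        database="translator_dbs",         # placeholder schema
    )
    try:
        with connection.cursor() as cursor:
            # Placeholder table/column names; the real schema would be defined
            # when the artifacts are loaded into the RDBMS.
            cursor.execute(
                "SELECT synonym FROM node_synonyms WHERE curie = %s", (curie,)
            )
            return [row[0] for row in cursor.fetchall()]
    finally:
        connection.close()
```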

This does create a single point of failure for all our instances. But we effectively already have a single point of failure in the arax-responses MySQL server, and it has never failed. Meanwhile, each of these individual database files is an independent point of failure, and at least one of them somewhere is in a broken state about 5% of the time.

These databases cause roughly 20% of deployments to take a very long time while the files download, and about half of those deployments then fail because at least one database did not download completely.

There is the added complexity of managing an RDBMS and loading all our artifacts into it, and perhaps a slight speed penalty, too.

What do you think?

saramsey commented 10 months ago

Agreed, MySQL is rock solid. Interesting idea.

This is a good discussion topic for when we have a block of time and a whiteboard.

saramsey commented 10 months ago

Based on today's conversation, there is interest in an AWS solution: keep an S3-backed snapshot of an EBS volume containing an ext4 filesystem with all the databases, then (at build time) provision a new EBS volume from that snapshot, attach the volume to the instance, and mount the filesystem as /translatordb or whatever.
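A minimal sketch of what that build-time step could look like with boto3 is below. The snapshot ID, availability zone, instance ID, device name, and /translatordb mount point are all placeholders, and device naming differs on Nitro/NVMe instances:

```python
# Sketch: provision an EBS volume from a snapshot, attach it, and mount it.
# All identifiers below (snapshot ID, AZ, instance ID, device) are placeholders.
import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Create a volume from the snapshot holding the ext4 database filesystem.
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",  # placeholder snapshot ID
    AvailabilityZone="us-east-1a",        # must match the instance's AZ
    VolumeType="gp3",
)
volume_id = volume["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# 2. Attach the volume to the target instance.
ec2.attach_volume(
    VolumeId=volume_id,
    InstanceId="i-0123456789abcdef0",     # placeholder instance ID
    Device="/dev/sdf",
)
ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])

# 3. Mount the filesystem (the device may show up as /dev/nvme1n1 on Nitro
#    instances; adjust accordingly).
subprocess.run(["sudo", "mkdir", "-p", "/translatordb"], check=True)
subprocess.run(["sudo", "mount", "/dev/xvdf", "/translatordb"], check=True)
```

The appeal, if this works as hoped, is that nothing large is downloaded at deploy time: the data already lives in the snapshot, and the volume is usable as soon as it is attached (with blocks lazily loaded from S3 on first read).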

edeutsch commented 10 months ago

Today/tonight, after merging a provisioning PR (https://github.com/RTXteam/RTX/pull/2206/files), Pouyan tried to get KG2 up and running again. It resulted in two separate downtimes of 2h45m each while it was "downloading the databases":

[screenshot showing the two downtime periods]