edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Storing state related to the cluster #117

Open · danielballan opened this issue 6 years ago

danielballan commented 6 years ago

Things we are trying to keep in mind:

  1. Since we are managing our own cluster (via kops) there is the risk that it could get broken in a way that we don't know how to fix. We should design our deployment such that re-deploying the cluster from scratch is not costly.
  2. We should avoid coupling ourselves tightly to a specific cloud provider. Today we have AWS credits, but in the future we may have reasons to want Google, Azure, university-affiliated resources, etc.
  3. One big reason for creating Scanner was to save on costs. If we aren't careful about the AWS resources we use, running Scanner may be no cheaper than the old system.

We have two kinds of state:

  1. The database (currently Postgres on RDS).
  2. The cache (currently Redis, deployed inside the cluster).

For the database, we currently use RDS. This satisfies (1) because it lives outside the cluster, and it satisfies (2) because, although RDS is proprietary, Postgres itself is an open standard available anywhere. For the cache, we use a Redis image deployed inside the cluster, which violates (1). We could instead use Elasticache, which meets (1) and (2) -- analogous to our use of RDS. However, that strategy is not optimal with respect to (3): both RDS and Elasticache are more expensive than hand-deployed Postgres and Redis.
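
To make the coupling point concrete, here is a minimal sketch assuming the services take plain connection URLs from environment variables (the names `DATABASE_URL` and `REDIS_URL` are illustrative, not necessarily what the apps use today). With this shape, swapping RDS, Elasticache, or a hand-deployed instance is purely a configuration change:

```python
# Sketch: resolve backing stores from environment variables so the same
# code runs against RDS/Elasticache, hand-deployed EC2 instances, or
# in-cluster services without modification.
import os

import psycopg2  # standard Postgres driver
import redis     # redis-py

# e.g. postgresql://scanner:secret@example-host.us-west-2.rds.amazonaws.com:5432/web_monitoring
db_conn = psycopg2.connect(os.environ["DATABASE_URL"])

# e.g. redis://example-cache-host:6379/0 -- the same URL form works for
# Elasticache, an EC2-hosted Redis, or a Redis service inside the cluster.
cache = redis.Redis.from_url(os.environ["REDIS_URL"])
```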

My personal opinion is that we are more crunched for developer-time than AWS credit at this moment, so we should use RDS and Elasticache for now and keep an eye on the operating costs. We know it is technically straightforward to hand-deploy postgres and Redis on EC2 instances outside the cluster. Once we have a good handle on the operating costs, we can judge whether it is worth the potential cost savings to do so.

Thoughts?

danielballan commented 6 years ago

Somewhat tangentially, we should also factor lifecycle management of the S3 buckets into our decisions about operating costs.
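
For illustration, here is a minimal sketch of what bucket lifecycle management could look like via boto3; the bucket name and retention period are placeholders, not settings we have agreed on:

```python
# Sketch: apply a lifecycle rule so old object versions are cleaned up
# automatically instead of accruing storage costs forever. The bucket
# name and the 90-day window below are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-web-monitoring-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)
```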

Mr0grog commented 6 years ago

Note we have no AWS credits at the moment 😬, so we need to evaluate dollars on hand before concluding:

we are more crunched for developer-time than AWS credit at this moment

Otherwise I’d very much agree :)

Either way, though, I think our usage of Redis (for caching or queueing), when correctly sized, is not large (we are currently somewhat over-provisioned with an xlarge [for caching]). At the level we actually need, the cost difference may not be enough to matter anyway (Elasticache runs something like 1.25x to 1.5x the cost of an equivalent EC2 instance).


Also, there’s an important distinction to make here: Redis as a cache (disposable; it can always be rebuilt) vs. Redis as a queue (where we’d like some durability guarantees).

Mr0grog commented 6 years ago

Also also: I’m not at all hot on managing Postgres ourselves — automatic snapshots and backups and so on are worth so much. I don’t feel like Elasticache offers nearly as much that is special about Redis, though (so how I’d weight the costs is different for Postgres vs. Redis).
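
For context, here is a rough sketch of the kind of backup guarantee RDS handles at creation time (identifier, credentials, and sizes are placeholders, not our real settings); doing the equivalent by hand means running and monitoring our own dump/snapshot jobs:

```python
# Sketch: RDS gives automated daily snapshots and point-in-time recovery
# just by setting a retention period and backup window at creation time.
# All identifiers and values below are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-west-2")
rds.create_db_instance(
    DBInstanceIdentifier="web-monitoring-db-example",
    Engine="postgres",
    DBInstanceClass="db.t2.medium",
    AllocatedStorage=100,                 # GiB
    MasterUsername="scanner",
    MasterUserPassword="change-me",
    BackupRetentionPeriod=7,              # keep 7 days of automated backups
    PreferredBackupWindow="08:00-09:00",  # UTC
)
```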

danielballan commented 6 years ago

Note we have no AWS credits at the moment

Goes to show that I only partially keep up with Slack, and apparently with the passage of time!

automatic snapshots and backups and so on are worth so much. I don’t feel like Elasticache offers nearly as much that is special about Redis

Fair point. A documented and separately automated process (Ansible?) for deploying Redis on EC2 would be satisfactory, I think, for both the queue and the cache.

Mr0grog commented 6 years ago

A documented and separately automated process (Ansible?) for deploying Redis on EC2 would be satisfactory, I think, for both the queue and the cache.

Well, I do think it’s still worth talking through whether Elasticache makes sense. I think I just laid out a bunch of pro and con points in a super disorganized way. I’m not actually sure whether I think we should be using it or not (while I am sure we should be using RDS).

But either way! We should probably make an issue to A) document all our non-kubernetes resources & deployments and B) automate them (whether by Ansible or Fabric or Terraform or straight-up Python scripts). The -kube repo should probably become the -deployment or -ops or -sre repo.
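
As a starting point for (B), here is a rough sketch in the "straight-up Python script" style, using boto3 to launch an EC2 instance and install Redis via cloud-init user data; the AMI ID, key pair, and security group are placeholders, and real hardening (password, protected mode, persistence settings) is omitted:

```python
# Sketch: launch an EC2 instance and install Redis on first boot via
# user data. Access control is assumed to come from the security group.
# AMI ID, key pair, and security group ID are placeholders.
import boto3

USER_DATA = """#!/bin/bash
apt-get update
apt-get install -y redis-server
# Listen on all interfaces; the security group restricts who can connect.
sed -i 's/^bind .*/bind 0.0.0.0/' /etc/redis/redis.conf
systemctl restart redis-server
"""

ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.run_instances(
    ImageId="ami-00000000000000000",   # placeholder Ubuntu AMI
    InstanceType="t2.small",
    MinCount=1,
    MaxCount=1,
    KeyName="example-keypair",
    SecurityGroupIds=["sg-00000000"],
    UserData=USER_DATA,
    TagSpecifications=[
        {"ResourceType": "instance",
         "Tags": [{"Key": "Name", "Value": "web-monitoring-redis-cache"}]},
    ],
)
```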

danielballan commented 6 years ago

Summary of conversation on call: Given current cost constraints, stick with manually-managed Redis but document the process. Maybe someday move that process into Ansible. Stick with RDS.

Mr0grog commented 6 years ago

Update: the credits issue is resolved. Here’s my proposal for now: keep the cache where it is, but move the Redis queues to Elasticache.

Unlike the cache, we do want to have some resiliency guarantees for the queues, and offloading that guarantee to AWS is nice (they aren't critical, but it would be a big bonus).

I can see eventually moving the Redis queues to a hand-managed instance, but by the time we get there we will have hopefully rewritten the DB and our queue management story will probably be pretty different anyhow.

Mr0grog commented 6 years ago

Side note: I need to re-deploy the cache machine in us-west-2. It’s currently in us-east-1 because that's where the API service was when it was Heroku-based. (#119)

Mr0grog commented 6 years ago

#119 (shutting down the old cache and starting a new one in the right data center) is now done. All that’s left here is actually setting up Elasticache correctly.
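
For reference, here is a rough sketch of what that Elasticache setup might look like for the queues (a small replication group with automatic failover and daily snapshots); the IDs, node type, subnet group, and security group below are placeholders, not the real deployment values:

```python
# Sketch: a Redis replication group with a replica, automatic failover,
# and snapshot retention -- the resiliency guarantees discussed above,
# offloaded to AWS. All identifiers and values are placeholders.
import boto3

elasticache = boto3.client("elasticache", region_name="us-west-2")
elasticache.create_replication_group(
    ReplicationGroupId="web-monitoring-queues-example",
    ReplicationGroupDescription="Redis for web-monitoring job queues",
    Engine="redis",
    CacheNodeType="cache.m4.large",
    NumCacheClusters=2,                 # primary + one replica
    AutomaticFailoverEnabled=True,
    SnapshotRetentionLimit=5,           # keep 5 days of snapshots
    SnapshotWindow="09:00-10:00",       # UTC
    CacheSubnetGroupName="example-subnet-group",
    SecurityGroupIds=["sg-00000000"],
)
```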