fermyon / installer

Fermyon Installer
https://fermyon.dev
Apache License 2.0

Scaling Fermyon to multiple nodes #63

Open radu-matei opened 2 years ago

radu-matei commented 2 years ago

Currently, the Terraform configuration that deploys Fermyon on AWS only creates one node — we should explore scaling the cluster beyond a single node.

ref #62

FrankYang0529 commented 2 years ago

I would like to take this issue. My first thought is to use systemd to manage Consul, Nomad, and Vault on multiple nodes.

After deploying the full HashiCorp stack, we may first need to add scale-out support to Bindle. If we can do all of this, the Fermyon platform can run on multiple nodes.

vdice commented 2 years ago

@FrankYang0529 Sounds like a great plan! Agreed, converting the HashiCorp services to systemd is the first prerequisite, so that they withstand instance restarts and process terminations. The Consul, Vault and Nomad configuration updates you've mentioned sound right to me.
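For illustration, a multi-node Nomad setup usually splits the agent configuration into server and client files. The following is only a rough sketch, not the installer's actual configuration: the data directory, IP addresses, and server count are all placeholder assumptions.

```hcl
# server.hcl (hypothetical): run on each of the Nomad server nodes.
data_dir = "/opt/nomad/data" # assumed path

server {
  enabled = true

  # Wait for three servers before electing a leader (assumed cluster size).
  bootstrap_expect = 3

  server_join {
    # Placeholder addresses of the server nodes.
    retry_join = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
  }
}
```

```hcl
# client.hcl (hypothetical): run on each node that should execute workloads.
data_dir = "/opt/nomad/data" # assumed path

client {
  enabled = true

  server_join {
    # Same placeholder server addresses as above.
    retry_join = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
  }
}
```

Each systemd unit would then start the agent with something like `nomad agent -config=/etc/nomad.d`, where the config directory is again an assumption.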

I'd say scaling Bindle can be an optional follow-up. Bindle doesn't necessarily need to run on every Nomad agent/node -- it can run as a service of count 1 and Nomad will just make sure it is scheduled appropriately. In this case, we could also utilize a host volume to at least make sure bindles are persisted at the host level, pending support for scaling the service out (or other persistence options).
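As a rough sketch of the single-instance approach, a Nomad job for Bindle with a host volume might look like the following. The job, group, and volume names, the datacenter, the `exec` driver, and the mount path are assumptions for illustration. The `--directory` flag is assumed to select the bindle-server storage directory.

```hcl
# Hypothetical job spec; names and paths are placeholders.
job "bindle" {
  datacenters = ["dc1"]
  type        = "service"

  group "bindle" {
    # A single instance; Nomad reschedules it if the allocation fails.
    count = 1

    # Requests a host volume named "bindle-data". Each eligible client must
    # declare it in its agent config, for example:
    #   client { host_volume "bindle-data" { path = "/opt/bindle" } }
    volume "bindle-data" {
      type      = "host"
      source    = "bindle-data"
      read_only = false
    }

    task "bindle" {
      # Assumed driver; chosen here because it supports volume mounts.
      driver = "exec"

      volume_mount {
        volume      = "bindle-data"
        destination = "/data"
      }

      config {
        command = "bindle-server"
        # Assumed flag: store bindles on the mounted host volume.
        args = ["--directory", "/data"]
      }
    }
  }
}
```

Note that the job can only be placed on clients that declare the host volume, so the data stays tied to those hosts.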

We'd naturally want to increase the hippo replica count (or convert it to a system job) for HA. Traefik should probably either become a system job, so that it runs on each agent node, or be converted to a systemd service alongside Nomad/Consul/Vault, again so that it runs on each agent node/host.
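To illustrate the system-job option for Traefik, here is a minimal sketch. The job name, datacenter, driver, port, and Traefik arguments are assumptions, not the installer's current job file.

```hcl
# Hypothetical job spec; names, port, and arguments are placeholders.
job "traefik" {
  datacenters = ["dc1"]

  # A system job places one allocation on every eligible client node.
  type = "system"

  group "traefik" {
    network {
      port "http" {
        static = 80
      }
    }

    task "traefik" {
      # Assumed driver; runs the binary directly on the host.
      driver = "raw_exec"

      config {
        command = "traefik"
        args = [
          # Listen on port 80 on each node.
          "--entrypoints.web.address=:80",
          # Discover backends from the local Consul catalog.
          "--providers.consulcatalog=true",
        ]
      }
    }
  }
}
```

For hippo, the simpler change would be raising `count` in its group from 1 to a higher number while keeping `type = "service"`.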

FrankYang0529 commented 2 years ago

@vdice Thanks for your suggestion! It looks like a workable plan. For Bindle, I feel that we still need the ability to scale out. If we use a host volume, we can't afford to lose that node. We can do this step by step. Let me work on the HashiCorp stack first. 👍🏻