MarkIannucci / headscale-on-fly-io

MIT License

Deploy Primary<->Secondary using LiteFS for replication #6

Open MarkIannucci opened 1 year ago

MarkIannucci commented 1 year ago

The current deployment will go hard down and lose data when Fly loses the host that our app is running on. Fly will restore our persistent volume from the most recent snapshot, so anything written after that snapshot is gone.

Fly.io recently released LiteFS to deliver distributed SQLite. LiteFS supports automatic failover through Consul, or it can run with a static primary and secondaries. The application needs to be aware of LiteFS's primary file so it can forward writes to the primary node, where they will be processed correctly. Modifying headscale to do this is beyond the scope of this effort, especially since they plan to release write-forwarding functionality soon that won't need application modification.
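For reference, LiteFS exposes the current primary through a `.primary` file in its mount directory: the file is present on replicas and contains the primary's hostname, and it is absent on the primary itself. Below is a minimal sketch of the check an application would need before forwarding writes. The `/litefs` mount path and the function name are placeholders, and headscale does nothing like this today.

```python
from pathlib import Path
from typing import Optional


def current_primary(mount_dir: str = "/litefs") -> Optional[str]:
    """Return the primary's hostname, or None if this node is the primary.

    LiteFS writes a `.primary` file into its mount directory on replicas
    only, so a missing file means the local node holds the primary role.
    The default mount path is an assumption for illustration.
    """
    primary_file = Path(mount_dir) / ".primary"
    try:
        return primary_file.read_text().strip()
    except FileNotFoundError:
        return None
```

A write handler would call `current_primary()` and, if it returns a hostname, send the request to that host instead of writing locally.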

We won't use the automated failover functionality because I can't figure out how to force a connection to the primary machine given current headscale + tailscale code. Instead, we will deploy two different apps in different regions, mark one as primary and the other as secondary, and see if we can connect the volumes using the private network functionality. If that works, we will then create a callable workflow which we can use to trigger a manual failover between the two. We will use some external DNS to route to the static primary app.

MarkIannucci commented 1 year ago

Spent a bit more time thinking about this problem and realized that I could functionally force connection to the primary container by putting the secondaries across the world. That sounds like fun and it will be less work because we don't have to write the manual failover code, so I'm going to do that.

MarkIannucci commented 1 year ago

I couldn't figure out how to get the app to consistently start in one region, which this approach implicitly required. See #13, #14.

Instead we will use Consul, but deploy with the lease candidate functionality.

MarkIannucci commented 1 year ago

Reading through the issues, it looks like static is the way to go currently. I will proceed in that direction.

MarkIannucci commented 1 year ago

We will use a volume snapshot for the secondary nodes, an environment variable to define the primary environment, and imperative commands to scale up after initial deploy. Deploys will cause outages because the candidate region will contain only one node, which will be the primary.
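A rough sketch of the environment-variable check described above, assuming the primary is named by a variable called `PRIMARY_REGION` (that name is an assumption here) and compared against `FLY_REGION`, which Fly sets on each running machine:

```python
import os
from typing import Mapping


def is_primary_candidate(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when this node runs in the designated primary region.

    FLY_REGION is provided by Fly at runtime. PRIMARY_REGION is an assumed
    name for the variable that marks which region may hold the primary role.
    """
    region = env.get("FLY_REGION")
    primary = env.get("PRIMARY_REGION")
    # A node with either variable missing is never treated as a candidate.
    return region is not None and region == primary
```

With only one node in the candidate region, restarting that node leaves nothing eligible to take over, which is why deploys cause an outage.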

MarkIannucci commented 1 year ago

Lots of progress in #19. I still need to write a README with instructions on how to deploy using this code, and to confirm that my theory on how to fail over works (swap the values in the primary and secondary regions).
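The failover theory can be sketched as a plain swap. This is only an illustration of the idea and is untested against a real deployment. The `Regions` type, the `fail_over` function, and the region codes in the usage note are hypothetical and are not taken from #19.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Regions:
    """Hypothetical holder for the two region settings of a deployment."""
    primary: str
    secondary: str


def fail_over(regions: Regions) -> Regions:
    """Return a new configuration with the primary and secondary swapped."""
    return replace(regions, primary=regions.secondary, secondary=regions.primary)
```

For example, `fail_over(Regions(primary="ord", secondary="ams"))` yields a configuration whose primary is `ams`. Redeploying with the swapped values would then be the manual failover step.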