The two-step commit logic is great for resiliency, but we are not going to take on the extra complexity of multi-node setups. In a multi-log ecosystem like CT, a single node can provide sufficient availability for a single log.
Also note that there is no worry-free replacing or upgrading of individual nodes if the only node that "matters" is the leader. It just makes upgrades/rollbacks (if possible) a little faster.
Thanks for thinking about it. Makes sense to me!
I think if desirable all of this logic could live outside of the sunlight server. If only there were a nice way to share the cache…
With the two-step commit logic in https://github.com/FiloSottile/sunlight/pull/18, allowing safe concurrent access to the log, the most important pieces are in place for a multi-node, high-availability setup. Supporting multiple nodes is not one of the stated goals of Sunlight, but it seems worth considering anyway.
Trade-offs
The design document https://filippo.io/a-different-CT-log ("On availability") argues that a simpler, single-node log is a reasonable trade-off for the CT ecosystem. Supporting multiple nodes would add complexity to the Sunlight design. I do not know if that complexity would be worth the higher availability, and I do not know what the availability changes would be like in practice.
For a deployment on a cloud provider, a multi-node setup might be worthwhile if the object storage and the lock backend have significantly higher uptime than individual servers. Another benefit could be easier maintenance, if a multi-node setup allows worry-free replacing or upgrading of individual nodes.
Design sketch
One approach to supporting multiple nodes is to add leader election and request forwarding to the Sunlight server. A single leader node would actively pool certificates and update the backends, while the other nodes would forward incoming requests to the leader.
The existing lock backend could function as a leader-election mechanism. If the lock's timestamp is recent, a node can assume some other node is the active leader. If the timestamp is stale, the node can try to become the leader by updating the lock itself.
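A rough sketch of that election check in Go, where the `LeaderLock` record, the `LockBackend` interface, and the 30-second timeout are assumptions for illustration (Sunlight's actual lock backend stores signed checkpoints, not this metadata):

```go
// Hypothetical sketch of leader election on top of a compare-and-swap
// capable lock backend (e.g. DynamoDB with a conditional write).
package multinode

import (
	"context"
	"time"
)

// LeaderLock is a hypothetical record kept alongside the log's lock entry.
type LeaderLock struct {
	LeaderID  string    // identity of the node currently sequencing
	Timestamp time.Time // last time the leader refreshed the lock
}

// LockBackend is a stand-in for the real lock backend.
type LockBackend interface {
	Fetch(ctx context.Context) (LeaderLock, error)
	// CompareAndSwap replaces old with next only if old is still current.
	CompareAndSwap(ctx context.Context, old, next LeaderLock) error
}

const leaderTimeout = 30 * time.Second // assumed threshold for a "stale" lock

// tryBecomeLeader reports whether nodeID acquired (or already holds)
// leadership. If the existing lock was refreshed recently, some other
// node is assumed to be the active leader and we back off.
func tryBecomeLeader(ctx context.Context, b LockBackend, nodeID string) (bool, error) {
	cur, err := b.Fetch(ctx)
	if err != nil {
		return false, err
	}
	if cur.LeaderID != nodeID && time.Since(cur.Timestamp) < leaderTimeout {
		return false, nil // another leader looks alive
	}
	next := LeaderLock{LeaderID: nodeID, Timestamp: time.Now()}
	if err := b.CompareAndSwap(ctx, cur, next); err != nil {
		return false, nil // lost the race; treat as "not leader"
	}
	return true, nil
}
```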
To forward requests, either the lock could include the current leader's identity as metadata, or each node could try to connect to all the other nodes and ask whether they are the leader. Non-leader nodes can proxy incoming requests to the current leader.
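A follower could do that forwarding with a plain reverse proxy from the standard library; a minimal sketch, where `isLeader` and `leaderURL` are hypothetical hooks into the election logic above:

```go
// Hypothetical sketch of forwarding requests from a follower to the
// current leader using net/http/httputil.
package multinode

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// forwardingHandler serves requests locally when this node is the leader,
// and proxies them to the current leader otherwise.
func forwardingHandler(local http.Handler, isLeader func() bool, leaderURL func() (*url.URL, error)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isLeader() {
			local.ServeHTTP(w, r)
			return
		}
		u, err := leaderURL()
		if err != nil {
			http.Error(w, "no leader available", http.StatusServiceUnavailable)
			return
		}
		httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
	})
}
```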
That leaves the deduplication cache. Each node could keep its local deduplication cache up to date by repeatedly adding certificates downloaded from the storage backend. This would be complicated logic that does not exist yet. It might be nice to have that logic anyway, to support backfilling the cache for disaster recovery. Alternatively, something like LiteFS could share the cache between nodes, but that adds quite a bit more machinery.
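A rough sketch of that backfill loop, with hypothetical `EntryReader` and `DedupCache` interfaces standing in for the storage backend and the local cache (the real data layout and cache schema would differ):

```go
// Hypothetical sketch of keeping a follower's deduplication cache warm
// by replaying entries already published to object storage.
package multinode

import "context"

// LogEntry is a simplified view of a sequenced entry.
type LogEntry struct {
	CertHash  [32]byte
	LeafIndex int64
	Timestamp int64
}

// EntryReader streams already-sequenced entries from the storage backend,
// starting at a given index (e.g. by walking the published data tiles).
type EntryReader interface {
	ReadEntries(ctx context.Context, start int64) ([]LogEntry, error)
}

// DedupCache maps certificate hashes to the material needed to reissue SCTs.
type DedupCache interface {
	Put(hash [32]byte, leafIndex, timestamp int64) error
	Size() (int64, error) // number of entries already cached
}

// backfillCache pulls entries the node has not cached yet and inserts them,
// so a follower's cache tracks the leader's published log.
func backfillCache(ctx context.Context, r EntryReader, c DedupCache) error {
	next, err := c.Size()
	if err != nil {
		return err
	}
	for {
		entries, err := r.ReadEntries(ctx, next)
		if err != nil {
			return err
		}
		if len(entries) == 0 {
			return nil // caught up with the published tree
		}
		for _, e := range entries {
			if err := c.Put(e.CertHash, e.LeafIndex, e.Timestamp); err != nil {
				return err
			}
			next = e.LeafIndex + 1
		}
	}
}
```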
A node might be unhealthy in some way and still become the leader. For example, it might not be reachable by CAs yet still have access to the lock backend. Or it might be able to update the lock backend but not have write access to the storage backend. One way to handle that might be to add internal health monitoring to the leader: if the sequencer fails with a fatal error, or the node does not receive any incoming requests for a while, it gives up its leader position for some time.
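A minimal sketch of that self-demotion check (the `leaderHealth` type and the quiet-period threshold are made up for illustration):

```go
// Hypothetical sketch of a leader demoting itself when it looks unhealthy:
// either the sequencer reported a fatal error, or the node has not seen
// any incoming submissions for a while (suggesting CAs cannot reach it).
package multinode

import (
	"sync/atomic"
	"time"
)

type leaderHealth struct {
	lastRequest atomic.Int64 // unix nanos of the last incoming request
	fatalError  atomic.Bool  // set when the sequencer fails fatally
}

func (h *leaderHealth) RecordRequest()    { h.lastRequest.Store(time.Now().UnixNano()) }
func (h *leaderHealth) RecordFatalError() { h.fatalError.Store(true) }

// shouldStepDown reports whether the node should give up leadership for a
// while: stop refreshing the lock, let it expire, and refuse to re-acquire
// it until a cooldown passes and the node looks healthy again.
func (h *leaderHealth) shouldStepDown(quietPeriod time.Duration) bool {
	if h.fatalError.Load() {
		return true
	}
	last := time.Unix(0, h.lastRequest.Load())
	return time.Since(last) > quietPeriod
}
```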