authzed / spicedb

Open Source, Google Zanzibar-inspired permissions database to enable fine-grained authorization for customer applications
https://authzed.com/docs
Apache License 2.0

support read-replicas in PostgreSQL #1320

Closed — vroldanbet closed this issue 2 months ago

vroldanbet commented 1 year ago

See https://github.com/authzed/spicedb-operator/issues/195#issuecomment-1543594303

Folks use read replicas as a means to scale read-heavy workloads. This isn't trivial due to asynchronous replication typically happening in these setups. Nevertheless, we could explore a way to detect if data has been replicated so that SpiceDB can trust the state of the replica hit for its computations.

williamdclt commented 1 year ago

Despite it not being supported, in our product we do use read replicas. Here's how we made it work:

Infrastructure: We use AWS RDS Aurora. It gives us a "master" endpoint and a "replica" endpoint: all writes have to go to the former, while reads can go to either. Reading from the replica has no consistency guarantee (it might be lagging; in practice the lag tends to be low single-digit milliseconds). We can create as many replicas as we want, and Aurora load-balances automatically when using the replica endpoint. We only have a single replica for our 400 req/s: we found we need a fairly large database instance (r6g.large, for the RAM), so one is enough to handle the load. I don't exclude adding more replicas in the future, though.

Kubernetes: We have two SpiceDB services: spicedb-master and spicedb-replica. spicedb-master is configured to use the master database, and spicedb-replica is configured to use the replica pool. spicedb-replica has readonly enabled (writes would fail anyway, as Postgres replicas don't accept them).

Routing: Where we send CheckPermission queries depends on the consistency requirement.

This routing is done client-side, in an internal reusable SDK, so it's transparent to the services using it. We could have implemented a middleware service sitting between clients and the two SpiceDB services, but I don't want to maintain one for now, and it would add response-time overhead.
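For illustration, the client-side routing rule described above could be sketched like this (the function and endpoint names `pickEndpoint`, `spicedb-master`, and `spicedb-replica` are placeholders for this setup, not part of any real SDK):

```go
package main

import "fmt"

// Consistency mirrors the consistency options on SpiceDB's CheckPermission API.
type Consistency int

const (
	MinimizeLatency Consistency = iota
	AtLeastAsFresh
	FullyConsistent
)

// pickEndpoint decides which SpiceDB deployment a request should hit,
// based on the consistency requirement of the call.
func pickEndpoint(c Consistency) string {
	switch c {
	case FullyConsistent, AtLeastAsFresh:
		// These must observe the latest writes (or a given zedtoken),
		// so route to the deployment backed by the primary database.
		return "spicedb-master"
	default:
		// minimizeLatency already tolerates the quantization window,
		// so replication lag on top of it is an acceptable tradeoff.
		return "spicedb-replica"
	}
}

func main() {
	fmt.Println(pickEndpoint(FullyConsistent))
	fmt.Println(pickEndpoint(MinimizeLatency))
}
```

Keeping this decision in the SDK means each calling service gets the right endpoint without knowing the topology exists.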

Consistency guarantees: In this setup we have the same guarantees for fullyConsistent and atLeastAsFresh as we would in a "normal" setup with a non-replicated datastore. For minimizeLatency, we accept replication lag on top of the quantization interval (set to 5s); as we expect this lag to be low (single-digit milliseconds) in normal operation, it shouldn't noticeably impact UX. It is possible for an incident to push replication lag much higher (I've seen multi-minute lags happen), in which case users can definitely see out-of-date permissions. That would be a problem of course, but we assess both the risk and the impact as much lower than what can happen in a non-replicated setup (database outages, writes being affected by reads or vice versa, no possibility of horizontal scaling, long downtime for database operations...).

manderson202 commented 1 year ago

We have a setup similar to the above, except we are using RDS with vanilla PostgreSQL logical replication. We have accepted that read replicas will be eventually consistent (equivalent to minimizeLatency); since the lag is typically very small, that tradeoff is acceptable for us. It would be great to have first-class support for this, though, because we may not be aware of a future change that would invalidate our assumptions.

Another problem we've noticed, however, relates to initializing new replicas, where initial reads are not able to see any data. My understanding is that the data replicated over to the new instance carries transaction IDs larger than that instance's current snapshot; the reads fail because SpiceDB doesn't "see" those newer transactions. Not sure if there is a good way around this, but I wanted to mention it.

This could also be a problem when creating a new clone using pg_dump/pg_restore for troubleshooting, or potentially for major version upgrades, depending on the process.

Is my understanding correct on this? Are there ways to handle initializing a new database from an existing one?

vroldanbet commented 1 year ago

@manderson202 sorry I totally missed your question 🙏🏻

Another problem we've noticed, however, is related to initializing new replicas where initial reads are not able to see any data. My understanding of why is that the data replicated over to the new instance will have txn-ids stored in the tables larger than that instance's current snapshot. The reads fail because SpiceDB doesn't "see" these larger txns. Not sure if there is a good way around this, but wanted to mention.

I believe that's right: only when the replica has caught up will SpiceDB be able to "see" the txid. If a new replica is added behind a load balancer while it is still considerably behind in replication, it will cause that kind of error. I'd suggest only adding a replica behind the load balancer once it has mostly caught up and its replication lag is under a healthy threshold.

The effect of a newly spun-up replica is no different from a replica's typical lag; it just increases the likelihood of missing revisions.

This could also be a problem with creating a new clone using pg_dump/pg_restore for troubleshooting or potentially major version upgrades depending on process.

Correct, folks have reported that pg_dump/pg_restore does not work at all. I imagine this is because the txids of the restored data do not match those stored in SpiceDB's internal MVCC table. For now, we recommend using SpiceDB's native BulkImport and BulkExport functionality.
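As a sketch of that recommendation: the zed CLI exposes the bulk export/import APIs via its backup subcommands (command shape as in recent zed releases; the endpoints, token, and filename below are placeholders):

```shell
# Export the schema and all relationships from the source cluster,
# using SpiceDB's bulk-export API rather than a database-level dump.
zed backup create spicedb.backup --endpoint old-spicedb:50051 --token "$TOKEN"

# Restore into the new cluster. Because this writes through SpiceDB
# itself, fresh transaction IDs are assigned consistently with the
# target database's MVCC state, avoiding the pg_dump/pg_restore issue.
zed backup restore spicedb.backup --endpoint new-spicedb:50051 --token "$TOKEN"
```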

williamdclt commented 1 year ago

we've had folks reporting that pg_dump/pg_restore does not work at all

That's a bit worrying! If this doesn't work, I could imagine other operations breaking too.

If I understand correctly, the problem would manifest as values in the deleted_xid column being higher than the current txid. This would make all these relationships effectively invisible to spicedb, correct? Would it therefore fall back to the "fully consistent" snapshot?

josephschorr commented 1 year ago

That's a bit worrying!

It's actually fairly expected, given how the transaction IDs are used.

Therefore falling back to the "fully consistent" snapshot ?

They wouldn't apply at all, in fact. In short, it would be an indeterminate state.

williamdclt commented 1 year ago

It's actually fairly expected, given how the transaction IDs are used.

For a maintainer, yes. For a user, though? I'd say it's an implementation concern leaking outside its abstraction boundary. I don't think it's reasonable to expect users of spicedb to be aware of how zookie-based consistency is implemented, or of its consequences for their datastore operations.

I think that "I cannot backup and restore my database with standard tooling" is pretty unexpected behaviour!

josephschorr commented 1 year ago

I agree, I'm just pointing out that it isn't an unexpected bug, given the design chosen.

vroldanbet commented 1 year ago

@williamdclt I'd suggest opening a new issue to keep track of the challenges of backup-restore with SpiceDB's MVCC implementation in Postgres.

josephschorr commented 9 months ago

As a note, we added a spicedb datastore repair command in v1.28 that should allow for easy repair of the transaction ID counter.
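A hedged sketch of what invoking that might look like (the exact subcommand and flag spellings should be checked against the v1.28 release notes and `spicedb datastore repair --help`; the connection URI is a placeholder):

```shell
# Point the repair command at the restored Postgres database so it can
# advance the transaction-ID counter past the imported rows, making
# previously "invisible" relationships readable again.
spicedb datastore repair transaction-ids \
  --datastore-engine postgres \
  --datastore-conn-uri "postgres://user:pass@localhost:5432/spicedb"
```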

vroldanbet commented 2 months ago

This has now been implemented via https://github.com/authzed/spicedb/pull/1878 and should make it into the next release.