citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Create documentation about transactional guarantees and replication #403

Closed kellabyte closed 7 years ago

kellabyte commented 8 years ago

I think it would be really useful to have documentation linked somewhere that discusses what transactional guarantees Citus provides and, additionally, how the replication protocol works.

It was discussed on IRC with @jasonmp85 that Citus offers read-your-own-writes, so some documentation explaining how that is provided would be good.

Some details about the distributed model and replication model etc.

There was discussion that Citus fits analytical workloads better than OLTP workloads. Explain this point, and explain why that is with regard to how the system works.

lfittl commented 8 years ago

For reference purposes, here is the part of the discussion from IRC: https://gist.github.com/lfittl/d54dc3e9754b975c936e#file-gistfile1-txt-L61

A bit hard to read the non-wrapped lines, but if someone wants to read it, there it is :)

gregscds commented 8 years ago

I wanted to read it, and I cleaned up the text as I was doing it.

Development history

Up until now, Citus (née CitusDB) didn't support DML statements. The bread-and-butter it offered was an SQL query interface to plain-jane PostgreSQL instances running within a cluster. A master node acted as the coordinator of this cluster, keeping track of metadata and planning distributed jobs to answer user queries (distributed joins, filters, sorts, transforms, etc.). Data was loaded (and shards created) using a \stage command, which took input data, split it into shards, and atomically created those shards and replicas. Don't read too much into the word "replica": they are just copies of each other.

To meet the needs of customers who were hacking in their own DML query workarounds, we wrote and open-sourced pg_shard, which was essentially an overlay extension that understood Citus metadata and shard behavior, but could act alone. As of Citus 5.0, that work is integrated into Citus itself, so you get (a) the distributed query/job planning, (b) \stage "bulk load" ingest, and (c) the DML support offered by pg_shard.

The pg_shard DML propagation essentially uses statement-level replication to send incoming statements to the affected nodes. This has several drawbacks: no multi-shard DML support, no multi-statement transaction support, and it uses shard locks on the master to ensure safe commutativity of operations (i.e. no racing UPDATEs applying to replicas in different order). When pg_shard was written, the replication mechanisms present in PostgreSQL offered little beyond being able to completely (byte-for-byte) replicate a machine. Then 9.4 brought LLSR (logical log streaming replication), which offers the potential for streaming subsets of data in a more efficient (and less manual) fashion.
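To make the commutativity point concrete, here is a minimal sketch (not Citus code, just an illustration) of why racing, non-commutative UPDATEs must be serialized by a shard lock: if two replicas apply the same two statements in different orders, they diverge.

```python
# Two non-commutative "statements" applied to two replicas.
# Without a shard lock, nothing stops them arriving in a
# different order on each replica.

def apply(statement, replica):
    # a "statement" here is just a function mutating a row dict
    statement(replica)

set_to_5 = lambda row: row.update(balance=5)       # UPDATE ... SET balance = 5
double   = lambda row: row.update(balance=row["balance"] * 2)  # ... SET balance = balance * 2

replica_a = {"balance": 1}
replica_b = {"balance": 1}

# Replica A sees SET then DOUBLE; replica B sees DOUBLE then SET.
for stmt in (set_to_5, double):
    apply(stmt, replica_a)
for stmt in (double, set_to_5):
    apply(stmt, replica_b)

print(replica_a)  # {'balance': 10}
print(replica_b)  # {'balance': 5} -- the replicas have diverged
```

A shard lock on the master forces both statements into one order on every replica, which is exactly what rules out this divergence.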

Transaction replication

You can see your own writes; if a replica fails during your write, it is marked as unhealthy until a repair operation is run. If a replica fails during a read, you will fail over to the next replica, guaranteed to see your own writes, but perhaps seeing stale data with respect to other sessions' racing writes. So there's our history: the inception was really "load a bunch of data, use distributed processing to get at it, and it's 'just SQL'", but then the DML overlay came along with simple single-statement, single-shard distribution.

Replicated writes are synchronous and completed before the ack of the write returns to the client. If you get an ack to commit, your write went through (subject to the PostgreSQL settings on your remote nodes; i.e. if you turn off sync or something, we can't do much).

But since we're moving to masterless, consistency is becoming more present in our minds. We are likely not going to target full ACID or global linearizability, but we want to improve our isolation and replication picture. Obviously there's the "client and server need 2PC" inconsistency question that aphyr brought up in his PostgreSQL post a while back (if the connection drops during commit, the client doesn't know whether the write occurred).

You can say Citus does read-your-own-writes because it's guaranteed your read won't run before the write has landed on all replicas: every replica has to ack synchronously before the read can happen.
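The write path described above can be sketched in a few lines; this is a simplified model under the stated assumptions (synchronous ack from every healthy replica, failed replicas marked unhealthy until repair), with hypothetical names, not the actual Citus implementation:

```python
# Sketch: a write is acked only after every healthy replica applies
# it synchronously; a replica that fails mid-write is marked
# unhealthy until a repair operation is run.

class Replica:
    def __init__(self):
        self.rows = {}
        self.healthy = True

class Shard:
    def __init__(self, n_replicas=2):
        self.replicas = [Replica() for _ in range(n_replicas)]

    def write(self, key, value):
        acked = False
        for r in self.replicas:
            if not r.healthy:
                continue
            try:
                r.rows[key] = value   # a real node could fail here
                acked = True
            except Exception:
                r.healthy = False     # unhealthy until repaired
        if not acked:
            raise RuntimeError("no healthy replica accepted the write")

    def read(self, key):
        # Every healthy replica already acked the write, so any of
        # them serves the session's own writes; fail over as needed.
        for r in self.replicas:
            if r.healthy:
                return r.rows.get(key)
        raise RuntimeError("no healthy replica")

shard = Shard()
shard.write("k", 42)
shard.replicas[0].healthy = False   # simulate a replica failure
print(shard.read("k"))              # 42: failover still sees the write
```

The key property is that the ack happens only after the loop over healthy replicas completes, which is why a subsequent read from any healthy replica cannot miss the session's own write.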

Use cases

If you're thinking "OLTP/ORM use case", i.e. "PostgreSQL semantics, distributed", I obviously can't rule out anything, but that's unlikely to be what we're targeting (and you'd probably be better off with 2PC, a global transaction/snapshot manager, or something like PostgreSQL-XC if that works for you).

I think this kind of thing might be obliquely inferred from what we highlight in our use cases (i.e. reading into what's not there), but that's obviously not terribly explicit.

Transaction visibility

You can think of the various replication approaches in terms of (a) distribution model, (b) replication model, (c) fallouts thereof (how they interact with reads and writes). The standard PostgreSQL transaction model is documented here: http://www.postgresql.org/docs/9.5/static/transaction-iso.html

You can talk about guarantees in terms of isolation levels, or in terms like "linearizable" and so on. And you can further explain how the guarantees are implemented by explaining the replication protocol, query planner, etc. For example, HyperDex has documentation about how hyperspace hashing works, and about how hyperspace hashing plus its replication protocol together provide ACID guarantees.

DMLs in Citus are already read-your-own-writes, and I don't think we want to relax that at any rate. Configurable relaxation might be nice at some point, but I would focus on making a correct system first before making a looser one.

Comparing to other data models

The RAMP stuff's been on my mind as we go toward masterless, but a difference between us and some other systems (as an extreme example, VoltDB) is that we aren't going to impose query-language constraints or particular paradigms (above and beyond PostgreSQL SQL itself) on users. Especially since people who pick up "distributed" Postgres will assume they're getting strongly consistent guarantees, since Postgres is ACID.

VoltDB does a lot of specific things to ensure transaction guarantees, because the server has to generate certain values itself. rand() and date() calls, for example, can't be in SQL statements: it replicates a command log, so if it replayed those commands it would get different values on each replica.
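The replay problem is easy to demonstrate. A minimal sketch (not VoltDB code): replaying the *command* regenerates the random value, so replicas with independent RNG state diverge, whereas logging the already-resolved *value* keeps them identical.

```python
import random

# Command-log replication replays the command itself, so a
# non-deterministic call like rand() is re-executed per replica.
def replay_command(rng):
    # stands in for replaying "INSERT ... VALUES (rand())"
    return rng.random()

rng_a = random.Random(1)   # each replica has its own RNG state
rng_b = random.Random(2)
print(replay_command(rng_a) != replay_command(rng_b))  # True: replicas diverge

# Logging the resolved value instead keeps replicas identical.
value = random.Random(1).random()   # evaluated once, at the origin
log = [("INSERT", value)]
replica_a = [v for _, v in log]
replica_b = [v for _, v in log]
print(replica_a == replica_b)       # True: value replication is safe
```

This is why statement/command-replicating systems either ban non-deterministic functions in statements or rewrite them to pre-evaluated constants before shipping.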

If you have a point where everything is turned into a single log, it's a lot easier to guarantee it'll be linearizable. You can also do batched transfers nicely, because you're acking batches, not statements. It's simpler to do things like lower replication guarantees by allowing that to be async and acking prematurely. And since you can do batch transfers, it's much more WAN-friendly.

In a more complicated replication system, it gets really tough to add different guarantees. You could do something like synchronous log replication inside a LAN and asynchronous log replication over the WAN. The "ack synchronously after it's on a disk in this DC, asynchronously replicate to another DC" pattern is a common one. In too many DBs you need to loosen the local guarantees to make it work over a WAN, or you have to manage two entirely different replication systems.
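That sync-local/async-remote pattern can be sketched in a few lines. This is a hypothetical illustration of the pattern itself (names are made up, not a Citus or PostgreSQL API): the ack returns once every local replica has the entry, while the remote DC is fed later from a queue in batches.

```python
import queue

# "Ack after synchronous local replication, asynchronously
# replicate to another DC" -- a minimal model of the pattern.

class LocalReplica:
    def __init__(self):
        self.log = []

class RemoteDC:
    def __init__(self):
        self.log = []

def write(entry, local_replicas, wan_queue):
    # 1. Synchronous step: every local replica must apply the entry
    #    before we ack the client.
    for r in local_replicas:
        r.log.append(entry)
    # 2. Asynchronous step: enqueue for the remote DC; don't wait.
    wan_queue.put(entry)
    return "ack"

def wan_shipper(wan_queue, remote):
    # Drains whatever has accumulated as one batched WAN transfer,
    # which is why log shipping is WAN-friendly.
    batch = []
    while not wan_queue.empty():
        batch.append(wan_queue.get())
    remote.log.extend(batch)

locals_ = [LocalReplica(), LocalReplica()]
remote = RemoteDC()
q = queue.Queue()

assert write("tx1", locals_, q) == "ack"
assert all(r.log == ["tx1"] for r in locals_)  # ack implies local durability
assert remote.log == []                        # remote DC lags behind
wan_shipper(q, remote)
assert remote.log == ["tx1"]                   # catches up in a batch
```

The point of the sketch is that the two guarantee levels (synchronous local, asynchronous remote) live in one log-shipping mechanism rather than two separate replication systems.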

ozgune commented 8 years ago

@gregscds -- thank you for the comprehensive summary!

We also heard this question multiple times in the corresponding Hacker News thread (top two questions)

@kellabyte, I'm assigning this issue to myself, and I'll own documenting our replication model and transactional guarantees. I'll also incorporate parts of Greg's write-up into that document, and I expect to have this up on our website by April.

kellabyte commented 8 years ago

:+1:

ozgune commented 7 years ago

@kellabyte / @gregscds -- I wanted to provide a long overdue comment on this issue.

First, we have a blog post available in draft form that explains Citus' replication models in detail and how they evolved over time. We plan to publish this post later in the week. If you'd like to have an early preview, could you drop me a note at ozgun at citusdata.com?

Second, we find that our customers use one of two replication models with Citus. In particular, Citus deployments could use either the built-in statement based replication or rely on streaming replication. This has several implications across product and deployment dimensions:

  1. Citus Community comes with built-in statement replication. Citus Cloud uses streaming replication. This is because we can manage the complexity around streaming replication setup and node failover more easily in the Cloud.
  2. Citus customers usually have two use cases: multi-tenant applications and real-time analytics (also known as HTAP). Statement replication works well for the real-time analytics use case, but has certain challenges for use with multi-tenant databases. We explained some of those challenges in #810.

We intend for the blog post to articulate Citus' replication model(s) in detail and also describe the above learnings.

ozgune commented 7 years ago

@kellabyte / @gregscds -- I wanted to drop a quick update on this issue. We just published a blog post that explains Citus' two replication models and how they evolved over time: https://www.citusdata.com/blog/2016/12/15/citus-replication-model-today-and-tomorrow/

I'm planning to keep this issue open for another two weeks to answer any questions or feedback points. Please let us know if that doesn't work.

ozgune commented 7 years ago

@kellabyte / @gregscds -- We summarized the two replication models available in Citus in the blog post above. I'm closing this issue with that in mind. If you have any questions or comments, please feel free to reopen this issue or ping me at ozgun @ citusdata.com.