citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus
GNU Lesser General Public License v3.0
1.06k stars 63 forks source link

Support multi-statement transactions #87

Open bonesmoses opened 9 years ago

bonesmoses commented 9 years ago

Currently when attempting to execute a query within a transaction, the following output is produced:

ERROR:  distributed commands cannot run inside a transaction block

Since transactions are not supported, and only single INSERTS are allowed, this makes data import incredibly slow. Further, this makes pg_shard completely unsuitable for use within a production environment where data integrity is important.

jasonmp85 commented 9 years ago

Question

Which use-case are you more interested in?

Atomicity

pg_shard works well for applications that might otherwise be using e.g. mongoDB. From their documentation on the matter:

When a single write operation modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not atomic and other operations may interleave.

Because pg_shard is an extension you may selectively apply to certain tables, it can allow users to maintain an all-PostgreSQL datastore, using certain tables with your usual ORMs, etc., and interacting with others in a manner more similar to e.g. mongoDB.

Data Ingest

Though transactions are one way of handling bulk ingest, this could also be addressed with better COPY support, which is already in our backlog as #62. We also recently closed #61, though that issue was more about addressing usability than performance.

It's worth mentioning that a single PostgreSQL box does not scale well to many writers—no matter how many cores—due to existing issues with contention around buffer locking, etc. But because pg_shard is farming out its INSERTs to many smaller PostgreSQL instances, it promises the possibility of sidestepping those scaling limitations.

Philosophy

Ultimately, pg_shard is unlikely to become a product you can just drop in behind, say, Rails and have it "just work" in the sense that Rails is entirely agnostic to the sharding but can continue behaving as though it's talking to a single PostgreSQL box. But we do have our eye on certain targeted use cases and scaling performance going forward, so these bits of feedback from the community are great to have as we prioritize the backlog.

bonesmoses commented 9 years ago

The use case that has the most utility is a multi-statement transaction targeting the master and cascading to the child shards.

However, the primary problem is that pg_shard obfuscates its hashing and distribution algorithms (as it should, really) so there's no way to reliably target specific shards. That makes it impossible to even emulate transaction support at the application level. Beyond that though, is that the mere presence of a transaction context automatically disables the module.

Anyone using a connection with autocommit disabled won't be compatible with any pg_shard table in its current form. Even a single select statement wrapped in a transaction will fail. Or more likely, several queries, one of which targets a pg_shard table, would still be impossible.

digi604 commented 9 years ago

django and peewee use transactions by default as well and i am not even sure yet you are able to disable this behavior as it may be even a psycopg2 or psycopg2cffi thing. And i just rewrote the hole stack for postgres because mongodb sucks BADLY if you go over 100m rows.

jasonmp85 commented 9 years ago

@digi604: It appears possible to disable transactions within psycopg. See here and here for more information.

ddorian commented 9 years ago

I think it would be easier to support multiple statements in 1 sql-string all targeting the same shard (all of them specifying the sharding column) at first. (my usecase)

amiorin commented 8 years ago

@jasonmp85 I have a use case for Multi-statement transactions targeting a single shard: The event store of event sourcing (ES) design pattern. In this use case you have two tables: events table and optimistic-locks table that it's used to insert events transactionally in table events. ES on sql is simple, but it doesn't scale. ES on nosql is hard, but it scales. I was hoping pg_shard to have the best of both worlds, but it's more like nosql in this case.

I think voltdb is the only solution at the moment that scales like nosql and it is simple to implement.

More info about ES+CQRS:

ddorian commented 8 years ago

Note in voltdb you have transactions only in 1 db-call (ex sending multiple sql-statements in 1 string) and not several strings like you can do with plain postgresql/mysql. Meaning, session is autocommit. The presentation had too many buzzwords.

amiorin commented 8 years ago

@ddorian You are right about voltdb. Let's say it's possible to implement the optimistic lock in PL/pgSQL and pg_shard. I still have the problem that events_123 and optimistic_lock_123 tables are not on the same server. :-(