citus-github-bot closed this issue 7 years ago
Comment by jasonmp85 Friday Mar 13, 2015 at 23:21 GMT
Which use-case are you more interested in?
pg_shard works well for applications that might otherwise be using e.g. MongoDB. From their documentation on the matter:
When a single write operation modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not atomic and other operations may interleave.
Because pg_shard is an extension you may selectively apply to certain tables, it allows users to maintain an all-PostgreSQL datastore, using some tables with your usual ORMs, etc., and interacting with others in a manner more similar to e.g. MongoDB.
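For context, distributing an individual table with pg_shard looked roughly like the sketch below; the table and column names are made up, and the exact function signatures varied between releases, so treat this as illustrative rather than authoritative.

```sql
-- Minimal sketch (hypothetical table/column names): enable the extension, then
-- mark a single table as distributed while the rest of the schema stays plain
-- PostgreSQL.
CREATE EXTENSION pg_shard;

CREATE TABLE events (
    customer_id bigint NOT NULL,
    payload     jsonb
);

-- hash-distribute on customer_id, then create 16 shards with 2 replicas each
SELECT master_create_distributed_table('events', 'customer_id');
SELECT master_create_worker_shards('events', 16, 2);
```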
Though transactions are one way of handling bulk ingest, this could also be addressed with better COPY support, which is already in our backlog as #62. We also recently closed #61, though that issue was more about addressing usability than performance.
It's worth mentioning that a single PostgreSQL box does not scale well to many writers, no matter how many cores, due to existing issues with contention around buffer locking, etc. But because pg_shard is farming out its INSERTs to many smaller PostgreSQL instances, it promises the possibility of sidestepping those scaling limitations.
Ultimately, pg_shard is unlikely to become a product you can just drop in behind, say, Rails and have it "just work" in the sense that Rails is entirely agnostic to the sharding but can continue behaving as though it's talking to a single PostgreSQL box. But we do have our eye on certain targeted use cases and scaling performance going forward, so these bits of feedback from the community are great to have as we prioritize the backlog.
Comment by trifthen Saturday Mar 14, 2015 at 18:20 GMT
The use case that has the most utility is a multi-statement transaction targeting the master and cascading to the child shards.
However, the primary problem is that pg_shard obfuscates its hashing and distribution algorithms (as it should, really), so there's no way to reliably target specific shards. That makes it impossible to even emulate transaction support at the application level. Beyond that, the mere presence of a transaction context automatically disables the module.
Anyone using a connection with autocommit disabled won't be compatible with any pg_shard table in its current form. Even a single SELECT statement wrapped in a transaction will fail. Or, more likely, several queries, only one of which targets a pg_shard table, would still be impossible.
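To make the failure mode concrete, here is a sketch of the pattern that breaks (the table name is illustrative); the explicit transaction block alone is enough to trigger the error, which is exactly what drivers with autocommit disabled produce behind the scenes.

```sql
-- Hypothetical sharded table: even a read-only statement fails once it is
-- wrapped in an explicit transaction block.
BEGIN;
SELECT count(*) FROM events WHERE customer_id = 42;  -- errors under pg_shard
COMMIT;
```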
Comment by digi604 Thursday Jul 23, 2015 at 22:27 GMT
Django and peewee use transactions by default as well, and I'm not even sure you can disable this behavior; it may even be a psycopg2 or psycopg2cffi thing. And I just rewrote the whole stack for Postgres because MongoDB performs badly once you go over 100M rows.
Comment by ddorian Tuesday Sep 08, 2015 at 22:54 GMT
I think it would be easier to first support multiple statements in one SQL string, all targeting the same shard (all of them specifying the sharding column). That's my use case.
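A sketch of the shape being suggested (table and column names are illustrative): two statements sent as one string, both constrained to the same value of the sharding column, so that in principle they could be routed to a single shard together.

```sql
-- Both statements pin customer_id = 42, so they hash to the same shard and
-- could be executed there as one unit if this were supported.
INSERT INTO events (customer_id, payload) VALUES (42, '{"type": "click"}');
UPDATE daily_counters SET clicks = clicks + 1 WHERE customer_id = 42;
```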
Comment by amiorin Monday Dec 07, 2015 at 09:25 GMT
@jasonmp85 I have a use case for multi-statement transactions targeting a single shard: the event store of the event sourcing (ES) design pattern. In this use case you have two tables: an events table and an optimistic-locks table that is used to insert events into the events table transactionally. ES on SQL is simple, but it doesn't scale. ES on NoSQL scales, but it's hard. I was hoping pg_shard would offer the best of both worlds, but in this respect it behaves more like NoSQL.
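A rough sketch of that event-store pattern, with made-up table and column names: both tables would be distributed on the same stream key, and the two statements would need to run in one transaction on one shard for the optimistic lock to mean anything.

```sql
BEGIN;
-- bump the expected version; if zero rows are updated, another writer won the
-- race and the application aborts instead of inserting
UPDATE optimistic_locks
   SET version = version + 1
 WHERE stream_id = 123 AND version = 41;

-- append the event only once the lock row has advanced
INSERT INTO events (stream_id, version, payload)
VALUES (123, 42, '{"type": "order_placed"}');
COMMIT;
```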
I think VoltDB is the only solution at the moment that scales like NoSQL while keeping the implementation simple.
More info about ES+CQRS:
Comment by ddorian Monday Dec 07, 2015 at 12:54 GMT
Note that in VoltDB you only get transactions within a single database call (e.g. sending multiple SQL statements in one string), not across several calls like you can with plain PostgreSQL/MySQL; in other words, the session is effectively autocommit. The presentation had too many buzzwords.
Comment by amiorin Monday Dec 07, 2015 at 13:07 GMT
@ddorian You are right about VoltDB. Even supposing the optimistic lock could be implemented in PL/pgSQL on top of pg_shard, I still have the problem that the events_123 and optimistic_lock_123 shard tables are not on the same server. :-(
This is a very important feature to have. It seems to me that the query planner should be able to figure out which shard each record listed in the insert statement goes to, split the records up across the appropriate shards, and pass the correct queries along. Is this not the case?
@brianbroderick In fact the query planner already does this for SELECT queries. The problem is that many people expect SQL statements sent to a database based on Postgres to have all the usual ACID properties. Correctly maintaining atomicity, among other things, is difficult when workers can fail in the middle of a transaction.
@trifthen -- thanks for your feedback on this issue.
We're working on prioritizing features for our v5.2 release, and I wanted to get your input on this. Reading through the issue's history, it feels like you've run into three separate issues. Is my understanding below accurate?
1. Bulk data ingest was slow because rows had to be loaded through individual INSERT commands.
2. INSERT commands within multi-statement transactions. This wasn't supported.
3. A single statement, even a SELECT or INSERT, inside a transaction block (as issued by ORMs with autocommit disabled). This also wasn't supported.
Among these three issues, which one was the most important for you?
With the older versions of pg_shard, even a SELECT within a transaction block would fail if it touched a table managed by pg_shard. Citus didn't seem to have this issue, so I didn't really investigate further. The new bulk import mechanics work a lot better than the old ones, as the examples show. Creating a temp table, filling it, and pushing it to a shard seems to operate at semi-normal speed, instead of a small fraction of it. Using that methodology, anyone can build an importer that loads in chunks and periodically calls the shard creation function after a predetermined number of rows to keep things orderly. Plain COPY seems much faster, too.
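For anyone following along, that chunked approach looks roughly like the psql sketch below; the table names are made up, and the append-shard helpers shown are the ones from Citus 5.x-era documentation, so double-check them against your version.

```sql
-- Stage a chunk locally, then append it to a fresh shard of the distributed
-- table 'events' (assumed append-distributed). Repeat per chunk.
CREATE TEMP TABLE events_stage (LIKE events INCLUDING DEFAULTS);
COPY events_stage FROM '/tmp/events_chunk_0001.csv' WITH (FORMAT csv);

SELECT master_create_empty_shard('events') AS shardid \gset
SELECT master_append_table_to_shard(:shardid, 'events_stage', 'localhost', 5432);

TRUNCATE events_stage;  -- ready for the next chunk
```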
Given that, I'm not quite sure how to respond to this. My initial issues seem to have been addressed one way or another. I admit I didn't try an INSERT within a transaction context during my latest round of testing, so if that's still not supported, that's pretty bad. We could theoretically get around it by making sure AUTOCOMMIT is enabled everywhere that touches the proposed warehouse. It would be somewhat disconcerting knowing transactions are "broken" in some manner, though.
As of now, my main concerns are the broken schema support (#122) and the lack of OUTER JOIN support (#26). I literally can't imagine warehouse queries that don't use LEFT JOIN somewhere. That alone is a dealbreaker for our BI department.
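For reference, the kind of query meant here is the everyday warehouse shape sketched below (the schema is illustrative): a fact table outer-joined to a dimension so that unmatched rows still show up in the report.

```sql
-- Keep orders with no matching customer row instead of silently dropping them.
SELECT c.region, count(o.order_id) AS order_count
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.region;
```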
Thanks! Better data ingest was our theme for v5.1, and plain COPY support will hopefully make bulk loads faster and smoother. Our scheduled release date for v5.1 is within two weeks.
For broken schema support (#122), we'll likely prioritize this issue for v5.2. On #26, I realize just now that we haven't even documented which types of outer joins we do and don't support (this also depends on the sharding type). We'll look to clarify that soon.
I'll keep this issue open for multi-statement transaction support. I'll also open another issue for single-statement transaction support -- it looks like several popular ORM tools generate BEGIN / END statements. We may look to address this in v5.2 as well.
Thanks again for sharing your feedback with us!
I am also interested in the first use case, Multi-statement transactions targeting a single shard.
MySQL Percona does synchronous replication. My understanding is that each slave sends some form of acknowledgement that tells the master whether the insert was successful. Can this concept be applied here? In other words, each shard would ack that it succeeded, and if everything was successful, the transaction commits; otherwise it rolls back.
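What's being described is essentially two-phase commit, and PostgreSQL already exposes the building blocks; the sketch below shows the idea manually against a single worker (the shard table and transaction names are made up, and the server needs max_prepared_transactions > 0). Citus's later cross-worker transaction support builds on the same mechanism.

```sql
-- Phase 1: each worker does its part and prepares; surviving PREPARE is its "ack".
BEGIN;
INSERT INTO events_102008 (customer_id, payload) VALUES (42, '{}');
PREPARE TRANSACTION 'tx_events_42';

-- Phase 2: once every worker has acked, the coordinator commits everywhere...
COMMIT PREPARED 'tx_events_42';
-- ...or, if any worker failed to prepare, rolls back the workers that did:
-- ROLLBACK PREPARED 'tx_events_42';
```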
Which use-case are you more interested in?
- Multi-statement transactions targeting a single shard
Many applications keep their own metadata in the database: users, groups, tariffs, etc. This data is small and gains nothing from sharding, but Citus effectively forces you to distribute it anyway, because it does not support mixing distributed and regular tables in queries. Distributing those tables has a big disadvantage: you lose transactions. So a part of the application like billing is a poor fit for a Citus cluster and becomes painful for developers, because they have to split the database in two, big data (Citus) and billing data (ordinary PostgreSQL), and change the application logic accordingly.
This is a real disadvantage when moving an app to Citus.
But it would disappear if you allowed transactions for tables with only one shard and one replica. :+1:
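For what it's worth, the "one shard, one replica" shape can already be expressed with the Citus 5.x-era API, roughly as below (the table name is illustrative, and whether queries may then mix it with distributed tables still depends on the version):

```sql
-- Distribute the small metadata table into exactly one shard with one replica,
-- so all of its rows live on a single worker.
CREATE TABLE tariffs (id int PRIMARY KEY, name text, price numeric);
SELECT master_create_distributed_table('tariffs', 'id', 'hash');
SELECT master_create_worker_shards('tariffs', 1, 1);
```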
Multi-statement transactions spanning many shards +1
We had an internal conversation on providing distributed transaction support and broke down the work involved into three components. I'm documenting our categorization below.
Fixed by #1529
Multi-statement transactions that go to a single worker have been supported since 5.2. Multi-statement transactions across workers will be supported in Citus 7.0.
Could you provide a link to the documentation for this?
Issue by trifthen Thursday Mar 12, 2015 at 20:11 GMT _Originally opened as https://github.com/citusdata/pg_shard/issues/87_
Currently, attempting to execute a query within a transaction produces an error.
Since transactions are not supported, and only single INSERTs are allowed, this makes data import incredibly slow. Further, this makes pg_shard completely unsuitable for use within a production environment where data integrity is important.
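Concretely, the ingest shape being asked for is ordinary batched loading (names are illustrative), which pg_shard rejected at the time:

```sql
-- Many rows per transaction instead of one autocommitted INSERT per row.
BEGIN;
INSERT INTO events (customer_id, payload) VALUES (1, '{}');
INSERT INTO events (customer_id, payload) VALUES (2, '{}');
-- ... thousands more rows ...
COMMIT;
```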