citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Consider making multi-tenant sharding explicit #1374

Open marcocitus opened 7 years ago

marcocitus commented 7 years ago

We currently require users to add tenant_id filters to all their queries, which significantly complicates the migration process. The client needs to specify the tenant_id so that we can efficiently determine which shard(s) to use. However, that does not mean the tenant_id always needs to be specified in the queries themselves.

We could introduce an explicit notion of a tenant that can be set before running a query. A table could be distributed by tenant even without having it as a column.

A possible implementation could then be to use a GUC to set the tenant.

CREATE TABLE hits (
  page_id int, 
  ip inet,
  hit_time timestamptz default now()
);
SELECT distribute_by_tenant('hits');

BEGIN;
SET citus.tenant TO 'citusdata.com';
INSERT INTO hits (page_id, ip) VALUES (3, '123.0.0.1');
END;

This means the client needs some specific logic to set the GUC at the start of the transaction, but it needs no other logic for migration or application conversion. An advantage is that the citus.tenant GUC can be coupled with a PostgreSQL role.

An alternative is to preserve the tenant_id column, but always fill it in based on the value of the GUC.

CREATE TABLE hits (
  tenant_id text not null,
  page_id int, 
  ip inet,
  hit_time timestamptz default now()
);
SELECT distribute_by_tenant('hits', 'tenant_id');

BEGIN;
SET citus.tenant TO 'citusdata.com';
INSERT INTO hits (page_id, ip) VALUES (3, '123.0.0.1');
END;

That way, commands like COPY and multi-shard SELECT can still work as expected.
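
For illustration (distribute_by_tenant is of course still only a proposal, and the file path is made up), a cross-tenant bulk load could keep providing the tenant_id explicitly, and cross-tenant reads could still reference the column:

COPY hits (tenant_id, page_id, ip) FROM '/tmp/all_hits.csv' WITH (FORMAT csv);
SELECT tenant_id, count(*) FROM hits GROUP BY tenant_id;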

lfittl commented 7 years ago

I think this is a very valuable train of thought, since application changes are currently our biggest blocker and the main effort required for multi-tenant use cases.

I wonder if we could do something similar for SELECTs as well, so you don't need to rewrite your queries to include the tenant_id anymore.

marcocitus commented 7 years ago

I think once you set citus.tenant, all queries prune down to the shard for that tenant, so additional filters or join requirements become unnecessary. Sharding by citus.tenant is like a lightweight way of sharding by schema, without all the downsides of sharding by schema.

The only case in which we wouldn't prune based on citus.tenant is DDL, but that would typically be done during a migration anyway.

COPY across multiple shards gets a bit trickier. You'd either have to do separate COPY commands for each tenant, or we could allow you to specify a tenant_id column in the COPY that is only used for pruning (with some hacks).
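
A rough sketch of the first option, doing a separate COPY per tenant under the proposed GUC (the file path is made up):

BEGIN;
SET citus.tenant TO 'citusdata.com';
COPY hits (page_id, ip) FROM '/data/citusdata_hits.csv' WITH (FORMAT csv);
COMMIT;
-- ...and a separate transaction like this for every other tenant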

INSERT..SELECT across tenants and more generally colocated multi-shard joins also get a bit trickier. We could assume that all tables are implicitly joined on tenant ID, which is simple, but changes SQL semantics. Alternatively, we could require foreign keys on join columns, such that colocation can be inferred. For pure multi-tenant use-cases, not having multi-shard INSERT..SELECT or colocated multi-shard joins might also be acceptable.
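
To illustrate the implicit-join idea (the pages table here is hypothetical, purely for the example):

SET citus.tenant TO 'citusdata.com';
-- under the "implicitly joined on tenant ID" assumption, both tables prune
-- to the citusdata.com shards and the join stays colocated
SELECT p.page_id, count(*)
FROM hits h JOIN pages p ON h.page_id = p.page_id
GROUP BY p.page_id;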

aamederen commented 7 years ago

In my opinion this would make things super easy on the user end. Instead of fixing things on the application side like the multitenant gem does, this could fix it once and for all. Looks super promising.

However, in my opinion, putting something like this in the core of Citus would make it too multitenant-oriented for me. If I were a developer considering Citus for some other use case, seeing something like that would make me reconsider.

My suggestion is that we make it a separate extension, just like the multitenant gem, but living in the database instead of in the application.

marcocitus commented 7 years ago

We discussed this proposal last week.

One of the conclusions is that the approach where the tenant_id is still part of the schema is preferable. The approach that omits it entirely would not simplify migration and would complicate COPY, multi-shard joins, and other operations.

Being able to omit the tenant ID from queries to speed up application migration appears useful. The GUC should be set at session level to also enable it for queries outside of a transaction block, and this should significantly simplify adding multi-tenancy support to ORMs.

However, using a session-level GUC becomes problematic when using pgbouncer in front of the coordinator, since pgbouncer does not guarantee that a SET citus.tenant TO 'foo' goes over the same server connection as a subsequent command coming from the same client connection. Client-side connection pools might introduce similar problems.

If the citus.tenant GUC is strictly coupled with a postgres role, then using a particular role always implies using a particular GUC value and there is no way for a connection pooler to mess up the GUC assignment. The problem is then one of efficient connection pooling with many roles. Pgbouncer keeps a separate pool for each database role, which makes it considerably less effective at reducing the number of connections. Client-side connection pools might have a similar problem.

An observation by @lfittl is that some Citus users want to use row-level security to isolate tenants, which also requires creating a postgres role for each tenant. To enable this, we would probably also want to add CREATE ROLE propagation to Citus.
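
A sketch of what that coupling could look like using standard per-role settings (citus.tenant itself is still only proposed):

CREATE ROLE citusdata_com LOGIN;
ALTER ROLE citusdata_com SET citus.tenant TO 'citusdata.com';
-- every session that logs in as citusdata_com then starts with the GUC preset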

mkurz commented 6 years ago

Hi Citus team,

I know this isn't a support channel, however I do have some questions related to this issue. We are currently evaluating Citus because the cloud product you offer seems like a promising choice in terms of scalability and reliability.

First, I have some remarks regarding statements that have already been made in this thread:

An observation by @lfittl is that some Citus users want to use row-level security to isolate tenants, which also requires creating a postgres role for each tenant.

Actually, this statement isn't entirely true: row-level security in Postgres itself (available since 9.5) does not require you to create a postgres role for each tenant. You can also just use a GUC, set it at the beginning of a session (via SET SESSION) or for a single transaction (via SET LOCAL) - just like in the first message of this thread - and then access that GUC inside a row-level security policy, filtering rows based on its value.
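
For example, using the hits table from above (app.current_tenant is just an illustrative name, and this assumes a text tenant_id column):

ALTER TABLE hits ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON hits
  USING (tenant_id = current_setting('app.current_tenant'));

SET SESSION app.current_tenant = 'citusdata.com';
SELECT count(*) FROM hits;  -- the policy filters to the current tenant's rows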

However, using a session-level GUC becomes problematic when using pgbouncer in front of the coordinator, since pgbouncer does not guarantee that a SET citus.tenant TO 'foo' goes over the same server connection as a subsequent command coming from the same client connection.

AFAIK this is only true if you run pgbouncer in transaction or statement pooling mode rather than session pooling mode. Even in transaction pooling mode you can still use SET LOCAL within a transaction, and that should be fine. See this explanation from the postgres mailing list.
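
For example, with the same illustrative GUC name as above:

BEGIN;
SET LOCAL app.current_tenant = 'citusdata.com';  -- only lives until the end of this transaction
SELECT count(*) FROM hits;
COMMIT;  -- the setting is reverted here, so transaction pooling cannot leak it to another client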

... The problem is then one of efficient connection pooling with many roles. Pgbouncer keeps a separate pool for each database role, which makes it considerably less effective at reducing the number of connections. Client-side connection pools might have a similar problem.

Could you not work around this by connecting to the coordinator with a single "super" user (like it's done now already anyway) and then just calling SET ROLE someuser (and maybe RESET ROLE at the end of the session/transaction for cleanup)? I did test this, however it seems that when calling SET ROLE someuser, that someuser not only has to exist in each worker db (of course) but also needs the LOGIN flag set on the workers... Why? Why not just connect from the coordinator to the workers with a single "super" user as well and simply propagate the SET ROLE someuser down to the workers? Why do you actually have to log in to the worker with the same role that was set on the coordinator? What's also weird is that on the coordinator I can create the role without the LOGIN flag, but for the workers we have to set it. Based on this observation, the following question comes to my mind:

I also have another question related to pgbouncer:

We do not plan to use pgbouncer since it has some disadvantages (for example, not being able to use the prepared statement cache). Our Java application already uses HikariCP as its connection pool (which is known as a very solid and fast connection pool in the Java world), so we don't need pgbouncer. Using HikariCP also gives us the advantage that we are able to use all the postgres and jdbc features available (like caching prepared statements). In pgbouncer terms, you could say HikariCP works in "session" pooling mode (since it's guaranteed you use the same connection from opening until closing it). However, even though we use HikariCP, we are also thinking about wrapping all our statements in a transaction anyway, so a transaction would equal a session in our application as well.

So in our application we are using row-level security right now, introduced in Postgres 9.5, to shard by tenant_id. We really like row-level security since we don't have to add the tenant_id to each query; we can just set it via SET SESSION or even SET LOCAL (when using transactions for everything), and the row-level security policy uses that GUC to decide which rows to filter. (So, as you can see, we are not using postgres roles.) As mentioned, since we will probably always use transactions and can just use SET LOCAL inside those transactions, we wouldn't even run into the session-related problems with pgbouncer (if we decided to use pgbouncer instead of HikariCP, or even if Citus uses pgbouncer internally somehow/somewhere).

After reading the blog post from @craigkerstiens, we are able to use run_command_on_shards to propagate the ENABLE ROW LEVEL SECURITY... statements to the tables on the workers, and to use the same command to propagate the needed CREATE POLICY... statements down to the workers as well.
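
For reference, using the hits table from above as a stand-in, the propagation looks roughly like this (the policy and GUC names are our own; %s is replaced with each shard's name by run_command_on_shards):

SELECT run_command_on_shards('hits', 'ALTER TABLE %s ENABLE ROW LEVEL SECURITY');
SELECT run_command_on_shards('hits',
  $$CREATE POLICY tenant_isolation ON %s
      USING (tenant_id = current_setting('app.current_tenant'))$$);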

So basically, with our setup (using RLS) we have already solved the problem you are trying to tackle with this issue (not needing the tenant_id in each query).

Or do you see any problems with our setup that I am not aware of? Does Citus work fine with our way of using row-level security (setting a GUC and using that GUC in the policy, instead of basing it on the current user)?

I have one last question which isn't related to this issue here:

Thank you very much for your time and the great work you do in pushing Citus forward!

mkurz commented 6 years ago

I found an answer to one of the questions I asked in my last comment:

  • Does Citus use pgbouncer internally somewhere (no matter whether it's your cloud solution, enterprise, or the community edition)? (I am not talking about the pgbouncer exposed to applications at port 6432.) For example, between the coordinator and its workers? If yes, in which mode (session or transaction pooling) is it operating? If not, how do you propagate the statements down to the workers from the coordinator? How many connections are open from the coordinator to each worker?

Answer from this comment by @marcocitus:

Since Citus Cloud has pgbouncers in transaction pooling mode in between the coordinator and the workers, ...

mkurz commented 6 years ago

You can fix this issue simply by using Postgres' built-in row-level security and propagating SET LOCAL statements to the workers. This way you can achieve exactly what you want to solve here using only Postgres features that are already available today (version 9.5 and above). See this comment for further explanation and an example.

marcocitus commented 4 years ago

Looked into this a bit more, but got bitten by doing shard pruning in the planner.

When doing something like the following:

SET citus.tenant_id TO 'tenant1';
PREPARE e AS SELECT * FROM hits;
EXECUTE e;
-- results for tenant1
SET citus.tenant_id TO 'tenant2';
EXECUTE e;
-- no results

Regardless of the implementation details, the query will not be re-planned for the second execution. That means that if we make a pruning decision in the planner based on the value of citus.tenant_id during the first execution, we will send the query to the wrong shard in the second execution.

The strategy can work if the application does not use prepared statements, but otherwise it depends on moving shard pruning to the executor.

mkurz commented 4 years ago

@marcocitus Does this also happen when using SET LOCAL within a transaction?

marcocitus commented 4 years ago

Yes, it's mainly broken for prepared statements, regardless of how you do the SET.

Things are changing a bit in Citus 9.2 though. Simple queries (without joins/subqueries) now do shard pruning in the executor.

mkurz commented 4 years ago

@marcocitus Thanks for letting me know. However, just to make sure: if I do the SET only once within a single transaction, but there are several transactions sent one after another over the same connection, does it work as expected?

BEGIN;
SET LOCAL citus.tenant_id TO 'tenant1';
PREPARE e AS SELECT * FROM hits;
EXECUTE e;
-- results for tenant1
COMMIT; -- or ROLLBACK, should not matter
BEGIN;
SET LOCAL citus.tenant_id TO 'tenant2';
PREPARE e AS SELECT * FROM hits;
EXECUTE e;
-- results for tenant2? correct?
COMMIT; -- or ROLLBACK, should not matter

That would be our scenario, because when using connection pooling we would of course not close the connection between transactions.