citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.64k stars 670 forks source link

Reword "Shards of relations in outer join queries must have 1-to-1 shard partitioning" error message #310

Open onderkalaci opened 8 years ago

onderkalaci commented 8 years ago

Follow the steps below:

CREATE TABLE test_1 (key int, dummy int);
CREATE TABLE test_2 (key int, dummy int);
SELECT master_create_distributed_table('test_1', 'key', 'hash');
SELECT master_create_distributed_table('test_2', 'key', 'hash');
SELECT master_create_worker_shards('test_1', 4, 1);
SELECT master_create_worker_shards('test_2', 4, 1);

SELECT * FROM test_1 as a LEFT JOIN test_2 as b ON (a.dummy = b.dummy); 

I'd expect to see cannot run outer joins with re-partition jobs, but the actual message is: Shards of relations in outer join queries must have 1-to-1 shard partitioning, which is wrong since there is a 1-1 shard partitioning.

onderkalaci commented 8 years ago

One more note on this error message.

CREATE TABLE test_1 (key int, dummy int);
CREATE TABLE test_2 (key int, dummy int);
SELECT master_create_distributed_table('test_1', 'key', 'hash');
SELECT master_create_distributed_table('test_2', 'key', 'hash');
SELECT master_create_worker_shards('test_1', 4, 1);
SELECT master_create_worker_shards('test_2', 1, 1);

SELECT * FROM test_1 as a LEFT JOIN test_2 as b ON (a.key = b.key); 

Now we don't have 1-1 shard mappings, but the query works. So, the error message Shards of relations in outer join queries must have 1-to-1 shard partitioning seemed a bit confusing. Is it intentinal that we do not mention n-to-1 cases in the message?

marcocitus commented 8 years ago

Is it intentinal that we do not mention n-to-1 cases in the message?

Mostly because there's a lot of subtlety to it. n-to-1 is valid for left joins, but for right joins is 1-to-n, and for full joins 1-to-1 only. Then there's the fact that n-to-1 refers to shard counts and 1-to-1 refers to the number of overlapping shards for each shard. And what if you join 3 tables? The message describes the simple case, but I agree it's not great. Any thoughts on improving it?

ozgune commented 8 years ago

@onderkalaci -- could you copy/paste the exact error message that you see?

Is this issue related to the error message in #503?

ozgune commented 8 years ago

I'm copy/pasting an error message below that a user shared with us in private to capture more info. @marcocitus / @onderkalaci, how much effort is to make this error message more helpful to the user?

demo=# SELECT COUNT(t.id) FROM "event" "t" LEFT JOIN "sensor" ON sensor.id = t.sensor_id; ERROR: cannot perform distributed planning on this query DETAIL: Shards of relations in outer join queries must have 1-to-1 shard partitioning

ozgune commented 8 years ago

This particular error message came up again in a conversation last week. (+1)

@begriffs -- I also overheard that you were asking about the meaning of an error message related to joins. Was it this particular issue or the one mentioned in #577?

ozgune commented 8 years ago

This particular error message came up again in another conversation this week. (+1)

Hi - I've distributed two tables with the same method and same shards count, but are partitioned on separate keys. When I do a left join on these tables, I get an error: "Shards of relations in outer join queries must have 1-to-1 shard partitioning" . Can you please explain what am I doing incorrectly?

ozgune commented 8 years ago

@onderkalaci / @marcocitus -- We also had a few internal emails where users asked about this error message. Now that we have the notion of colocated tables built into Citus 6.0, is there a way to make this error message more helpful?

byucesoy commented 7 years ago

@ozgune I think using co-location related keywords would not help user to understand the problem in this case. We allow some outer joins even if tables are not co-located and we do not allow some outer joins even if tables are co-located.

I think Onder's suggestion is pretty good to explain the problem; cannot run outer joins with re-partition job

It is short and to the point. However the term "re-partition job" may not be meaningful for new users. Saying that, I cannot come up with better alternative.

@samay-sharma Do you have any suggestion?

samay-sharma commented 7 years ago

Hey @byucesoy , there seem to be 2 cases when we display this message.

One case is when the join type chosen is LOCAL_PARTITION_JOIN and !shardIntervalsMatch. Do you know what user behaviors / scenarios can lead us to this code path?

For the 2nd one, I think its better to show a different error message.

An error message like:- "cannot run outer join query which requires repartitioning" does seem like an improvement. Can we say something like below (i.e. is this accurate for scenarios when we hit this error)?

ERROR: cannot run outer join query if join is not on partition column
Detail: Outer joins requiring repartitioning are unsupported.

I'm trying to tie the error message to the user behavior, if possible, so that they know what to fix.

Also, I had a question. Is it possible that a user reached this as a result of missing out adding the partition column in the filter / join? If so, do we want to add a hint to add partition column to join clauses etc. like we are doing in #1074 ?

byucesoy commented 7 years ago

Hey @samay-sharma,

I think, we can use your suggestion for second usage of error message. In that case, error occurs because there is no partition column in the error message. However situation is different for first usage of the error message. I wrote an example;

SET citus.shard_replication_factor TO 1;

CREATE TABLE table1(column1 int);
CREATE TABLE table2(column1 int);

SELECT create_distributed_table('table1', 'column1', 'append');
SELECT create_distributed_table('table2', 'column1', 'append');

COPY table1 FROM STDIN;
1
\.

COPY table2 FROM STDIN;
1
\.

COPY table2 FROM STDIN;
1
\.

SELECT * FROM table1 FULL JOIN table2 ON table1.column1 = table2.column1;
ERROR:  cannot perform distributed planning on this query
DETAIL:  Shards of relations in outer join queries must have 1-to-1 shard partitioning

As you can see we can reach that path even if we join on partition column. In fact, we can only reach this path if there is a join on partition column. For this one, we can use something like this without mentioning partition column;

ERROR:  cannot perform distributed planning on this query
DETAIL:  Outer joins requiring repartitioning are unsupported.
byucesoy commented 7 years ago

There are two instances of this error message. With #1118, we changed one of them. The other one still needs some discussion.

Some queries we do not error out;

Some queries we error out;

CREATE TABLE hash_table(partition_column int);
CREATE TABLE reference_table(random_column int);

SELECT create_distributed_table('hash_table', 'partition_column');
SELECT create_reference_table('reference_table');

-- works
SELECT * FROM hash_table LEFT JOIN reference_table ON hash_table.partition_column = reference_table.random_column;

-- errors out
SELECT * FROM hash_table RIGHT JOIN reference_table ON hash_table.partition_column = reference_table.random_column;
SELECT * FROM hash_table FULL JOIN reference_table ON hash_table.partition_column = reference_table.random_column;

What do we mean by 1:1 matching?

/*
 * ShardIntervalsMatch returns true if provided shard interval has one-to-one
 * matching. Shards intervals must be not empty, and their intervals musht be in
 * ascending order of range min values. Shard interval ranges said to be matched
 * only if (1) they have same number of shards, (2) a shard interval on the left
 * side overlaps with corresponding shard on the right side, (3) a shard interval
 * on the right side does not overlap with any other shard. The function does not
 * compare a left shard with every right shard. It compares the left shard with the
 * previous and next shards of the corresponding shard to check they to not overlap
 * for optimization purposes.
 */

Also please note that we only perform this check if OUTER JOIN is a LOCAL_PARTITION_JOIN. Reference table joins are BROADCAST_JOIN therefore we do not perform this check for them, as a result OUTER JOINs containing reference table do not need to have same number of shards.