saicitus commented 8 years ago

Steps

CREATE TABLE t_left(val int);
select master_create_distributed_table('t_left','val','hash');
select master_create_worker_shards('t_left',16,1);
INSERT INTO t_left VALUES(1);
INSERT INTO t_left VALUES(2);
INSERT INTO t_left VALUES(3);
INSERT INTO t_left VALUES(4);
INSERT INTO t_left VALUES(5);

CREATE TABLE t_right(val int);
select master_create_distributed_table('t_right','val','hash');
select master_create_worker_shards('t_right',16,1);
INSERT INTO t_right VALUES(2);
INSERT INTO t_right VALUES(3);
INSERT INTO t_right VALUES(4);

select * from t_left as l LEFT JOIN t_right as r ON l.val=r.val where r.val is NULL;

Expected

val	val
1	∅
5	∅

(2 rows)

Actual

val	val
1	∅
5	∅
4	∅
3	∅
2	∅

(5 rows)

marcocitus commented 8 years ago

We made some late changes in outer joins to translate anti-joins back to left joins, but for hash-distributed tables PruneShardList now wrongly interprets r.val IS NULL as a column filter and prunes all the shards except the one which contains hash(NULL). A quick workaround is to use IS NULL on a different column.

A possible fix would be to disregard <partition key> IS NULL clauses during shard pruning.

marcocitus commented 8 years ago

The reason that this is not caught in the regression tests, which include anti-joins with correct results, is that shard pruning appears to work differently for hash-distributed and append-distributed tables regarding IS NULL clauses an the regression tests use append-distributed tables.

sumedhpathak commented 8 years ago

Assigning to @jasonmp85

jasonmp85 commented 8 years ago

OK, so we've been doing some forensics to figure out (a) why NULL is mapped to partition zero, (b) what code paths are affected, and (c) whether this is a "feature" we actually need.

@lithp started out the conversation with:

Making it consistent by not transforming IS NULL to = 0 sounds sensible. Maybe we could also require that the partition column be NOT NULL so the requirement we're enforcing surfaces to users sooner.

I agreed, but had some vague memory of this being an intentional change. @lithp thought this emanated from the original pg_shard codebase, but hit a dead end at the point of the initial pg_shard open-source import from our closed-source repo.

After poking around historic copies of various repos, I think the chronology was:

I was working on pg_shard's shard pruning logic, but we had changes from Citus proper we wanted to ensure would be 100% compatible
@metdos was making those changes in Citus. The code review was largely stable, but not yet merged. Based on discussions, we copied some of the logic from that review into pg_shard, before it ever actually made it into Citus
That code review was closed and Citus (largely) also got the same logic for hash partition pruning

The net result of the above is that it looks like this hit Citus after pg_shard, even though it originally came from a code review opened against Citus.

After remembering those events, I think this change was introduced to ensure the hash partition pruning algorithm remained identical to existing behavior in hash repartition JOINs, which return 0 for NULL columns. I also (erroneously) believed that NULL values could be INSERTed to distributed tables, something @marcocitus had already pointed out as false.

@metdos chimed in with:

As Jason noted, I wrote shard pruning on the hash-partitioned tables. At that time, we decided to map NULL values to 0. I don't remember any discussion about not allowing NULL values at that time.

This morning, @marcocitus reiterated that NULL values were never permitted by our INSERT pipeline:

We already prevent inserting NULL values for the partition column into hash-distributed tables.
postgres=# INSERT INTO test VALUES (NULL);
ERROR:  cannot plan INSERT using row with NULL value in partition column

I still need to verify that e.g. copy also rejects NULL partition values, but if it does, this removes the ingest objection to just removing @lithp's original suggestion.

With that reminder, I still thought the repartitioning logic might use the codepath in question. Turns hash repartition joins (worker_hash_partition_table) just calls the hash function for the column in question, using % to "bucket" the values into hash bins. This is substantially different than the old pg_shard code, as referenced in #129 (which @sumedhpathak pointed out this morning).

So (assuming the copy path also forbids NULL values), all the obstacles to the "simple" approach have been removed.

jasonmp85 commented 8 years ago

For future reference, the alternative way of addressing this bug would be to not push down IS NULL constraints during outer joins (instead applying them on the master during the final pass). I believe something could be modified during this section to accomplish this. Then we'd be able to preserve NULLability of partition columns, if we desired.

But since that's already not permitted, it seems like no problem to go with @lithp's original suggestion.

jasonmp85 commented 8 years ago

OK, copy also forbids NULL in the partition column:

ERROR:  22004: cannot copy row with NULL value in partition column
CONTEXT:  COPY test, line 1: "NULL"

citusdata / citus

Hash-distributed anti-joins return wrong results #628

Steps

Expected

Actual