citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus
GNU Lesser General Public License v3.0
1.06k stars 63 forks source link

DISTINCT and JSON field selection #111

Closed wtfuzz closed 9 years ago

wtfuzz commented 9 years ago

I've been loading sample twitter data into a jsonb column (data), and was trying to get distinct values from a key.

Running this on the master

select distinct data->'user'->>'screen_name' from twitter;

Causes the following query to be run on the workers by watching pg_stat_activity

SELECT data FROM ONLY <shard> WHERE true

I would think it would be possible to pass distinctClause down to the workers, but it is also retrieving the entire data column, and not just the particular field.

For fun, I made this change which causes the DISTINCT to make its way to the workers:

diff --git a/pg_shard.c b/pg_shard.c
index 31ff752..17f49da 100644
--- a/pg_shard.c
+++ b/pg_shard.c
@@ -789,6 +789,7 @@ RowAndColumnFilterQuery(Query *query, List *remoteRestrictList, List *localRestr
        filterQuery->rtable = rangeTableList;
        filterQuery->jointree = fromExpr;
        filterQuery->targetList = targetList;
+        filterQuery->distinctClause = copyObject(query->distinctClause);

However, this obviously just does:

SELECT DISTINCT data FROM ONLY <shard> WHERE true

instead of data->'users'->>'screen_name'. I'm not super familiar with query internals yet to have it pass down the entire JSON field name, so was wondering if someone knew a quick way to do this.

gmcquillan commented 9 years ago

Unless you're doing a DISTINCT on the column being sharded specifically, wouldn't you have the problem of possibly merging duplicates for that column when the master assembles the return set?

onderkalaci commented 9 years ago

Hey @wtfuzz,

First, as @gmcquillan mentioned, if you're doing a DISTINCT on a non-partition column, its possible that some of the shards contain overlapping values for some of the values.

So, what you suggest is not safe to apply, we may end up with wrong results.

For non-partition columns, I cannot see any solutions without pulling all the data to the master. However, for partition column DISTINCT queries, it is possible to push down the query.

But in the near future, we do not plan such a feature to add pg_shard. You can check CitusDB, which is more powerful in terms of SELECT queries.

Thanks for reaching us! Onder