Closed serprex closed 5 years ago
Can we add regression tests (or point to existing regression tests) for the things you've tried to #2844?
Maybe also add VACUUM
's new options TRANCATE/index_cleanup
to not lose track
We should probably add some tests with explicitly MATERIALIZED ctes, both for router queries that @serprex has worked on the previous release or multi-shard queries
WITH s1 AS MATERIALIZED (SELECT FROM users_table WHERE user_id = 1) SELECT FROM s1;
and
WITH s1 AS MATERIALIZED (SELECT FROM users_table WHERE user_id = 1) SELECT FROM s1;
just for completeness
I've went over the new features, please see my [IMPORTANT] tags below, we should address those.
Most of my comments below are some examples which I think valuable to have as a regression test. They are just sanity checks that things work fine.
0) WHERE clause in COPY FROM
: [IMPORTANT]: As I noted in the PR, we should error out for now because it is behaving wrong at the moment. https://github.com/citusdata/citus/pull/2844/files#r314717100
1) REINDEX CONCURRENTLY
CREATE TABLE test (x INTEGER);
INSERT INTO test SELECT generate_series(1, 1000);
SELECT create_distributed_table('test', 'x');
set citus.log_remote_commands TO ON;
SET client_min_messages TO DEBUG4;
CREATE INDEX i_test ON test (x);
---- [IMPORTANT]: Citus does not propage REINDEX CONCURRENTLY!!!!
-- this only happens on the coordinator table
-- and we should probably either error out or fully implement
REINDEX INDEX CONCURRENTLY i_test;
REINDEX TABLE test;
REINDEX SCHEMA public;
REINDEX DATABASE postgres;
REINDEX DATABASE "ec2-user";
2) plan_cache_mode and prepared statements
CREATE TABLE test_table (id int, sum int, avg float);
SELECT create_distributed_table('test_table', 'id');
INSERT INTO test_table SELECT (i * random())::int % 10000, (i * random())::int % 10000,
((i * random())::int % 10000) * random() FROM generate_series(0, 1000000) i;
PREPARE p1 (INTEGER) AS SELECT sum(sum) FROM test_table WHERE id = $1;
SET plan_cache_mode TO 'auto';
EXECUTE p1(1);
EXECUTE p1(2);
EXECUTE p1(3);
EXECUTE p1(4);
EXECUTE p1(5);
EXECUTE p1(6);
EXECUTE p1(6);
SET plan_cache_mode TO 'force_generic_plan';
-- OOPS FAILURE
-- [IMPORTANT]: Can we not error out here?
EXECUTE p1(1);
ERROR: could not create distributed plan
DETAIL: Possibly this is caused by the use of parameters in SQL functions, which is not supported in Citus.
HINT: Consider using PL/pgSQL functions instead.
SET plan_cache_mode TO 'force_custom_plan';
EXECUTE p1(1);
EXECUTE p1(2);
EXECUTE p1(3);
EXECUTE p1(4);
EXECUTE p1(5);
EXECUTE p1(6);
EXECUTE p1(6);
3)CTE Materialization: I think you should consider adding these tests or others to the regression tests
-- some sample tables
CREATE TABLE single_hash_repartition_first (id int, sum int, avg float);
CREATE TABLE single_hash_repartition_second (id int, sum int, avg float);
CREATE TABLE ref_table (id int, sum int, avg float);
SELECT create_distributed_table('single_hash_repartition_first', 'id');
SELECT create_distributed_table('single_hash_repartition_second', 'id');
SELECT create_reference_table('ref_table');
INSERT INTO single_hash_repartition_first SELECT (i * random())::int % 10000, (i * random())::int % 10000,
((i * random())::int % 10000) * random() FROM generate_series(0, 1000000) i;
INSERT INTO single_hash_repartition_second SELECT (i * random())::int % 10000, (i * random())::int % 10000,
((i * random())::int % 10000) * random() FROM generate_series(0, 1000000) i;
INSERT INTO ref_table SELECT (i * random())::int % 10000, (i * random())::int % 10000,
((i * random())::int % 10000) * random() FROM generate_series(0, 1000000) i;
-- a sample router query with NOT MATERIALIZED
-- which pushes down the filters to the CTE
EXPLAIN ANALYZE WITH cte1 AS NOT MATERIALIZED
(
SELECT
id
FROM
single_hash_repartition_first t1
)
SELECT
count(*)
FROM
cte1, single_hash_repartition_second
WHERE
cte1.id = single_hash_repartition_second.id AND single_hash_repartition_second.id = 45;
-- same query, without NOT MATERIALIZED, which is already default
-- which pushes down the filters to the CTE
EXPLAIN ANALYZE WITH cte1 AS
(
SELECT
id
FROM
single_hash_repartition_first t1
)
SELECT
count(*)
FROM
cte1, single_hash_repartition_second
WHERE
cte1.id = single_hash_repartition_second.id AND single_hash_repartition_second.id = 45;
-- same query, with MATERIALIZED
-- which prevents pushes down the filters to the CTE
-- and hence becomes real-time query
EXPLAIN ANALYZE WITH cte1 AS MATERIALIZED
(
SELECT
id
FROM
single_hash_repartition_first t1
)
SELECT
count(*)
FROM
cte1, single_hash_repartition_second
WHERE
cte1.id = single_hash_repartition_second.id AND single_hash_repartition_second.id = 45;
-- similar query, with MATERIALIZED
-- now manually have the same filter in the CTE
-- and hence becomes router query again
EXPLAIN ANALYZE WITH cte1 AS MATERIALIZED
(
SELECT
id
FROM
single_hash_repartition_first t1
WHERE
id = 45
)
SELECT
count(*)
FROM
cte1, single_hash_repartition_second
WHERE
cte1.id = single_hash_repartition_second.id AND single_hash_repartition_second.id = 45;
-- now, have a real-time query w/wout MATERIALIZED
-- these are sanitiy checks, because all of the CTEs are recursively
-- planned and there is no benefit that Citus can have
EXPLAIN ANALYZE WITH cte1 AS MATERIALIZED
(
SELECT
id
FROM
single_hash_repartition_first t1
WHERE
sum = 45
)
SELECT
count(*)
FROM
cte1, single_hash_repartition_second
WHERE
cte1.id = single_hash_repartition_second.id AND single_hash_repartition_second.sum = 45;
EXPLAIN ANALYZE WITH cte1 AS NOT MATERIALIZED
(
SELECT
id
FROM
single_hash_repartition_first t1
WHERE
sum = 45
)
SELECT
count(*)
FROM
cte1, single_hash_repartition_second
WHERE
cte1.id = single_hash_repartition_second.id AND single_hash_repartition_second.sum = 45;
4) Partitioned tables can now be referenced as foreign keys: Can we just add regression tests to make sure that everything works fine
CREATE TABLE collections_list (
key bigint,
ts timestamptz,
collection_id integer,
value numeric,
PRIMARY KEY(key, collection_id)
) PARTITION BY LIST (collection_id );
CREATE TABLE collections_list_0
PARTITION OF collections_list (key, ts, collection_id, value)
FOR VALUES IN ( 0 );
CREATE TABLE collection_users
(used_id integer, collection_id integer, key bigint);
ALTER TABLE collection_users
ADD CONSTRAINT collection_users_fkey FOREIGN KEY (key, collection_id) REFERENCES collections_list (key, collection_id);
-- sanity check for postgres
INSERT INTO collections_list VALUES (1,now(), 0, '1.1');
INSERT INTO collection_users VALUES (1, 0, 1);
-- should fail because of fkey
INSERT INTO collection_users VALUES (1, 1000, 1);
SELECT create_distributed_table('collections_list', 'key');
SELECT create_distributed_table('collection_users', 'key');
-- some more tests to ensure fkeys exists
5) The commands are COMMIT AND CHAIN and ROLLBACK AND CHAIN: I think we should consider adding the following or similar regression tests
CREATE TABLE test (x INTEGER, y int);
INSERT INTO test (x,y) SELECT i,i from generate_series(1, 1000) i;
SELECT create_distributed_table('test', 'x');
-- single shard queries with CHAIN
BEGIN;
UPDATE test SET y = 15 WHERE x = 1;
COMMIT AND CHAIN;
SELECT * FROM test WHERE x = 1;
COMMIT;
BEGIN;
UPDATE test SET y = 20 WHERE x = 1;
ROLLBACK AND CHAIN;
SELECT * FROM test WHERE x = 1;
COMMIT;
-- multi shard queries with CHAIN
BEGIN;
UPDATE test SET y = 25;
COMMIT AND CHAIN;
SELECT DISTINCT y FROM test;
COMMIT;
BEGIN;
UPDATE test SET y = 30;
ROLLBACK AND CHAIN;
SELECT DISTINCT y FROM test;
COMMIT;
6) Allow RECORD and RECORD[] to be used as column types in a query's column definition list for a table function that is declared to return RECORD: Well, we error out, but I don't think this is easy to fix.
select * from test where row(y) = ANY( ARRAY[row(3)]);
ERROR: input of anonymous composite types is not implemented
CONTEXT: while executing command on 10.192.0.200:5432
7) Add CREATE ACCESS METHOD command to create new table types
We error out on CopyLocalDataIntoShards(). I'd prefer to have an explicit check on create_distributed_table(). Now, we're relying on some unrelated function to error out. As you've noted above
8) Add support function capability to improve optimizer estimates, inlining, and indexing for functions: I couldn't find any examples. But, that'd be nice to have one example where we use such functions both on the coordinator plan and on the worker plan, just a sanity check that nothing brakes.
9) Allow pg_stat_statements_reset() to be more granular
Once merged to citus-enterprise, make sure that it doesn't break citus_stat_statements.
10) GENERATED ALWAYS AS [IMPORTANT]: https://github.com/citusdata/citus/pull/2844/files#r314716706
Can we error out on create_distributed_table ()
. Because it is not a nice UX where we error something really weird coming from the worker cannot use column reference in DEFAULT expression
.
REINDEX
is being propagated at all, will look intoREAD ONLY
test might be too flimsycreate_distributed_table
to error out sooner with a clear messagetl;dr outstanding: 1,8,9 (2 & 6 I believe to be non-issues)
- I don't think REINDEX is being propagated at all, will look into
We should do that, users are excited about REINDEX, so this is probably one of the first things that people try out. As usual, I'm OK with erroring out for now, and implement it later.
- SET plan_cache_mode TO 'force_generic_plan';
I think differently here. Citus knows how to handle both generic and custom plans. So, we shouldn't be erroring out. Some pointers in the code for you to read about these concepts 1, 2
I think it is OK to leave this for now, but before 9.0, I think we should support that.
- Optimizer support functions
I can look into that, I'm curious to learn about them anyway.
- Don't understand
On Citus enterprise, there is a view called citus_stat_statements
, which relies on pg_stat_statements. Given that there are some changes on the pg_stat_statements_reset ()
, we should also ensure that citus_stat_statements
is not broken with pg 12. This could be as simple as adding a new regression test on citus-enterprise
once the changes are merged to citus
. My initial understanding is that there is nothing dangerous about the changes, but still should be verified.
Closing this issue, 8/9 are lacking their own issues, but no longer need a centralized tracking issue
PRs:
GENERATE ALWAYS AS STORED
COPY FROM WHERE
Test new features:
WHERE
clause inCOPY FROM
COPY
ignoring theWHERE
. Need to updateCheckCopyPermissions
. AddwhereClause
tocopyState
fromBeginCopyFrom
REINDEX CONCURRENTLY
CONCURRENTLY
it'd blockREINDEX CONCURRENTLY
COMMIT AND CHAIN
/ROLLBACK AND CHAIN
ROLLBACK
variant seems to not work in general, even without citus installed at allVACUUM TRUNCATE INDEX_CLEANUP SKIP_LOCKED
commands/vacuum.c