citusdata / citus_docs

Documentation for Citus. Distributed PostgreSQL as an extension.
Creative Commons Attribution 4.0 International
58 stars 59 forks source link

How to create view containing distributed query #227

Closed begriffs closed 6 years ago

mtuncer commented 7 years ago

Using views with Citus

Views provide a level of abstraction to a query. Once could make a frequently used query to be a view to avoid repeating the same expression. Its usage is similar to use of common table expressions or subqueries, except that you do not have to define a view for each query. They are used as if they are just a different table.

Why are we implementing it?

Views behave just like subqueries. We do support subqueries if the overall query is router plannable. Therefore supporting views is a natural expansion of router planner. Supporting views in some form extends or SQL coverage.

What are the typical use cases?

User has a filter or aggregate query frequently used as a part of other queries. Abstracting this partial query could cut down complexity of queries and may ease creation and maintenance of them.

Postgresql replaces view references in the query with the actual definitions of views. Since Citus has partial support for subqueries, view support is also limited.

A typical use case is query involving view being router plannable. This way whole query could be evaluated via single worker query without any restrictions on view definition. Once can select from a view, have set operations or joins using views, perform aggregations on views as long as overall query is router plannable. Filter clauses on actual relations could be inside view definition, or can be a clause on view in outer query.

If overall query is not router plannable, then view support diminishes a bit. It is evaluated same as a subquery would be. We rely on Postgresql's ability to flatted whole query with view into a single level query to be able to support it.

If a view contains an aggregate or group by then Postgresql would keep the query as it is an our planner would reject it saying subqueries are not supported. Please see Gotchas section.

Materialized views on distributed tables could be created if view query is router plannable. However, they can not be used in distributed queries afterwards. User can only use them if they are local tables.

Communication goals

I think we should focus on simplifying query writing, and overall query being router plannable.

Good locations for content in docs structure

Do not have strong opinion here.

How does this work?

We do not do additional view specific work, we rely on postgresql to replace view references with its subquery definition, and evaluate the query afterwards. Subquery processing rules apply once a view is received by our planner.

Example sql

Creating a view on a distributed table is the same as creating view on a regular table. Like

create view tenant_25_orders as select * from orders where tenant_id = 25;

Later you can use this view like

select count(*) from tenant_25_orders;

You can join a view and a table

select e.* from tenant_25_orders t25o join events e 
    on ( t25o.tenant_id = e.tenant_id and t25o.order_id = e.order_id)

You can join between views

create view high_priority_events as select * from events where priority = 'HIGH';

select hpe.* from tenant_25_orders  t25o join high_priority_events hpe
    on (t25o.tenant_id = hpe.tenant_id);

A view can have aggregates and joins inside as long as it is a router plannable query.

create view t25_daily_web_order_count as
    select orders.tenant_id, order_date, count(*) from orders join events on (orders.tenant_id = events.tenant_id)
        where orders.tenant_id = 25 and  orders.order_type = 'WEB'
        group by orders.tenant_id, order_date order by count(*) desc limit 10;

select * from t25_daily_web_order_count ;

Notice that

  1. there is an implicit (in the view) or explicit (in the query) filter on partition column
  2. if there is a join, it is on partition column on actual tables

Gotchas

Full view support is limited to router plannable queries. If some reason user changes some filter and the query no longer is router plannable, Citus throws some subquery errors.

There is known issue with queries with views involving single tables. Citus supports subqueries if it is a funnel query on a single table. However, if user writes the same query using a view, Citus errors out due to a bug in the code. It should be fixed in later releases.

There is a known issue with views containing group by. We rely on postgresql's ability to flatten the query as possible, having a group by inside the view prevents such optimization, this also prevents identification of router plannable queries. If a view contains group by, it is advised that it should also have a filter on partition column.

Link to relevant commits and regression tests if applicable

Commit : citusdata/commit@77f8db6

Tests : multi_view.sql

begriffs commented 7 years ago

I'm having trouble finding views that break, actually. For instance:

CREATE TYPE prio AS ENUM ('low', 'med', 'high');
SELECT run_command_on_workers ($cmd$
  CREATE TYPE prio AS ENUM ('low', 'med', 'high');
$cmd$);

CREATE TABLE todos (
    id serial,
    project_id int,
    priority prio DEFAULT 'med',
    what text NOT NULL,
    PRIMARY KEY (id, project_id)
);

INSERT INTO todos (project_id, what)
SELECT random() * 4, 'usual thing'
FROM generate_series(1, 100);

INSERT INTO todos (project_id, what, priority)
SELECT random() * 4, 'important thing', 'high'
FROM generate_series(1, 10);

SELECT create_distributed_table ('todos', 'project_id');

-- now the views --------------------------------------

CREATE VIEW important_todos AS
SELECT *
FROM todos
WHERE priority >= 'high';

CREATE VIEW prio_count AS
SELECT priority, count(*)
FROM todos
GROUP BY priority;

The first runs fine despite not being router executable:

❯ explain select * from important_todos;
┌──────────────────────────────────────────────────────────────────────────────────┐
│                                    QUERY PLAN                                    │
├──────────────────────────────────────────────────────────────────────────────────┤
│ Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0)                  │
│   Task Count: 32                                                                 │
│   Tasks Shown: One of 32                                                         │
│   ->  Task                                                                       │
│         Node: host=localhost port=5433 dbname=postgres                           │
│         ->  Seq Scan on todos_102361 todos  (cost=0.00..24.12 rows=377 width=44) │
│               Filter: (priority >= 'high'::prio)                                 │
└──────────────────────────────────────────────────────────────────────────────────┘
(7 rows)

❯ select * from important_todos;
┌─────┬────────────┬──────────┬─────────────────┐
│ id  │ project_id │ priority │      what       │
├─────┼────────────┼──────────┼─────────────────┤
│ 104 │          4 │ high     │ important thing │
│ 108 │          0 │ high     │ important thing │
│ 101 │          3 │ high     │ important thing │
│ 103 │          3 │ high     │ important thing │
│ 105 │          3 │ high     │ important thing │
│ 109 │          3 │ high     │ important thing │
│ 102 │          2 │ high     │ important thing │
│ 106 │          2 │ high     │ important thing │
│ 107 │          2 │ high     │ important thing │
│ 110 │          2 │ high     │ important thing │
└─────┴────────────┴──────────┴─────────────────┘

The second one is the only one that's causing an error, but I'm not sure it's the type of error we were anticipating:

❯ select * from prio_count ;
ERROR:  Unrecognized range table id 1
STATEMENT:  select * from prio_count ;
ERROR:  XX000: Unrecognized range table id 1
LOCATION:  NewTableId, multi_physical_planner.c:1547
sumedhpathak commented 7 years ago

Do we expect views to work with the real-time executor? @begriffs based on your investigation, did you document only the workaround because most examples just worked?