Open proddata opened 4 years ago
The described workaround (CREATE TABLE / INSERT INTO
) has some flaws:
INSERT INTO
A workaround for "simple" materialization would be:
/* CREATE MATERIALIZED VIEW my_table_rollup AS
SELECT
device_id,
date_trunc('week',ts) week,
SUM(val) avg_val
FROM my_table
GROUP BY 1, 2;
*/
CREATE TABLE my_table_rollup AS
SELECT
device_id,
date_trunc('week',ts) week,
SUM(val) avg_val
FROM my_table
GROUP BY 1, 2;
and then update via
/* REFRESH MATERIALIZED VIEW my_table_rollup; */
CREATE TABLE temp_my_table_rollup AS
SELECT
device_id,
date_trunc('week',ts) week,
AVG(val) avg_val
FROM my_table
GROUP BY 1, 2;
REFRESH TABLE temp_my_table_rollup;
ALTER CLUSTER SWAP TABLE temp_my_table_rollup TO my_table_rollup WITH ("drop_source" = true);
While this might not look as too much effort in the first place, managing especially the update part is rather cumbersome, especially when one has to deal with updating 10,20,30, ... "views" with some sort of cron job.
Replacing view definitions, would not only need the change in the "view" (table), but also the mechanism that updates it accordingly. Especially in organisation, where this might not be handled by the same person, this becomes tedious.
Operations like snapshots or other blocking operations that would not allow dropping tables would break this
a better approach might be to use a partitioned table with a view on top of it, which gets updated only after the insert into a new partition was successful.
for a view like
CREATE VIEW my_table_rollup AS
SELECT
device_id,
date_trunc('week',ts) week,
SUM(val) avg_val
FROM my_table
GROUP BY 1, 2;
one could use the following approach:
Gather the view definition of the original view using the http endpoint with 127.0.0.1:4200/_sql?types
to recover column names and types:
SELECT * FROM my_table_rollup LIMIT 0;
Create a partitioned table with an update
field, which is used for partitioning:
CREATE TABLE my_table_rollup_materialization (
update TIMESTAMP,
device_id TEXT,
week TIMESTAMP,
avg_val DOUBLE
) PARTITIONED BY(update);
create timestamp e.g. ts_now
Update data from the original view:
INSERT INTO my_table_rollup_materialization SELECT <ts_now>, device_id, week, avg_val FROM my_table;
refresh the table
REFRESH TABLE my_table_rollup_materialization;
Update/create the "materialized" view, only if the insert was successful:
CREATE OR REPLACE VIEW my_table_rollup_materialized AS
SELECT
device_id,
week,
avg_val
FROM my_table_rollup_materialization
WHERE update = <ts_now>;
Delete all old records from the materialization table:
DELETE FROM my_table_rollup_materialization WHERE update < <ts_now>
... of course it would be much nicer to have syntactic sugar for this within CrateDB, as this approach also doesn't really work with user privileges
Another option would be the implementation of something like a MERGE statement, i.e. recalculate the new data in the source table and merge it into a target table:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sqlrf/MERGE.html
MERGE INTO bonuses D
USING (SELECT employee_id, salary, department_id FROM hr.employees
WHERE department_id = 80) S
ON (D.employee_id = S.employee_id)
WHEN MATCHED THEN UPDATE SET D.bonus = D.bonus + S.salary*.01
DELETE WHERE (S.salary > 8000)
WHEN NOT MATCHED THEN INSERT (D.employee_id, D.bonus)
VALUES (S.employee_id, S.salary*.01)
WHERE (S.salary <= 8000);
Another option would be the implementation of something like a MERGE statement, i.e. recalculate the new data in the source table and merge it into a target table:
Is not really an alternative to materialized views, but rather to INSERT INTO ... SELECT ... ON CONFLICT SET ...
which already cover a lot of the MERGE
use cases.
Draft proposal following PostgreSQL beahviour ⚠️ Feel free to edit ⚠️ No decision to implement
The propose approach largely follows the PostgreSQL implementation of materialized views.
CREATE MATERIALIZED VIEW
is similar toCREATE TABLE AS
, except that it also remembers the query used to initialize the view, so that it can be refreshed later upon demand. A materialized view has many of the same properties as a table, but there is no support for temporary materialized views. Criteria
CREATE MATERIALIZED VIEW my_mat_view AS SELECT * FROM my_table;
SELECT
query is executed and used to populate the view right after the view is created.REFRESH MATERIALIZED VIEW my_mat_view;
CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
[ (column_name [, ...] ) ]
[ WITH ( storage_parameter [= value] [, ... ] ) ]
AS query
[ WITH [ NO ] DATA ]
Another option would be the implementation of something like a MERGE statement, i.e. recalculate the new data in the source table and merge it into a target table:
Is not really an alternative to materialized views, but rather to
INSERT INTO ... SELECT ... ON CONFLICT SET ...
which already cover a lot of theMERGE
use cases.
Always something that could be positioned as "on-demand materialized view" and gives a good amount of freedom what do with the data in the "materialized view".
Being able to define an automatic refresh interval (once every 15 mins / once every hour) would be really helpful. Being able to tweak the refresh rate w/o having to destroy & recreate the materialized view would also be great.
I could really use materialized views for complicated computations that are used repeatedly in other queries. Could you please make this a priority in 5.8?
Thank you for the feedback @helenap , while waiting for this to be implemented I would suggest either using the new SQL job scheduler in CrateDB Cloud or dbt's table/incremental materializations.
Problem Statement
We have an application with a dashboard component and an ingestion layer.
Possible solutions
Considered alternatives
Optimizing the query, but there is nothing left to squeeze out. It processes billions of records after all
Using a manual variant of a materialized view:
insert into <tbl> (...) select (...) on conflict (...) do update set ...
This works, but is more difficult to use:
set
assignmentson conflict
only works on primary key columns, not on arbitrary columns, so you have to define these columns as primary key.Other downsides:
To address the last point, one can use a temporary table, insert into and swap tables, but that's even more tedious to use and shares the other downsides.
Discussion remarks
Options for refresh:
Resources
Prerequisite