crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
https://cratedb.com/product
Apache License 2.0
4.12k stars 566 forks source link

Support for Materialized Views #10661

Open proddata opened 4 years ago

proddata commented 4 years ago

Problem Statement

We have an application with a dashboard component and an ingestion layer.

Possible solutions

Considered alternatives

This works, but is more difficult to use:

Other downsides:

To address the last point, one can use a temporary table, insert into and swap tables, but that's even more tedious to use and shares the other downsides.

Discussion remarks

Options for refresh:

Resources

Prerequisite

proddata commented 1 year ago

The described workaround (CREATE TABLE / INSERT INTO) has some flaws:

A workaround for "simple" materialization would be:

/* CREATE MATERIALIZED VIEW my_table_rollup AS
SELECT
    device_id,
    date_trunc('week',ts) week,
    SUM(val) avg_val
FROM my_table
GROUP BY 1, 2;
*/
CREATE TABLE my_table_rollup AS
SELECT
    device_id,
    date_trunc('week',ts) week,
    SUM(val) avg_val
FROM my_table
GROUP BY 1, 2;

and then update via

/* REFRESH MATERIALIZED VIEW my_table_rollup; */
CREATE TABLE temp_my_table_rollup AS
SELECT
    device_id,
    date_trunc('week',ts) week,
    AVG(val) avg_val
FROM my_table
GROUP BY 1, 2;
REFRESH TABLE temp_my_table_rollup;
ALTER CLUSTER SWAP TABLE temp_my_table_rollup TO my_table_rollup WITH ("drop_source" = true);

While this might not look as too much effort in the first place, managing especially the update part is rather cumbersome, especially when one has to deal with updating 10,20,30, ... "views" with some sort of cron job.

Replacing view definitions, would not only need the change in the "view" (table), but also the mechanism that updates it accordingly. Especially in organisation, where this might not be handled by the same person, this becomes tedious.

Operations like snapshots or other blocking operations that would not allow dropping tables would break this

proddata commented 1 year ago

a better approach might be to use a partitioned table with a view on top of it, which gets updated only after the insert into a new partition was successful.

for a view like

CREATE VIEW my_table_rollup AS
SELECT
    device_id,
    date_trunc('week',ts) week,
    SUM(val) avg_val
FROM my_table
GROUP BY 1, 2;

one could use the following approach:

  1. Gather the view definition of the original view using the http endpoint with 127.0.0.1:4200/_sql?types to recover column names and types:

    SELECT * FROM my_table_rollup LIMIT 0;
  2. Create a partitioned table with an update field, which is used for partitioning:

    CREATE TABLE my_table_rollup_materialization (
    update TIMESTAMP,
    device_id TEXT,
    week TIMESTAMP,
    avg_val DOUBLE
    ) PARTITIONED BY(update);
  3. create timestamp e.g. ts_now

  4. Update data from the original view:

    INSERT INTO my_table_rollup_materialization SELECT <ts_now>, device_id,  week, avg_val FROM my_table;
  5. refresh the table

    REFRESH TABLE my_table_rollup_materialization;
  6. Update/create the "materialized" view, only if the insert was successful:

    CREATE OR REPLACE VIEW my_table_rollup_materialized AS 
    SELECT
    device_id,
    week,
    avg_val
    FROM my_table_rollup_materialization
    WHERE update = <ts_now>;
  7. Delete all old records from the materialization table:

    DELETE FROM my_table_rollup_materialization WHERE update < <ts_now>

... of course it would be much nicer to have syntactic sugar for this within CrateDB, as this approach also doesn't really work with user privileges

ckurze commented 1 year ago

Another option would be the implementation of something like a MERGE statement, i.e. recalculate the new data in the source table and merge it into a target table:

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sqlrf/MERGE.html

MERGE INTO bonuses D
   USING (SELECT employee_id, salary, department_id FROM hr.employees
   WHERE department_id = 80) S
   ON (D.employee_id = S.employee_id)
   WHEN MATCHED THEN UPDATE SET D.bonus = D.bonus + S.salary*.01
     DELETE WHERE (S.salary > 8000)
   WHEN NOT MATCHED THEN INSERT (D.employee_id, D.bonus)
     VALUES (S.employee_id, S.salary*.01)
     WHERE (S.salary <= 8000);
proddata commented 1 year ago

Another option would be the implementation of something like a MERGE statement, i.e. recalculate the new data in the source table and merge it into a target table:

Is not really an alternative to materialized views, but rather to INSERT INTO ... SELECT ... ON CONFLICT SET ... which already cover a lot of the MERGE use cases.

proddata commented 1 year ago

Draft proposal following PostgreSQL beahviour ⚠️ Feel free to edit ⚠️ No decision to implement

The propose approach largely follows the PostgreSQL implementation of materialized views.

CREATE MATERIALIZED VIEW is similar to CREATE TABLE AS, except that it also remembers the query used to initialize the view, so that it can be refreshed later upon demand. A materialized view has many of the same properties as a table, but there is no support for temporary materialized views. Criteria

CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
    [ (column_name [, ...] ) ]
    [ WITH ( storage_parameter [= value] [, ... ] ) ]
    AS query
    [ WITH [ NO ] DATA ]

image

hlcianfagna commented 1 year ago

Related with https://github.com/crate/crate/issues/10731 and https://github.com/crate/crate/issues/12218

ckurze commented 1 year ago

Another option would be the implementation of something like a MERGE statement, i.e. recalculate the new data in the source table and merge it into a target table:

Is not really an alternative to materialized views, but rather to INSERT INTO ... SELECT ... ON CONFLICT SET ... which already cover a lot of the MERGE use cases.

Always something that could be positioned as "on-demand materialized view" and gives a good amount of freedom what do with the data in the "materialized view".

robd003 commented 10 months ago

Being able to define an automatic refresh interval (once every 15 mins / once every hour) would be really helpful. Being able to tweak the refresh rate w/o having to destroy & recreate the materialized view would also be great.

helenap commented 7 months ago

I could really use materialized views for complicated computations that are used repeatedly in other queries. Could you please make this a priority in 5.8?

hlcianfagna commented 7 months ago

Thank you for the feedback @helenap , while waiting for this to be implemented I would suggest either using the new SQL job scheduler in CrateDB Cloud or dbt's table/incremental materializations.