MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Validating optimizer changes against production catalogs #17883

Open teskje opened 1 year ago

teskje commented 1 year ago

To further ensure, before a release, that there are no regressions in the plans produced by the optimizer, it would be helpful to have a way to run it against the DDL statements stored in production catalogs.

The way this could work is by having a framework that extracts the DDL statements from a production catalog, replans them with the new optimizer, and compares the resulting plans with those produced by the current optimizer (see the sketch below).

The validation should at least verify that the new optimizer can handle the DDL without errors. Beyond that, we can also verify that the resulting plans are identical or have changed only in expected ways.
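As a rough illustration, here is a minimal sketch of such a harness in Rust, using tokio-postgres to talk over pgwire to two Materialize instances, one running the released optimizer and one running the candidate build. The ports, the example DDL, and the exact `EXPLAIN OPTIMIZED PLAN FOR ...` form used for replanning are assumptions for illustration, not existing tooling:

```rust
use tokio_postgres::{Client, NoTls};

/// Ask the server to plan (but not execute) a DDL statement and return the
/// rendered optimized plan. The EXPLAIN form used here is an assumption.
async fn optimized_plan(client: &Client, ddl: &str) -> Result<String, tokio_postgres::Error> {
    let sql = format!("EXPLAIN OPTIMIZED PLAN FOR {ddl}");
    let rows = client.query(sql.as_str(), &[]).await?;
    // EXPLAIN returns one text column per output line; stitch them together.
    Ok(rows
        .iter()
        .map(|row| row.get::<_, String>(0))
        .collect::<Vec<_>>()
        .join("\n"))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical endpoints: `old` runs the current optimizer, `new` the candidate.
    let (old, old_conn) =
        tokio_postgres::connect("host=localhost port=6875 user=materialize", NoTls).await?;
    let (new, new_conn) =
        tokio_postgres::connect("host=localhost port=6876 user=materialize", NoTls).await?;
    tokio::spawn(old_conn);
    tokio::spawn(new_conn);

    // In a real harness, these would be extracted from a production catalog.
    let ddls = ["CREATE MATERIALIZED VIEW v AS SELECT a, count(*) FROM t GROUP BY a"];

    for ddl in ddls {
        let plan_old = optimized_plan(&old, ddl).await?;
        let plan_new = optimized_plan(&new, ddl).await?;
        if plan_old != plan_new {
            println!("plan changed for: {ddl}\n--- old ---\n{plan_old}\n--- new ---\n{plan_new}");
        }
    }
    Ok(())
}
```

Diffing rendered plan text is crude but cheap; a real framework would presumably compare structured plan representations, so that expected changes can be classified and filtered out.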

philip-stoev commented 1 year ago

https://resources.snowflake.com/report/snowtrail-testing-with-production-series-on-a-cloud-database

ggevay commented 1 year ago

Some ideas for specific automated checks (from https://materializeinc.slack.com/archives/C04QHS93G0Z/p1678703336777329?thread_ts=1678646973.384439&cid=C04QHS93G0Z):

mgree commented 1 year ago

Automated checks could also use a (hypothetical, for now) explicit cost model on MIR to check for regressions.
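To make that concrete, here is a toy sketch: a stand-in `Mir` tree (the real `MirRelationExpr` has many more variants) with an entirely made-up per-operator cost function, used to flag plans whose estimated cost grew beyond a tolerance. Everything here, including the weights and the threshold, is hypothetical:

```rust
/// A toy stand-in for MIR; the real `MirRelationExpr` is much richer.
enum Mir {
    Get,
    Filter { input: Box<Mir> },
    Reduce { input: Box<Mir> },
    Join { inputs: Vec<Mir> },
}

/// A made-up cost model: each operator contributes a fixed cost plus the
/// cost of its inputs, with reductions and joins weighted more heavily.
fn cost(expr: &Mir) -> f64 {
    match expr {
        Mir::Get => 1.0,
        Mir::Filter { input } => 1.0 + cost(input),
        Mir::Reduce { input } => 5.0 + cost(input),
        Mir::Join { inputs } => 10.0 * inputs.len() as f64 + inputs.iter().map(cost).sum::<f64>(),
    }
}

/// Flag a regression if the new plan is estimated to be noticeably more
/// expensive than the old one; the tolerance absorbs noise in the model.
fn is_regression(old_plan: &Mir, new_plan: &Mir) -> bool {
    const TOLERANCE: f64 = 1.10;
    cost(new_plan) > TOLERANCE * cost(old_plan)
}

fn main() {
    let old = Mir::Filter { input: Box::new(Mir::Get) };
    let new = Mir::Join { inputs: vec![Mir::Get, Mir::Get] };
    println!("regression: {}", is_regression(&old, &new)); // prints: regression: true
}
```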

philip-stoev commented 1 year ago

Proposed implementation https://www.notion.so/materialize/1c36cdc61a224948b623318a02db3bb8

ggevay commented 1 year ago

This was again relevant for incident-55: making sure that the incident was not caused by plan changes was quite time-consuming, because I had to dig through the audit log to verify that the same indexes existed when the dataflows were originally created. It would be great to automate this.
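For illustration, a check along these lines could query the audit log for index creations and drops up to the dataflow's creation time, and replay them to reconstruct which indexes existed at that moment. This is only a sketch: I'm assuming that `mz_catalog.mz_audit_events` (with its `event_type`, `object_type`, `details`, and `occurred_at` columns) carries enough detail to do this, which would need verifying:

```rust
use std::time::SystemTime;
use tokio_postgres::{Client, Row};

/// Fetch all index create/drop events up to the time a dataflow was created,
/// so the set of indexes that existed at that moment can be reconstructed.
/// Hypothetical sketch: assumes the audit log's `details` column identifies
/// each index and its cluster well enough to replay these events.
async fn index_events_up_to(
    client: &Client,
    dataflow_created_at: SystemTime,
) -> Result<Vec<Row>, tokio_postgres::Error> {
    client
        .query(
            "SELECT event_type, details, occurred_at \
             FROM mz_catalog.mz_audit_events \
             WHERE object_type = 'index' AND occurred_at <= $1 \
             ORDER BY occurred_at",
            &[&dataflow_created_at],
        )
        .await
}
```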

ggevay commented 1 month ago

In the meantime, various things have started to go in this direction:

We should decide which of these we'd like to develop further to perform the optimizer checks proposed here.