MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

user/doc: Expand "Troubleshooting" section to "Monitor and Troubleshoot Materialize" #20719

Open chuck-alt-delete opened 1 year ago

chuck-alt-delete commented 1 year ago

Documentation request

We have some great monitoring queries here in the demos repo for datadog and prometheus. Those introspection queries should be documented in a "monitoring materialize" section of the docs. It makes sense to me to combine monitoring and troubleshooting onto a single page, where we first help users with basic monitoring and then move to practical troubleshooting (eg are you in the right cluster? is there an index?) and then advanced troubleshooting (inspecting timely dataflow workers).

Here are a few other queries from my personal collection in addition to the ones:

Computational progress. Approximate the lag of materialized views and indexes. You can use this to, for example, check whether a new replica has caught up to an older replica when doing a zero downtime scale up/down.

SELECT
    f.time AS progress,
    mv.name AS materialized_view,
    idx.name AS index,
    mz_now()::text::numeric - f.time::text::numeric AS approx_lag_ms,
    c.name||'.'||r.name AS cluster_replica,
    f.export_id
FROM mz_internal.mz_cluster_replica_frontiers f
    JOIN mz_catalog.mz_cluster_replicas r
        ON r.id = f.replica_id
    JOIN mz_catalog.mz_clusters c
        ON r.cluster_id = c.id
    LEFT JOIN mz_catalog.mz_materialized_views mv
        ON mv.id = f.export_id
    LEFT JOIN mz_catalog.mz_indexes idx
        ON idx.id = f.export_id
WHERE
    (idx.name IS NULL OR idx.name NOT LIKE 'mz_%')
    AND c.name NOT LIKE 'mz_%';

A more human friendly source status history:

SELECT
    occurred_at,
    S.name AS source_name,
    status,
    C.name AS cluster_name,
    type,
    envelope_type,
    error,
    details
FROM
    mz_internal.mz_source_status_history H
    JOIN mz_catalog.mz_sources S
        ON (H.source_id = S.id)
    JOIN mz_catalog.mz_clusters C
        ON S.cluster_id = C.id
ORDER BY occurred_at DESC;

Permissions queries from here for RBAC may also be relevant in this section.

Affected Pages

No response

Related work

No response

chuck-alt-delete commented 1 year ago

Add to the list: if a query just won’t quit, kill it:

Many drivers implement cancel. But users can always run select pg_backend_pid() then select pg_cancel_backend() in a second connection if they super need it .

chuck-alt-delete commented 1 year ago

More to add to the list: “We should have some quickly accessible playbooks/checklists for people to run through. Like “why is my query slow?”:

chuck-alt-delete commented 1 year ago

OOM monitoring

chuck-alt-delete commented 11 months ago

Adding to the list of "why is my query slow?"

uce commented 10 months ago

Feedback from @RobinClowers in Slack.