user/doc: Expand "Troubleshooting" section to "Monitor and Troubleshoot Materialize"

chuck-alt-delete commented 1 year ago

Documentation request

We have some great monitoring queries here in the demos repo for datadog and prometheus. Those introspection queries should be documented in a "monitoring materialize" section of the docs. It makes sense to me to combine monitoring and troubleshooting onto a single page, where we first help users with basic monitoring and then move to practical troubleshooting (eg are you in the right cluster? is there an index?) and then advanced troubleshooting (inspecting timely dataflow workers).

Here are a few other queries from my personal collection in addition to the ones:

Computational progress. Approximate the lag of materialized views and indexes. You can use this to, for example, check whether a new replica has caught up to an older replica when doing a zero downtime scale up/down.

SELECT
    f.time AS progress,
    mv.name AS materialized_view,
    idx.name AS index,
    mz_now()::text::numeric - f.time::text::numeric AS approx_lag_ms,
    c.name||'.'||r.name AS cluster_replica,
    f.export_id
FROM mz_internal.mz_cluster_replica_frontiers f
    JOIN mz_catalog.mz_cluster_replicas r
        ON r.id = f.replica_id
    JOIN mz_catalog.mz_clusters c
        ON r.cluster_id = c.id
    LEFT JOIN mz_catalog.mz_materialized_views mv
        ON mv.id = f.export_id
    LEFT JOIN mz_catalog.mz_indexes idx
        ON idx.id = f.export_id
WHERE
    (idx.name IS NULL OR idx.name NOT LIKE 'mz_%')
    AND c.name NOT LIKE 'mz_%';

A more human friendly source status history:

SELECT
    occurred_at,
    S.name AS source_name,
    status,
    C.name AS cluster_name,
    type,
    envelope_type,
    error,
    details
FROM
    mz_internal.mz_source_status_history H
    JOIN mz_catalog.mz_sources S
        ON (H.source_id = S.id)
    JOIN mz_catalog.mz_clusters C
        ON S.cluster_id = C.id
ORDER BY occurred_at DESC;

Permissions queries from here for RBAC may also be relevant in this section.

Affected Pages

No response

Related work

No response

chuck-alt-delete commented 1 year ago

Add to the list: if a query just won’t quit, kill it:

Many drivers implement cancel. But users can always run select pg_backend_pid() then select pg_cancel_backend() in a second connection if they super need it .

chuck-alt-delete commented 1 year ago

More to add to the list: “We should have some quickly accessible playbooks/checklists for people to run through. Like “why is my query slow?”:

are you doing select * from giant_thing limit 1? Here’s some guidance for what you should do instead
Are you in the right cluster?
Have you created any indexes? (Link to when to use indexes and mvs blog)
Is your cluster OOMing?
Are your transformations as efficient as they can be? (Link to optimization doc, which itself should be updated to include things we’ve learned about avoiding left joins when possible, using Union all rather than just Union when possible, etc)
Etc And similar playbooks for any number of the common/basic things people run into.”

chuck-alt-delete commented 1 year ago

OOM monitoring

https://github.com/MaterializeInc/demos/issues/86

chuck-alt-delete commented 11 months ago

Adding to the list of "why is my query slow?"

should you consider running in serializable transaction isolation to trade off freshness for response latency?
- Then, are you sure you are running in serializable (how to check statement log to verify)
Are you wrapping things in unfinished transactions without knowing it (eg if using SQLAlchemy or another ORM)? (how to check the statement log to verify)

uce commented 10 months ago

Feedback from @RobinClowers in Slack.

MaterializeInc / materialize