dgraph-io / dgraph-docs

A native GraphQL Database with a graph backend
https://dgraph.io/docs

Add documentation of request and latency metrics, backups to docs #496

Closed damonfeldman closed 3 months ago

damonfeldman commented 1 year ago

Dgraph metrics can be used to count successful and failed queries and mutations, and to compute average latencies, but this is not documented. The relevant metric names also contain "latency", which is confusing because the same metrics are useful for counting operations.

Note that the most reliable list of available metrics is the response from the /prometheus and similar endpoints on a running alpha, which return all metric names (formatted for Prometheus) along with short descriptions. Dgraph also documents key metrics at https://dgraph.io/docs/deploy/metrics/#activity-metrics, and the latency/count metrics should be documented there.

The specific metrics to know about are:

dgraph_grpc_io_client_roundtrip_latency_count
dgraph_grpc_io_client_roundtrip_latency_sum

Together, these can be used to compute the average latency of all requests (the sum divided by the count). Both also carry "method" and "status" labels that distinguish query from mutation and success from error, respectively, so they can be used to count queries and errors as well as to measure latency.
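As a rough illustration of the sum/count arithmetic, here is a minimal Python sketch. It assumes you have scraped both series twice (start and end of a window) and passes the raw numbers in by hand; the function name `average_latency` is hypothetical, and the result is in whatever unit the `_sum` series reports.

```python
def average_latency(sum_start, count_start, sum_end, count_end):
    """Mean latency over a window, from two scrapes of the _sum and _count series.

    Both series are cumulative, so the window average is the change in the
    sum divided by the change in the count.
    """
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No requests in the window, or the counter was reset (e.g. a restart).
        return None
    return (sum_end - sum_start) / delta_count


# Example with made-up numbers: 20 requests added 60 units of latency.
print(average_latency(100.0, 10, 160.0, 30))
```

In PromQL the same idea is usually written as the rate of the `_sum` series divided by the rate of the `_count` series over the same range.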

To compute query and error rates, use dgraph_grpc_io_client_roundtrip_latency_count alone. Despite "latency" in its name, it is a categorized count of all operations, so it can be used to count operations in general.
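A sketch of how the count metric could be tallied by its "method" and "status" labels from a metrics endpoint response is below. The sample text and its label values (`api.Dgraph/Query`, `OK`, `ERROR`) are assumptions made up for illustration, not values confirmed from a running alpha; check the real output before relying on them. The helper names are hypothetical too.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; label values are illustrative assumptions.
SAMPLE = """\
# HELP dgraph_grpc_io_client_roundtrip_latency example help text
dgraph_grpc_io_client_roundtrip_latency_count{method="api.Dgraph/Query",status="OK"} 120
dgraph_grpc_io_client_roundtrip_latency_count{method="api.Dgraph/Query",status="ERROR"} 5
dgraph_grpc_io_client_roundtrip_latency_count{method="api.Dgraph/Mutate",status="OK"} 40
dgraph_grpc_io_client_roundtrip_latency_sum{method="api.Dgraph/Query",status="OK"} 960.5
"""

METRIC = "dgraph_grpc_io_client_roundtrip_latency_count"
LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def count_by_method_and_status(text):
    """Sum the count metric per (method, status) pair from exposition text."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(1) != METRIC:
            continue  # skip comments, other metrics, and unlabeled lines
        labels = dict(LABEL_RE.findall(match.group(2)))
        totals[(labels.get("method"), labels.get("status"))] += float(match.group(3))
    return dict(totals)


def error_ratio(totals, method, ok_status="OK"):
    """Fraction of operations for `method` whose status is not `ok_status`."""
    total = sum(v for (m, _), v in totals.items() if m == method)
    errors = sum(v for (m, s), v in totals.items() if m == method and s != ok_status)
    return errors / total if total else None


totals = count_by_method_and_status(SAMPLE)
print(totals)
print(error_ratio(totals, "api.Dgraph/Query"))
```

Because the counts are cumulative, a rate over time still needs the difference between two scrapes; this sketch only shows the grouping by label.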

--

Also, dgraph_num_backups_total should be used to monitor when backups have happened (typically via PromQL such as rate(dgraph_num_backups_total[5m]) or similar), so that any slow or unusual activity can be correlated with backup activity where relevant.
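For correlating slow periods with backups outside of PromQL, a small sketch along these lines may help. It assumes you already have `(timestamp, value)` samples of dgraph_num_backups_total from your monitoring system; the function name and the sample data are hypothetical.

```python
def backup_windows(samples):
    """Return (start, end) timestamp pairs between which the backup counter rose.

    `samples` is a time-ordered list of (timestamp, counter_value) tuples.
    A drop in the value (counter reset) is ignored rather than reported.
    """
    windows = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 > v0:
            windows.append((t0, t1))
    return windows


# Made-up samples taken every 300 seconds; one backup completes between 300 and 600.
samples = [(0, 3), (300, 3), (600, 4), (900, 4)]
print(backup_windows(samples))
```

Any latency spike that falls inside one of the returned windows is a candidate for being backup-related.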

github-actions[bot] commented 3 months ago

This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.