Dgraph metrics can count successful and failed queries and mutations, and can be used to compute average latencies, but this is not documented, and the metric names all contain "latency", which is confusing.
Note that the most reliable list of available metrics is the response from the /prometheus and similar endpoints on a running alpha, which returns every metric name (formatted for Prometheus) along with a short description. Dgraph also documents key metrics at https://dgraph.io/docs/deploy/metrics/#activity-metrics, and the latency/count metrics should be documented there.
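As a rough sketch of how to pull that list out of a scrape, the Python below extracts metric names and their short descriptions from the `# HELP` lines of Prometheus text output. The sample text and its description string are made up for illustration; on a real cluster the input would be the body returned by the alpha's metrics endpoint.

```python
def metric_descriptions(exposition_text):
    """Map each metric name to the description on its '# HELP' line."""
    descriptions = {}
    prefix = "# HELP "
    for line in exposition_text.splitlines():
        if line.startswith(prefix):
            # Format: "# HELP <metric_name> <free-text description>"
            parts = line[len(prefix):].split(" ", 1)
            descriptions[parts[0]] = parts[1] if len(parts) > 1 else ""
    return descriptions


# Hypothetical sample scrape; the description text is invented for illustration.
SAMPLE = """\
# HELP dgraph_num_backups_total Total number of backups (illustrative text)
# TYPE dgraph_num_backups_total counter
dgraph_num_backups_total 3
"""

if __name__ == "__main__":
    for name, text in sorted(metric_descriptions(SAMPLE).items()):
        print(f"{name}: {text}")
```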
The specific metrics to know about are
Together, these can be used to compute the average latency of all requests. Both also carry a "method" and a "status" label, which distinguish query from mutation and success from error respectively, so they can be used to count queries and errors as well as to measure latency.
To compute query and error rates, use dgraph_grpc_io_client_roundtrip_latency_count only. Despite the "latency" in its name, it is a categorized count of all operations, so it can be used to count operations generally.
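To make the arithmetic concrete, here is a minimal Python sketch that reads Prometheus text output and derives a total operation count, an error count, and an average latency. Several things in it are assumptions rather than confirmed Dgraph behaviour: the companion series name ending in `_sum`, the label values `query`, `OK` and `ERROR`, and all of the sample numbers.

```python
import re

COUNT = "dgraph_grpc_io_client_roundtrip_latency_count"
# Assumption: a companion "_sum" series holds the summed latencies.
SUM = "dgraph_grpc_io_client_roundtrip_latency_sum"

# Matches: name{label="value",...} value [optional timestamp]
SAMPLE_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)(?:\s+-?\d+)?$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return {metric_name: {(method, status): summed value}}."""
    series = {}
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, unlabelled samples
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels))
        key = (labels.get("method", ""), labels.get("status", ""))
        bucket = series.setdefault(name, {})
        # Series that differ only in other labels are added together.
        bucket[key] = bucket.get(key, 0.0) + float(value)
    return series


def summarize(text, ok_status="OK"):
    """Total operations, errored operations and mean latency for one scrape."""
    series = parse(text)
    counts = series.get(COUNT, {})
    sums = series.get(SUM, {})
    total = sum(counts.values())
    errors = sum(v for (_, status), v in counts.items() if status != ok_status)
    avg_latency = sum(sums.values()) / total if total else 0.0
    return {"total": total, "errors": errors, "avg_latency": avg_latency}


# Hypothetical scrape; label values and numbers are invented.
SAMPLE = """\
dgraph_grpc_io_client_roundtrip_latency_count{method="query",status="OK"} 90
dgraph_grpc_io_client_roundtrip_latency_count{method="query",status="ERROR"} 10
dgraph_grpc_io_client_roundtrip_latency_sum{method="query",status="OK"} 450
dgraph_grpc_io_client_roundtrip_latency_sum{method="query",status="ERROR"} 50
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

These values are cumulative since the process started, so a rate over some window comes from subtracting two scrapes and dividing by the seconds between them, which is what PromQL's rate() does on the server side.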
--
Also, dgraph_num_backups_total should be used to monitor when backups have happened (typically via PromQL such as rate(dgraph_num_backups_total[5m]) or similar), so that any slow or unusual activity can be correlated with backup activity where relevant.
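For anyone doing the correlation offline rather than in PromQL, the following Python sketch finds the scrape intervals in which the backup counter went up. It assumes you already have (timestamp, value) samples of dgraph_num_backups_total; the sample data and the function name are hypothetical, and a drop in the counter is treated as a process restart.

```python
def backup_windows(samples):
    """Return (start, end) timestamp pairs during which the counter increased.

    samples: list of (timestamp, counter_value), sorted by timestamp.
    """
    windows = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A lower value means the counter was reset (e.g. a restart), so
        # anything counted since the reset is the increase for this interval.
        increase = v1 - v0 if v1 >= v0 else v1
        if increase > 0:
            windows.append((t0, t1))
    return windows


# Hypothetical samples taken every 60 seconds.
SAMPLES = [(0, 3), (60, 3), (120, 4), (180, 4), (240, 0), (300, 1)]

if __name__ == "__main__":
    for start, end in backup_windows(SAMPLES):
        print(f"backup completed between t={start}s and t={end}s")
```

Any latency spike whose timestamp falls inside one of the returned windows is a candidate for being backup-related.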