hackaugusto opened this issue 9 months ago
The above metrics allow an operator to monitor and troubleshoot the system. For example:
Severe issues can be detected when:
Performance issues can be detected when:
Each of the above scenarios requires a different operational response. The first needs additional debugging, looking over the logs and traces. The latter requires increasing the number of provers, and maybe their sizes.
These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana; and so on.
> These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana; and so on.

Can I get details on these metrics, such as their data, size, and type, so I may recommend which of these tools to use?
There is a mix of gauges, counters, histograms, and events. For example:
Some of the metrics would benefit from tag metadata, especially the HTTP response status.
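For illustration, a minimal sketch of those three metric types with a status tag, using the `prometheus` crate. The crate choice, metric names, label values, and buckets are assumptions for the example, not the node's actual instrumentation:

```rust
// Cargo.toml (assumed): prometheus = "0.13"
use prometheus::{
    register_histogram, register_int_counter_vec, register_int_gauge,
    Encoder, TextEncoder,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Counter with a `status` label so alerts can key off the HTTP response status.
    let requests = register_int_counter_vec!(
        "http_requests_total",
        "Total HTTP requests, partitioned by response status.",
        &["status"]
    )?;
    requests.with_label_values(&["200"]).inc();
    requests.with_label_values(&["500"]).inc();

    // Gauge for a value that moves up and down, e.g. items currently in flight.
    let in_flight = register_int_gauge!(
        "inflight_transactions",
        "Transactions currently being processed."
    )?;
    in_flight.set(42);

    // Histogram for latency trends; p50/p95/p99 are derived at query time.
    let batch_seconds = register_histogram!(
        "batch_build_seconds",
        "Time spent building a batch, in seconds.",
        vec![0.25, 0.5, 1.0, 2.0, 4.0]
    )?;
    batch_seconds.observe(0.42);

    // Render everything in the Prometheus text exposition format,
    // i.e. what a scrape of the /metrics endpoint would return.
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&prometheus::gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}
```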
Thanks, got it. What kind of data rate do you expect, daily or instantaneous? For example GB/hour or MB/hour? And what is your expectation for the total size, TB?
Near real time. The metrics should support automatic alerting and incident detection. To collect state transition timing, sampling would have to be relative to our batch timeout `Duration::from_secs(2)`, which would need roughly 1 sample every 0.5s. I think that is too high, so instead of state transitions we can just use the metrics to collect trends, and a scrape interval of 10s will do.
As for the size of the data, it depends on the number of metrics, their encoding, the data retention policy, and the number of nodes. For some back-of-the-envelope math, let's assume 5 nodes, no compression, and that each data point takes 4 bytes (`u32`/`f32` should be enough). Let's also assume each histogram produces 6 series (p99, p95, p50, average, sum, count).
A rough count of the number of metrics:
node_exporter: as a reference, on my machine with the defaults this is 438 metrics (`curl http://localhost:9100/metrics | grep -v '^#' | wc -l`)
(this will change depending on the number of devices/partitions, etc.)

The above is about 2k metrics total, with about 8640 points per metric per day at 4 bytes each; that is less than 100 MB a day. I think we can round it up to 1 GB a day for good measure, and assume the retention will be one month at full precision, so 30 GB a month would be sufficient.
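Spelled out with the assumptions above (10s scrape interval, so 86400 / 10 = 8640 samples per series per day):

$$
2000 \times 8640 \times 4\,\text{B} \approx 69\,\text{MB/day} \quad\Rightarrow\quad \text{round up to } 1\,\text{GB/day} \times 30\,\text{days} = 30\,\text{GB}
$$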
Thank you @hackaugusto for the great explanation
Add metrics for each component; this will change, but here is an initial list:
Node metrics (CPU, memory, disk usage, etc.) should not be exposed here; this should be done by an external agent (e.g. one of the exporters listed at https://prometheus.io/docs/instrumenting/exporters/).