0xPolygonMiden / miden-node

Reference implementation of the node for the Polygon Miden rollup
MIT License

Expose metrics for each service #144

Open hackaugusto opened 9 months ago

hackaugusto commented 9 months ago

Add metrics for each component. This will change, but here is an initial list:

Node metrics (CPU, memory, disk usage, etc.) should not be exposed here; this should be done by an external agent (e.g. https://prometheus.io/docs/instrumenting/exporters/)

hackaugusto commented 9 months ago

The above metrics allow an operator to monitor and troubleshoot the system. For example:

Severe issues can be detected when:

Performance issues can be detected when:

Each of the above scenarios requires a different response. The first needs additional debugging, looking over the logs and traces. The latter requires increasing the number of provers, and possibly their sizes.

These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB, and alerts can be created via Alertmanager or Opsgenie. The metrics can be inspected via Grafana, and so on.
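The scrape-based tools mentioned all consume the Prometheus text exposition format. Here is a minimal sketch of rendering a counter and a gauge in that format using only the standard library; the metric names (`store_requests_total`, `block_producer_batch_queue_len`) are illustrative, not the node's actual metric names.

```rust
// Minimal sketch: render metrics in the Prometheus text exposition format.
// Metric names below are illustrative placeholders.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n")
}

fn render_gauge(name: &str, help: &str, value: f64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} gauge\n{name} {value}\n")
}

fn main() {
    let mut body = String::new();
    body.push_str(&render_counter(
        "store_requests_total",
        "Total requests handled by the store.",
        1024,
    ));
    body.push_str(&render_gauge(
        "block_producer_batch_queue_len",
        "Transactions currently queued for batching.",
        17.0,
    ));
    // A scraper (Prometheus, etc.) would fetch this body over HTTP,
    // typically from a /metrics endpoint.
    print!("{body}");
}
```

In practice a metrics crate would generate this output; the sketch only shows what the scrapers expect on the wire.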

okcan commented 8 months ago

Can I get details on these metrics, such as the data, size, and type, so I can recommend which of these tools to use? "These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB, and alerts can be created via Alertmanager or Opsgenie. The metrics can be inspected via Grafana, and so on."

hackaugusto commented 8 months ago

> Can I get details on these metrics, such as the data, size, and type, so I can recommend which of these tools to use?

There is a mix of gauges, counters, histograms, and events. For example:

Some of the metrics would benefit from tag metadata (labels), especially the HTTP response status.
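To make the three metric kinds concrete, here is a sketch using only the standard library. A real node would use a metrics crate rather than hand-rolled types; this only illustrates the semantics of each kind (counters only go up, gauges go up and down, histograms bucket observations).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counter: monotonically increasing (e.g. total requests served).
struct Counter(AtomicU64);

impl Counter {
    const fn new() -> Self { Self(AtomicU64::new(0)) }
    fn inc(&self) { self.0.fetch_add(1, Ordering::Relaxed); }
    fn get(&self) -> u64 { self.0.load(Ordering::Relaxed) }
}

/// Gauge: goes up and down (e.g. current queue length).
struct Gauge(AtomicU64);

impl Gauge {
    const fn new() -> Self { Self(AtomicU64::new(0)) }
    fn set(&self, v: u64) { self.0.store(v, Ordering::Relaxed); }
    fn get(&self) -> u64 { self.0.load(Ordering::Relaxed) }
}

/// Histogram: observations counted into buckets, from which percentile
/// estimates are derived (e.g. request latency in milliseconds).
struct Histogram {
    bounds: Vec<u64>,        // upper bound of each bucket
    buckets: Vec<AtomicU64>, // observation counts per bucket
    sum: AtomicU64,
    count: AtomicU64,
}

impl Histogram {
    fn new(bounds: Vec<u64>) -> Self {
        let buckets = bounds.iter().map(|_| AtomicU64::new(0)).collect();
        Self { bounds, buckets, sum: AtomicU64::new(0), count: AtomicU64::new(0) }
    }
    fn observe(&self, v: u64) {
        for (i, bound) in self.bounds.iter().enumerate() {
            if v <= *bound {
                self.buckets[i].fetch_add(1, Ordering::Relaxed);
                break;
            }
        }
        self.sum.fetch_add(v, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let requests = Counter::new();
    requests.inc();
    requests.inc();

    let queue_len = Gauge::new();
    queue_len.set(17);

    let latency_ms = Histogram::new(vec![10, 100, 1000]);
    latency_ms.observe(42); // lands in the <=100ms bucket

    assert_eq!(requests.get(), 2);
    assert_eq!(queue_len.get(), 17);
    assert_eq!(latency_ms.count.load(Ordering::Relaxed), 1);
}
```

Events (e.g. a reorg or a failed proof) map less directly onto these types; they are usually emitted as log or trace records and counted.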

okcan commented 8 months ago

Thanks, got it. What kind of data ingestion do you expect, daily or instantaneous? For example, GB/hour or MB/hour? And what is your expectation for the total size, TB?

hackaugusto commented 8 months ago

> Thanks, got it. What kind of data ingestion do you expect, daily or instantaneous? For example, GB/hour or MB/hour? And what is your expectation for the total size, TB?

Near real time. The metrics should support automatic alerting and incident detection. To collect state-transition timings, the sampling would have to be relative to our batch timeout (`Duration::from_secs(2)`), which would need roughly one sample every 0.5s. I think that is too high, so instead of state transitions we can just use the metrics to collect trends, and a scrape interval of 10s will do.

As for the size of the data, it depends on the number of metrics, their encoding, the data retention policy, and the number of nodes. For a back-of-the-envelope estimate, let's assume 5 nodes, no compression, and that each data point takes 4 bytes (u32/f32 should be enough). Let's also assume each histogram exports 6 series (p99, p95, p50, average, sum, count).

A rough count of the number of metrics:

The above is about 2k metrics total, with about 8,640 points per metric per day (one scrape every 10s) at 4 bytes each, which is less than 100MB a day. I think we can round it up to 1GB for good measure, and assume the data retention will be one month at full precision, so 30GB a month would be sufficient.
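The arithmetic above can be checked with a short program. The constants are the assumptions stated in the thread (2k metrics total across the 5 nodes, 10s scrape interval, 4 bytes per point, no compression), not measured values:

```rust
// Back-of-the-envelope storage estimate from the discussion above.
fn main() {
    let metrics_total: u64 = 2_000;  // assumed total across all 5 nodes
    let scrape_interval_s: u64 = 10;
    let bytes_per_point: u64 = 4;    // u32/f32, uncompressed

    let points_per_day = 24 * 60 * 60 / scrape_interval_s; // 8640
    let bytes_per_day = metrics_total * points_per_day * bytes_per_point;
    let mb_per_day = bytes_per_day / 1_000_000;

    println!("{points_per_day} points/day per metric");
    println!("~{mb_per_day} MB/day raw"); // ~69 MB/day, under the 100MB bound

    // Rounding generously to 1GB/day, a one-month retention at full
    // precision needs about 30GB.
}
```

In practice a TSDB's compression (Prometheus averages well under 4 bytes per sample) would shrink this further, so the 30GB/month figure is comfortably conservative.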

okcan commented 8 months ago

Thank you @hackaugusto for the great explanation