hackaugusto opened this issue 9 months ago
The above metrics allow an operator to monitor and troubleshoot the system. For example:
Severe issues can be detected when:
Performance issues can be detected when:
Each of the above scenarios requires a different operational response. The first needs additional debugging, looking over the logs and traces. The latter requires increasing the number of provers, and maybe their sizes.
These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana; and so on.
> These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana; and so on.

Can I get details on these metrics, such as their data, size, and type, so I may recommend which of these tools to use?
There is a mix of gauges, counters, histograms, and events. For example:
Some of the metrics would benefit from tag metadata, especially the HTTP response status.
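For illustration, a minimal sketch of those three metric types with a status tag, using the `prometheus` crate. The crate choice, metric names, label values, and buckets are assumptions for the example, not the node's actual instrumentation:

```rust
// Cargo.toml (assumed): prometheus = "0.13"
use prometheus::{
    register_histogram, register_int_counter_vec, register_int_gauge,
    Encoder, TextEncoder,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Counter with a `status` label so alerts can key off the HTTP response status.
    let requests = register_int_counter_vec!(
        "http_requests_total",
        "Total HTTP requests, partitioned by response status.",
        &["status"]
    )?;
    requests.with_label_values(&["200"]).inc();
    requests.with_label_values(&["500"]).inc();

    // Gauge for a value that moves up and down, e.g. items currently in flight.
    let in_flight = register_int_gauge!(
        "inflight_transactions",
        "Transactions currently being processed."
    )?;
    in_flight.set(42);

    // Histogram for latency trends; p50/p95/p99 are derived at query time.
    let batch_seconds = register_histogram!(
        "batch_build_seconds",
        "Time spent building a batch, in seconds.",
        vec![0.25, 0.5, 1.0, 2.0, 4.0]
    )?;
    batch_seconds.observe(0.42);

    // Render everything in the Prometheus text exposition format,
    // i.e. what a scrape of the /metrics endpoint would return.
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&prometheus::gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}
```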
Thanks, got it. What kind of data rate do you expect, daily or instantaneous? For example GB/hour or MB/hour? And what is your expectation for the total size, TB?
Near real time. The metrics should support automatic alerting and incident detection. To collect state transition timing, sampling would have to be relative to our batch timeout `Duration::from_secs(2)`, which would need roughly 1 sample every 0.5s. I think that is too high, so instead of state transitions we can just use the metrics to collect trends, and a scrape interval of 10s will do.
As for the size of the data, it depends on the number of metrics, their encoding, the data retention policy, and the number of nodes. For some back-of-the-envelope math, let's assume 5 nodes, no compression, and that each data point takes 4 bytes (`u32`/`f32` should be enough). Let's also assume each histogram produces 6 series (p99, p95, p50, average, sum, count).
A rough count of the number of metrics:
node_exporter: as a reference, on my machine with the defaults this is 438 metrics (`curl http://localhost:9100/metrics | grep -v '^#' | wc -l`)
(this will change depending on the number of devices/partitions, etc.)

The above is about 2k metrics total, with about 8640 points per metric per day at 4 bytes each; that is less than 100 MB a day. I think we can round it up to 1 GB a day for good measure, and assume the retention will be one month at full precision, so 30 GB a month would be sufficient.
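Spelled out with the assumptions above (10s scrape interval, so 86400 / 10 = 8640 samples per series per day):

$$
2000 \times 8640 \times 4\,\text{B} \approx 69\,\text{MB/day} \quad\Rightarrow\quad \text{round up to } 1\,\text{GB/day} \times 30\,\text{days} = 30\,\text{GB}
$$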
Thank you @hackaugusto for the great explanation
Add metrics for each component; this will change, but here is an initial list:
Node metrics (CPU, memory, disk usage, etc.) should not be exposed here; this should be done by an external agent (e.g. one of the exporters listed at https://prometheus.io/docs/instrumenting/exporters/).