Prometheus in Practice

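The Prometheus architecture diagram is shown below; the basic structure is fairly simple. Prometheus came out ahead mainly because of the following strengths: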

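Cloud-native friendly. With the pull model, the monitored servers need no knowledge of Prometheus, and deployment on k8s is straightforward.
Flexible queries. PromQL supports many advanced queries.
Rich ecosystem. After joining the CNCF, the project gained the backing of large companies.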

exporter

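An exporter is a data source, and the Prometheus ecosystem has a large number of open-source exporters. The most commonly used is probably node_exporter, which monitors the running state of a host. The reported data is human-readable text, which the official documentation calls the exposition format. The syntax is quite concise: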

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
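To make the grammar above concrete, here is a minimal Go sketch that renders one sample line without a timestamp. The function names `formatSample` and `escapeLabelValue` are made up for illustration and are not part of any Prometheus library; labels are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// escapeLabelValue escapes backslash, double quote, and newline, the three
// characters that must be escaped inside a label value.
func escapeLabelValue(v string) string {
	return strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`).Replace(v)
}

// formatSample renders: metric_name{label="value",...} value
// (hypothetical helper; the optional timestamp is left out).
func formatSample(name string, labels map[string]string, value float64) string {
	var b strings.Builder
	b.WriteString(name)
	if len(labels) > 0 {
		keys := make([]string, 0, len(labels))
		for k := range labels {
			keys = append(keys, k)
		}
		sort.Strings(keys) // deterministic order, not required by the format
		b.WriteByte('{')
		for i, k := range keys {
			if i > 0 {
				b.WriteByte(',')
			}
			fmt.Fprintf(&b, `%s="%s"`, k, escapeLabelValue(labels[k]))
		}
		b.WriteByte('}')
	}
	b.WriteByte(' ')
	b.WriteString(strconv.FormatFloat(value, 'g', -1, 64))
	return b.String()
}

func main() {
	fmt.Println(formatSample("http_requests_total",
		map[string]string{"method": "post", "code": "200"}, 1027))
}
```

The sketch does not validate metric or label names, which a real exporter would need to do.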

The kinds of monitored values can be classified by borrowing go-zero's approach:

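Name	Purpose	Collection function
CounterVec	Counter. Example: QPS statistics	CounterVec.Inc(): increments the metric by 1
GaugeVec	Gauge; can be split by label and carries a timestamp. Example: CPU usage	GaugeVec.Inc()/GaugeVec.Add(): increments the metric by 1 / adds N, which may be negative
HistogramVec	Distribution of values. Examples: request latency, response size	HistogramVec.Observe(val, labels): records the current value, finds the bucket it falls into, and increments that bucket by 1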
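The semantics in the table can be sketched with a few plain Go types. This is only an illustration using the standard library: the types `Counter`, `Gauge`, and `Histogram` below are hypothetical, are not go-zero's or client_golang's implementations, carry no labels, and are not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// Counter only goes up, e.g. a request count.
type Counter struct{ v float64 }

func (c *Counter) Inc() { c.v++ }

// Gauge can go up or down, e.g. current CPU usage.
type Gauge struct{ v float64 }

func (g *Gauge) Set(x float64) { g.v = x }
func (g *Gauge) Add(x float64) { g.v += x } // x may be negative

// Histogram counts observations per bucket and tracks their sum.
type Histogram struct {
	bounds []float64 // sorted upper bounds, +Inf excluded
	counts []uint64  // counts[i] pairs with bounds[i]; the last slot is +Inf
	sum    float64
	count  uint64
}

func NewHistogram(bounds ...float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe finds the first bucket whose upper bound is >= v and adds 1 to it.
func (h *Histogram) Observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
	h.sum += v
	h.count++
}

// Cumulative returns the counts the way the text format exposes them:
// each le bucket includes every smaller bucket.
func (h *Histogram) Cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := NewHistogram(0.05, 0.1, 0.5)
	for _, v := range []float64{0.03, 0.07, 0.3, 2} {
		h.Observe(v)
	}
	fmt.Println(h.Cumulative(), h.count, h.sum)
}
```

Storing per-bucket counts and accumulating them only on output keeps `Observe` cheap; a real library additionally has to handle labels and concurrent writers.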

Implementation code

Sample exposition output

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

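The histogram is the more complex case. It must contain three parts: the bucket counts (Bucket), the total number of samples (Count), and the sum of all observed values (Sum). The buckets are the complicated part, because the exporter has to count how many values fall at or below each upper bound (the le label), so the counts are cumulative. Histograms are commonly used to compute percentiles. A summary also has three parts, but it does not ask Prometheus to do the computation for us: the buckets are replaced by quantiles, meaning the exporter itself computes the statistics.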
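To show how a percentile can be recovered from cumulative buckets, the following Go sketch interpolates linearly inside the bucket that contains the requested rank. It is similar in spirit to PromQL's histogram_quantile but is not a copy of its implementation; the `bucket` type and `estimateQuantile` function are assumptions made for this example, and the data is taken from the http_request_duration_seconds sample above.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative bucket: the number of observations <= le.
type bucket struct {
	le    float64
	count float64
}

// estimateQuantile approximates the q-quantile (0 <= q <= 1) from cumulative
// buckets sorted by le; the last bucket must be +Inf.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank lies in the +Inf bucket, so the highest finite bound is
		// the best answer the data can give.
		return buckets[len(buckets)-2].le
	}
	lower, below := 0.0, 0.0 // the lowest bucket is assumed to start at 0
	if i > 0 {
		lower, below = buckets[i-1].le, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	if inBucket == 0 {
		return lower
	}
	// Assume observations are spread evenly inside the bucket.
	return lower + (buckets[i].le-lower)*(rank-below)/inBucket
}

// sampleBuckets holds the bucket lines from the exposition sample.
func sampleBuckets() []bucket {
	return []bucket{
		{0.05, 24054}, {0.1, 33444}, {0.2, 100392},
		{0.5, 129389}, {1, 133988}, {math.Inf(1), 144320},
	}
}

func main() {
	fmt.Printf("p50 ~ %.4f s\n", estimateQuantile(0.5, sampleBuckets()))
	fmt.Printf("p99 ~ %.4f s\n", estimateQuantile(0.99, sampleBuckets()))
}
```

For the sample data, the median lands in the (0.1, 0.2] bucket at roughly 0.158 seconds, while the 99th percentile falls in the +Inf bucket and is reported as 1, the highest finite bound. The accuracy of such estimates depends entirely on how the bucket boundaries were chosen.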

API

How to Monitor/Instrument Golang with Prometheus (Counter - Gauge - Histogram - Summary)

If you would rather not hand-write the exporter output shown above, the official prometheus/client_golang package is the best choice.
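For contrast, this is roughly what hand-writing the output involves, using only the Go standard library. The metric name `demo_requests_total`, the `/hello` route, and the `exporter` type are invented for this sketch; `httptest` is used so the example runs in-process without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// exporter holds one counter; atomic access keeps it safe across goroutines.
type exporter struct{ requests int64 }

// hello is the instrumented business endpoint.
func (e *exporter) hello(w http.ResponseWriter, _ *http.Request) {
	atomic.AddInt64(&e.requests, 1)
	fmt.Fprintln(w, "hello")
}

// metrics writes the counter by hand in the text exposition format.
func (e *exporter) metrics(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprintln(w, "# HELP demo_requests_total Requests served by /hello.")
	fmt.Fprintln(w, "# TYPE demo_requests_total counter")
	fmt.Fprintf(w, "demo_requests_total %d\n", atomic.LoadInt64(&e.requests))
}

func newMux() *http.ServeMux {
	e := &exporter{}
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", e.hello)
	mux.HandleFunc("/metrics", e.metrics)
	return mux
}

// get issues an in-process GET and returns the response body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	get(mux, "/hello")
	get(mux, "/hello")
	fmt.Print(get(mux, "/metrics"))
	// A real exporter would instead call http.ListenAndServe(addr, mux).
}
```

Even this tiny version has to take care of content type, HELP/TYPE lines, and concurrency by hand, which is the bookkeeping a client library is meant to take over.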

Query DSL

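PromQL is a powerful query language with common math and statistical functions built in, enough to cover complex charts and alerting requirements. See the official examples.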
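As a small illustration of running a PromQL expression programmatically, the Go sketch below builds the URL for an instant query against the Prometheus HTTP API (`/api/v1/query` with a `query` parameter). The helper name `instantQueryURL` and the address `http://localhost:9090` are assumptions for this example; the expression computes the per-status-code request rate over five minutes for the http_requests_total counter shown earlier.

```go
package main

import (
	"fmt"
	"net/url"
)

// instantQueryURL builds the instant-query URL for a PromQL expression.
// base is the address of a Prometheus server (assumed, not checked).
func instantQueryURL(base, expr string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	// url.Values takes care of escaping spaces, brackets, and parentheses.
	u.RawQuery = url.Values{"query": {expr}}.Encode()
	return u.String(), nil
}

func main() {
	expr := `sum by (code) (rate(http_requests_total[5m]))`
	u, err := instantQueryURL("http://localhost:9090", expr)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(u)
}
```

Fetching the URL with `http.Get` returns a JSON document; decoding it is left out here to keep the sketch short.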

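Prometheus's built-in web UI can display results as a Table or a Graph, but the interface is basic and only adequate for simple dashboards. Complex, polished dashboards need Grafana, which also has a rich ecosystem with a large number of templates to choose from.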

Alerting

TODO

Where it falls short

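It uses a specialized storage engine, a Time Series Database (TSDB), to handle time-series data efficiently, and it is highly optimized for PromQL queries. Storage rolls over by time: by default, data is kept for a fixed period (for example 15 days, or a user-defined duration), after which old data blocks are deleted automatically to keep disk usage under control. As a result, it is not suited to long-term persistence of data.
Pull intervals are usually fairly long, and internal resampling may drop some data, so its metrics are not 100% precise.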
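The precision loss from a long pull interval can be seen in a toy simulation. The Go sketch below is purely illustrative: the data is synthetic, the names `sampleEvery`, `maxOf`, and `demoSeries` are made up, and the 15-step interval merely stands in for a scrape interval.

```go
package main

import "fmt"

// sampleEvery keeps one value every `interval` points, mimicking a scraper
// that only sees a gauge at scrape time.
func sampleEvery(series []float64, interval int) []float64 {
	if interval < 1 {
		interval = 1
	}
	var out []float64
	for i := 0; i < len(series); i += interval {
		out = append(out, series[i])
	}
	return out
}

// maxOf returns the largest value, or 0 for an empty slice.
func maxOf(xs []float64) float64 {
	m := 0.0
	for i, x := range xs {
		if i == 0 || x > m {
			m = x
		}
	}
	return m
}

// demoSeries is one synthetic value per second for a minute:
// flat at 10 with a three-second spike.
func demoSeries() []float64 {
	s := make([]float64, 60)
	for i := range s {
		s[i] = 10
	}
	s[20], s[21], s[22] = 95, 97, 96
	return s
}

func main() {
	series := demoSeries()
	scraped := sampleEvery(series, 15) // sees indices 0, 15, 30, 45
	fmt.Println("true max:", maxOf(series), "scraped max:", maxOf(scraped))
}
```

Here the spike between two scrapes is never observed, so the scraped maximum stays at 10 while the true maximum is 97. Counters suffer less from this because they accumulate between scrapes, whereas gauges only report the value at the moment they are read.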

annidy / notes