flashcatcloud / categraf

one-stop telemetry collector for nightingale
https://flashcat.cloud/docs/
MIT License

input.prometheus plugin collects incomplete data #510

Closed song-yunfei closed 1 year ago

song-yunfei commented 1 year ago

Relevant config.toml

----config.toml----
[global]
# whether print configs
print_configs = false

# add label(agent_hostname) to series
# "" -> auto detect hostname
# "xx" -> use specified string xx
# "$hostname" -> auto detect hostname
# "$ip" -> auto detect ip
# "$hostname-$ip" -> auto detect hostname and ip to replace the vars
hostname = ""

# will not add label(agent_hostname) if true
omit_hostname = false

# s | ms
precision = "ms"

# global collect interval
interval = 60

# input provider settings; optional: local / http
providers = ["local"]

disable_usage_report = true

[global.labels]

datacenter = "IDC"
region = "North_China"
zone = "BeiJing"

[log]
# file_name is the file to write logs to
file_name = "stdout"

# options below will not work when file_name is stdout or stderr
# max_size is the maximum size in megabytes of the log file before it gets rotated. It defaults to 100 megabytes.
max_size = 100
# max_age is the maximum number of days to retain old log files based on the timestamp encoded in their filename.
max_age = 1
# max_backups is the maximum number of old log files to retain.
max_backups = 1
# local_time determines if the time used for formatting the timestamps in backup files is the computer's local time.
local_time = true
# Compress determines if the rotated log files should be compressed using gzip.
compress = false

[writer_opt]
batch = 1000
chan_size = 1000000

[[writers]]
url = "http://n9e.xx.xx:19000/prometheus/v1/write"

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"

[ibex]
enable = false
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["n9e.xx.xx:20090"]
## temp script dir
meta_dir = "./meta"

[heartbeat]
enable = true

# report os version cpu.util mem.util metadata
url = "http://xx.xx.xx:19000/v1/n9e/heartbeat"

# interval, unit: s
interval = 10

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

---- input.prometheus/prometheus.toml----
interval = 30

[[instances]]
urls = [
    "http://127.0.0.1:8030/metrics"
]

url_label_key = "instance"
url_label_value = "{{.Host}}"
timeout = "10s"
## Scrape Services available in Consul Catalog
#  [[instances.consul.query]]
#    name = "a service name"
#    tag = "a service tag"
#    url = 'http://{{if ne .ServiceAddress ""}}{{.ServiceAddress}}{{else}}{{.Address}}{{end}}:{{.ServicePort}}/{{with .ServiceMeta.metrics_path}}{{.}}{{else}}metrics{{end}}'
#    [instances.consul.query.tags]
#      host = "{{.Node}}"
#
# bearer_token_string = ""

# e.g. /run/secrets/kubernetes.io/serviceaccount/token
# bearer_token_file = ""

# # basic auth
# username = ""
# password = ""

# headers = ["X-From", "categraf"]

# # interval = global.interval * interval_times
# interval_times = 4

labels = {group="fe",job="Doris-fe",cluster="online"}

# support glob
ignore_metrics = [ "go_*" ]
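
As a quick sanity check on the `ignore_metrics` setting above, here is a minimal sketch of how a `go_*` glob behaves. It uses Python's `fnmatch` module, not categraf's own matcher, so it only assumes that categraf treats a simple prefix glob the same way. Under that assumption, the ignore rule would not explain missing `doris_fe_*` series.

```python
from fnmatch import fnmatchcase

# Pattern copied from the ignore_metrics setting in the config above.
IGNORE_PATTERNS = ["go_*"]


def is_ignored(metric_name, patterns=IGNORE_PATTERNS):
    """Return True if the metric name matches any of the glob patterns.

    Assumption: categraf's glob semantics agree with fnmatch for a
    simple prefix pattern such as "go_*".
    """
    return any(fnmatchcase(metric_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ["go_goroutines", "doris_fe_query_latency_ms"]:
        print(name, "ignored" if is_ignored(name) else "kept")
```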

Logs from categraf

No abnormal logs.

System info

v0.3.0-3ae9599251088bae5414c8c0c776d3649613c8cc

Docker

No response

Steps to reproduce

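  1. The scrape target is the metrics endpoint of a Doris cluster, version 1.1.3-rc02-b4364b451.
  2. Examples of the missing data (I suspect the labels are the problem):
     doris_fe_editlog_write_latency_ms{quantile="0.75"} 1.0
     doris_fe_query_latency_ms{quantile="0.75"} 14.0

I tested 0.2.29 and 0.2.28, and neither works. After rolling categraf back to 0.2.9, everything works normally again. My server version is v5.15.0, and the data is stored in vmcluster.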

Expected behavior

The complete metrics data is reported.

Actual behavior

Metrics data is lost.

Additional info

No response

kongfei605 commented 1 year ago

Run ./categraf --test --inputs prometheus --debug and see what it prints.

song-yunfei commented 1 year ago

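[image] I ran the test and found that the metrics do exist, but a suffix has been appended to their names, as shown in the screenshot. In the screenshot, 2.9 is the normal one; the names with the suffix come from version 0.3.0.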
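
For anyone hitting the same thing, a comparison like the one in the screenshot can be scripted once the metric names from both versions have been collected into two lists. This is a minimal sketch under assumptions: `find_suffixed` is a hypothetical helper, the suffix is assumed to be joined with an underscore, and `_SUFFIX` in the usage example is a placeholder because the actual suffix is only visible in the screenshot.

```python
def find_suffixed(old_names, new_names):
    """Map each metric name that disappeared to new names that extend it.

    Assumption: a renamed metric keeps the old name as a prefix and
    appends "_<something>" to it.
    """
    old_set, new_set = set(old_names), set(new_names)
    missing = old_set - new_set   # present before, gone now
    added = new_set - old_set     # new after the upgrade
    renamed = {}
    for name in sorted(missing):
        candidates = sorted(n for n in added if n.startswith(name + "_"))
        if candidates:
            renamed[name] = candidates
    return renamed


if __name__ == "__main__":
    # Hypothetical names; "_SUFFIX" stands in for the real suffix.
    old = ["doris_fe_query_latency_ms", "doris_fe_editlog_write_latency_ms"]
    new = ["doris_fe_query_latency_ms_SUFFIX", "doris_fe_editlog_write_latency_ms"]
    print(find_suffixed(old, new))
```

Any name that shows up in the result was not dropped but renamed, which would match the symptom of queries on the old name returning no data.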

song-yunfei commented 1 year ago

Run ./categraf --test --inputs prometheus --debug and see what it prints.

kongfei605 commented 1 year ago

So the issue is that the metric names changed after the version upgrade, is that right?

song-yunfei commented 1 year ago

Yes, categraf added a suffix.

kongfei605 commented 1 year ago

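That's possible. I went through the changelog, and the changes to quantile handling are mostly in https://github.com/flashcatcloud/categraf/compare/v0.2.35...v0.2.36, but that looks like code being moved around rather than a suffix being added to metric names. One of the community's basic goals is to keep metric names as stable as possible, so metrics are not changed without a specific reason.

If this is only a metric name change, I'll close this issue for now. Feel free to reopen it if you have any other questions.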

song-yunfei commented 1 year ago

It can be closed.