flashcatcloud / categraf

one-stop telemetry collector for nightingale
https://flashcat.cloud/docs/
MIT License
811 stars 250 forks

Timestamps are wrong for data categraf writes to Prometheus #1026

Open jianghuren-01 opened 1 month ago

jianghuren-01 commented 1 month ago

Relevant config.toml

[global]
print_configs = false
hostname = ""
omit_hostname = false
interval = 15
providers = ["local"]
concurrency = -1
[global.labels]
[log]
file_name = "stdout"
max_size = 100
max_age = 1
max_backups = 1
local_time = true
compress = false
[writer_opt]
batch = 1000
chan_size = 1000000
[[writers]]
url = "http://127.0.0.1:17000/prometheus/v1/write"
basic_auth_user = ""
basic_auth_pass = ""
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"
ignore_hostname = false
agent_host_tag = ""
ignore_global_labels = false
[ibex]
enable = false
interval = "1000ms"
servers = ["127.0.0.1:20090"]
meta_dir = "./meta"
[heartbeat]
enable = true
url = "http://127.0.0.1:17000/v1/n9e/heartbeat"
interval = 10
basic_auth_user = ""
basic_auth_pass = ""
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
[prometheus]
enable = false
scrape_config_file = "/path/to/in_cluster_scrape.yaml"
log_level = "info"
[[writers]]
url = "http://192.168.4.212:9090/api/v1/write"

Logs from categraf

14:43:00 aliyun_acs_rds_dashboard_iops_usage_minimum agent_hostname=ali-idn3-loki instance_id=rm-<instance_id_1> metric_name=IOPSUsage namespace=acs_rds_dashboard user_id=****** 0.011
14:43:00 aliyun_acs_rds_dashboard_iops_usage_average agent_hostname=ali-idn3-loki instance_id=rm-<instance_id_1> metric_name=IOPSUsage namespace=acs_rds_dashboard user_id=****** 0.011
14:43:00 aliyun_acs_rds_dashboard_iops_usage_minimum agent_hostname=ali-idn3-loki instance_id=rr-<instance_id_2> metric_name=IOPSUsage namespace=acs_rds_dashboard user_id=****** 0.008
14:43:00 aliyun_acs_rds_dashboard_iops_usage_average agent_hostname=ali-idn3-loki instance_id=rr-<instance_id_2> metric_name=IOPSUsage namespace=acs_rds_dashboard user_id=****** 0.008
14:43:00 aliyun_acs_rds_dashboard_iops_usage_maximum agent_hostname=ali-idn3-loki instance_id=rr-<instance_id_2> metric_name=IOPSUsage namespace=acs_rds_dashboard user_id=****** 0.008
15:43:05 aliyun_cms_request_count agent_hostname=ali-idn3-loki callee=DescribeMetricList metric_name=SQLServer_CpuUsage namespace=acs_rds_dashboard 1
15:43:05 aliyun_cms_request_count agent_hostname=ali-idn3-loki callee=DescribeMetricList metric_name=Cluster_ConnectionUsage namespace=acs_rds_dashboard 1
2024/08/09 15:43:05 metrics_reader.go:60: D! local.aliyun : after gather once, duration: 535.043559ms
2024/08/09 15:43:05 writers.go:139: D!, write 281 time series to all writers, cost: 2 ms

System info

categraf v0.3.76, CentOS 7.9

Docker

No response

Steps to reproduce

1. Configure the input.aliyun plugin
2. Start the categraf client

Expected behavior

Data is written correctly to both the VictoriaMetrics and Prometheus data sources.

Actual behavior

Timestamps of the data written to VictoriaMetrics are normal; timestamps of the data written to Prometheus are wrong.

Additional info

No response

kongfei605 commented 1 month ago

That's not possible.

jianghuren-01 commented 1 month ago

@kongfei605 These are the query results from the two data sources. VictoriaMetrics is normal; in Prometheus the timestamps lag behind.

[Screenshots: n9e, Prometheus]

kongfei605 commented 1 month ago

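The data in Prometheus is normal.

  1. The aliyun plugin itself has a delay, typically 5 minutes, and it can be configured to be longer. You did not post the plugin config, so I cannot confirm the value.
  2. Check whether the server side is configured with ForceUseServerTS = true. With that setting, every reported metric is stamped with the server's current time.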
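
The second point can be illustrated with a small sketch. This is not n9e's actual code: `Sample`, `ingest`, and the `force_use_server_ts` parameter are hypothetical names standing in for the server-side `ForceUseServerTS` switch, and the epoch value is arbitrary.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    name: str
    value: float
    timestamp_ms: int  # Unix epoch milliseconds as reported by the agent


def ingest(sample: Sample, server_now_ms: int, force_use_server_ts: bool) -> Sample:
    """Return the sample as a server might store it (hypothetical helper)."""
    if force_use_server_ts:
        # Discard the reported timestamp and stamp the sample with server time.
        return replace(sample, timestamp_ms=server_now_ms)
    # Otherwise the timestamp reported by the agent is kept unchanged.
    return sample


reported = Sample("aliyun_acs_rds_dashboard_iops_usage_average", 0.011, 1_723_185_780_000)
server_now_ms = reported.timestamp_ms + 5 * 60 * 1000  # sample arrives 5 minutes late

kept = ingest(reported, server_now_ms, force_use_server_ts=False)
forced = ingest(reported, server_now_ms, force_use_server_ts=True)
print(kept.timestamp_ms, forced.timestamp_ms)
```

With the switch off, the stored point stays 5 minutes in the past; with it on, the same point appears at the server's current time. That would explain two backends showing different times for the same series.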
jianghuren-01 commented 1 month ago

@kongfei605 I tried setting ForceUseServerTS = true on the categraf side, but it had no effect. The aliyun configuration is as follows:

interval=60
ratelimit=50
#catch_ttl="1h"
timeout="5s"
[[instances]]
region="ap-southeast-5"
endpoint="metrics-vpc.ap-southeast-5.aliyuncs.com"
access_key_id=""
access_key_secret="**"
namespaces=[ "acs_rds_dashboard" ]
[[instances.metric_filters]]
namespace="acs_rds_dashboard"
metric_names=[ "MySQL_SlowQueries", "CpuUsage", "MemoryUsage", "DiskUsage", "MySQL_ActiveSessions", "ConnectionUsage", "IOPSUsage", "MySQL_NetworkInNew", "MySQL_NetworkOutNew", "MySQL_QPS", "MySQL_TPS", "MySQL_IbufUseRatio" ]

kongfei605 commented 1 month ago

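categraf does not support ForceUseServerTS = true. Other plugins use the current collection time by default. The Aliyun metrics carry their own timestamps, so the current time is not attached to them.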
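
The behaviour described above can be sketched as follows. This is only an illustration of the rule, not categraf's implementation: `resolve_timestamp_ms` is a hypothetical helper, and the 10-minute delay is an assumed value.

```python
import time
from typing import Optional


def resolve_timestamp_ms(source_ts_ms: Optional[int], collected_at_ms: int) -> int:
    """Pick the timestamp attached to a sample before it is written."""
    if source_ts_ms is not None:
        # The upstream API already supplied a timestamp: keep it.
        return source_ts_ms
    # No upstream timestamp: fall back to the collection time.
    return collected_at_ms


now_ms = int(time.time() * 1000)
ten_minutes_ms = 10 * 60 * 1000

# A locally collected metric has no upstream timestamp.
local_metric_ts = resolve_timestamp_ms(None, now_ms)
# A cloud-monitor metric reports the instant it was measured, here 10 minutes ago.
cloud_metric_ts = resolve_timestamp_ms(now_ms - ten_minutes_ms, now_ms)

print("lag in minutes:", (local_metric_ts - cloud_metric_ts) / 60_000)
```

Under these assumptions the cloud metric is stored 10 minutes behind a metric gathered in the same cycle, even though both were written at the same moment.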

jianghuren-01 commented 4 weeks ago

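That is indeed the issue. Is there any plan to improve this? I think dropping the timestamp carried by the metrics would be more reasonable.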

kongfei605 commented 3 weeks ago

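This is the intended behavior. Would deleting the timestamp collected by Aliyun really be more reasonable? Should an incident from 10 minutes ago be treated as one happening right now?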

jianghuren-01 commented 3 weeks ago

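As I see it, this is a time zone conversion problem: the Aliyun monitoring metrics carry time zone information, but Prometheus records data in UTC, which directly affects the accuracy of the visualized data.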

kongfei605 commented 3 weeks ago

You may want to revisit the concept of a timestamp.

kongfei605 commented 3 weeks ago

btw, a time zone difference would not be off by just 5 or 10 minutes.
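
To make the timestamp point concrete: a Unix timestamp counts time elapsed since 1970-01-01T00:00:00Z and carries no time zone. A time zone only changes how that instant is displayed. A minimal sketch, using an arbitrary epoch value and a fixed UTC+8 offset chosen purely for illustration:

```python
from datetime import datetime, timedelta, timezone

epoch_seconds = 1_723_185_780  # an arbitrary instant, seconds since the Unix epoch

utc = timezone.utc
utc_plus_8 = timezone(timedelta(hours=8))  # assumed display offset

as_utc = datetime.fromtimestamp(epoch_seconds, tz=utc)
as_local = datetime.fromtimestamp(epoch_seconds, tz=utc_plus_8)

print(as_utc.isoformat())    # 2024-08-09T06:43:00+00:00
print(as_local.isoformat())  # 2024-08-09T14:43:00+08:00
print(as_utc == as_local)    # True: same instant, different rendering
print(as_local.utcoffset())  # 8:00:00
```

The two renderings differ by a whole 8 hours, yet they compare equal because they are the same instant. A lag of a few minutes between two backends therefore points to the sample timestamps themselves, not to time zone handling.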