fstab / grok_exporter

Export Prometheus metrics from arbitrary unstructured log data.
Apache License 2.0

How retention works? #67

Open kingbyteking opened 5 years ago

kingbyteking commented 5 years ago

In my grok configuration file, I enabled the retention setting for gauge metrics (retention: 5m), and I use curl to fetch the metrics. Even 10 minutes after the metrics were generated, I can still see my self-defined metrics at http://localhost:9144/metrics

Could someone help me understand how the retention setting works, and why the metrics still exist at http://localhost:9144/metrics? The command I use: curl http://localhost:9144/metrics | grep my_metrics | wc -l ... 546

Should I use some parameter setting to get the correct result? I expect that expired metrics should not be present in /metrics.
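For repeated checks, the curl | grep | wc -l pipeline above can be approximated in a few lines of Python. This is only a sketch: count_samples and fetch_metrics are hypothetical helper names, not part of grok_exporter, and unlike grep the function matches the metric name exactly and skips the # HELP and # TYPE comment lines, so its count can be lower than the grep result.

```python
import urllib.request


def count_samples(exposition_text: str, metric_name: str) -> int:
    """Count sample lines for one metric in Prometheus text exposition output."""
    count = 0
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric_name:
            count += 1
    return count


def fetch_metrics(url: str = "http://localhost:9144/metrics") -> str:
    """Fetch the raw /metrics page (URL taken from the question above)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")
```

Calling count_samples(fetch_metrics(), "my_metrics") at intervals shows whether the number of exposed label sets shrinks once the retention time has passed.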

fstab commented 5 years ago

Please attach your config file, I'll have a look.

kingbyteking commented 5 years ago

Please find the config file attached. The reason I'm asking is what I observed: when using a Prometheus query, no data is returned for the expired metrics, but on port 9144 at /metrics, the metrics are all still there. I assume grok_exporter should not display expired metrics on port 9144?

Is there any way to get debug info? It seems some expired metrics are removed, but some are kept. I'm not sure whether it's an issue with my input file.

kingbyteking commented 5 years ago

global:
    config_version: 2
    retention_check_interval: 53s
input:
    type: file
    path: /path/to/input
    readall: false # Read from the beginning of the file? False means we start at the end of the file and read only new lines.
grok:
    patterns_dir: /path/to/pattern
metrics:

server:
    port: 9144

fstab commented 5 years ago

Your config looks ok. The expected behavior is: if a metric is not updated for 5 minutes 53 seconds (retention time plus check interval), the metric should disappear from http://localhost:9144/metrics. Are you sure that no new log lines for these metrics are written?
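The "retention time plus check interval" figure can be worked out explicitly. The sketch below assumes that a metric whose retention runs out just after one check is only removed at the next check, which is how the sentence above reads; parse_duration and worst_case_expiry are hypothetical helpers, and the parser only handles simple single-unit values such as 5m or 53s.

```python
from datetime import timedelta

# Map the duration suffix to the matching timedelta keyword (assumed format).
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration like '5m' or '53s' into a timedelta."""
    value, unit = text[:-1], text[-1]
    if unit not in _UNITS:
        raise ValueError(f"unsupported duration unit in {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def worst_case_expiry(retention: str, check_interval: str) -> timedelta:
    """Longest time a stale metric may stay visible under the stated assumption."""
    return parse_duration(retention) + parse_duration(check_interval)


if __name__ == "__main__":
    # Values from the config in this thread: retention 5m, check interval 53s.
    print(worst_case_expiry("5m", "53s"))  # prints 0:05:53
```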

Quick experiment to verify that retention works in general: The grok_exporter distribution has an example in ./example/config.yml. The example metric is a counter named exim_rejected_rcpt_total. I copied and pasted the metric to create exim_rejected_rcpt_total2, which is exactly the same as exim_rejected_rcpt_total but with retention: 2m:

global:
    config_version: 2
input:
    type: file
    path: ./example/exim-rejected-RCPT-examples.log
    readall: true # Read from the beginning of the file? False means we start at the end of the file and read only new lines.
grok:
    patterns_dir: ./logstash-patterns-core/patterns
    additional_patterns:
    - 'EXIM_MESSAGE [a-zA-Z ]*'
metrics:
    - type: counter
      name: exim_rejected_rcpt_total
      help: Total number of rejected recipients, partitioned by error message.
      match: '%{EXIM_DATE} %{EXIM_REMOTE_HOST} F=<%{EMAILADDRESS}> rejected RCPT <%{EMAILADDRESS}>: %{EXIM_MESSAGE:message}'
      labels:
          error_message: '{{.message}}'
    - type: counter
      name: exim_rejected_rcpt_total2
      help: Total number of rejected recipients, partitioned by error message.
      match: '%{EXIM_DATE} %{EXIM_REMOTE_HOST} F=<%{EMAILADDRESS}> rejected RCPT <%{EMAILADDRESS}>: %{EXIM_MESSAGE:message}'
      labels:
          error_message: '{{.message}}'
      retention: 2m
server:
    host: localhost
    port: 9144

Now I run grok_exporter -config ./example/config.yml. Initially, I see the same matches for both metrics:

exim_rejected_rcpt_total{error_message="Sender verify failed"} 2000
exim_rejected_rcpt_total{error_message="Unrouteable address"} 32
exim_rejected_rcpt_total{error_message="relay not permitted"} 165

exim_rejected_rcpt_total2{error_message="Sender verify failed"} 2000
exim_rejected_rcpt_total2{error_message="Unrouteable address"} 32
exim_rejected_rcpt_total2{error_message="relay not permitted"} 165

No new log messages are written, since the logfile is unchanged. After 3 minutes, the exim_rejected_rcpt_total2 metrics disappear and only the exim_rejected_rcpt_total metrics are left:

exim_rejected_rcpt_total{error_message="Sender verify failed"} 2000
exim_rejected_rcpt_total{error_message="Unrouteable address"} 32
exim_rejected_rcpt_total{error_message="relay not permitted"} 165

You should have similar behavior with your config.
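The experiment can also be pictured with a toy model. This is not grok_exporter's implementation, only a sketch of the behavior described here: every label set remembers when it was last updated, and a periodic sweep drops the ones older than the retention time. RetentionStore and its methods are made-up names, and times are plain seconds.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RetentionStore:
    """Toy model of one metric whose label sets expire when not updated."""
    retention_s: Optional[float]  # None means no retention: keep forever
    last_update: Dict[str, float] = field(default_factory=dict)

    def observe(self, label_value: str, now: float) -> None:
        # A matching log line refreshes the timestamp of its label set.
        self.last_update[label_value] = now

    def sweep(self, now: float) -> None:
        # Periodic check: remove label sets not updated within the retention time.
        if self.retention_s is None:
            return
        expired = [k for k, t in self.last_update.items()
                   if now - t >= self.retention_s]
        for key in expired:
            del self.last_update[key]


if __name__ == "__main__":
    plain = RetentionStore(retention_s=None)     # like exim_rejected_rcpt_total
    with_ret = RetentionStore(retention_s=120)   # like exim_rejected_rcpt_total2
    for store in (plain, with_ret):
        for msg in ("Sender verify failed", "Unrouteable address", "relay not permitted"):
            store.observe(msg, now=0)
    for store in (plain, with_ret):
        store.sweep(now=180)  # three minutes later, no new log lines
    print(len(plain.last_update), len(with_ret.last_update))  # prints 3 0
```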

kingbyteking commented 5 years ago

Here is what I have observed over the last few days: retention fails for only one metric, which stays there all the time. The other metrics are deleted by the retention timer. The metric that is not deleted has some empty label values (""); I'm not sure whether that is the cause. I will try to verify this and report back later.

kingbyteking commented 5 years ago

I replaced those label values, changing "" to "-", and retention now works normally. Will the team fix this issue?

For now, I'm using the Prometheus client library to push metrics to work around this issue.
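The substitution described above is easy to apply in code that builds label sets itself, for example before handing them to a client library. The helper below is only an illustration: normalize_labels is a hypothetical name, and "-" is simply the placeholder used in this thread.

```python
from typing import Dict


def normalize_labels(labels: Dict[str, str], placeholder: str = "-") -> Dict[str, str]:
    """Return a copy of the labels with empty values replaced by a placeholder."""
    return {name: (value if value != "" else placeholder)
            for name, value in labels.items()}


if __name__ == "__main__":
    print(normalize_labels({"host": "web1", "user": ""}))
    # prints {'host': 'web1', 'user': '-'}
```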

fstab commented 5 years ago

Thanks for the analysis, I'll look into this.