
Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

in_monitor_agent: retry_count and slow_flush_count not resetting to zero after successful retry #3509

Closed g3kr closed 2 years ago

g3kr commented 3 years ago

Describe the bug

We are using in_monitor_agent to monitor the metrics from fluentd. Based on the emitted metrics we have alerts being sent out. We observed that the retry_count and slow_flush_count metrics do not reset to zero once things recover. Unless you restart the fluentd process/task, these numbers keep incrementing.

To Reproduce

Run fluentd with the config below and force a retry to happen by sending a large number of logs to Fluentd. Query the retry_count metric and observe that, after a successful retry, the count has not been reset.
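
For reference, one way to query the metric outside the emitted event stream is in_monitor_agent's HTTP API. A minimal sketch, assuming the plugin's default HTTP endpoint on port 24220 (the configuration below only shows the event-emitting setup, so the port is an assumption):

# Minimal sketch: read the cumulative retry_count for each output plugin from
# in_monitor_agent's HTTP API. The default port 24220 is an assumption here.
require 'net/http'
require 'json'

body = Net::HTTP.get(URI('http://localhost:24220/api/plugins.json'))
JSON.parse(body)['plugins'].each do |plugin|
  next unless plugin['output_plugin']
  puts "#{plugin['plugin_id']}: retry_count=#{plugin['retry_count']}"
end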

Expected behavior

retry_count and slow_flush_count are set back to 0 after a successful retry.

Your Environment

- Fluentd version: fluentd -v 1.12.4
- Environment: Docker running on Amazon Linux 2

Your Configuration

<source>
  @type monitor_agent
  @id in_monitor_agent
  @log_level info
  @label @INTERNALMETRICS
  tag "monitor.#{ENV['TaskID']}"
  emit_interval 60
</source>

<label @INTERNALMETRICS>
  <filter monitor.**>
    @type record_modifier
    <record>
      TaskID "#{ENV['TaskID']}"
      ECS_CLUSTER "#{ENV['ECS_CLUSTER_NAME']}"
      @timestamp ${require 'time'; Time.at(time).strftime('%Y-%m-%dT%H:%M:%S.%3N')}
    </record>
  </filter>
  <match monitor.**>
    @type copy
    <store>
      @type stdout
    </store>
    <store>
      @type elasticsearch
      host "#{ENV['ES_HOSTNAME']}"
      port 9243
      user "#{ENV['ES_USERNAME']}"
      password "#{ENV['ES_PASSWORD']}"
      scheme https
      with_transporter_log true
      ssl_verify false
      ssl_version TLSv1_2
      index_name "#{ENV['ES_index']}"
      reconnect_on_error true
      reload_connections false
      reload_on_failure true
      suppress_type_name true
      request_timeout 30s
      prefer_oj_serializer true
      type_name _doc
    </store>
  </match>
</label>

Your Error Log

{
        "_index" : "agg-metrics",
        "_type" : "_doc",
        "_id" : "WqrN6nsBy9uSnPxiS8mH",
        "_score" : 0.0,
        "_source" : {
          "plugin_id" : "es_output",
          "plugin_category" : "output",
          "type" : "elasticsearch",
          "output_plugin" : true,
          "buffer_queue_length" : 0,
          "buffer_timekeys" : [ ],
          "buffer_total_queued_size" : -135483,
          "retry_count" : 76,
          "emit_records" : 1614787,
          "emit_count" : 416410,
          "write_count" : 66858,
          "rollback_count" : 76,
          "slow_flush_count" : 39,
          "flush_time_count" : 40485392,
          "buffer_stage_length" : 1,
          "buffer_stage_byte_size" : 19338,
          "buffer_queue_byte_size" : -154821,
          "buffer_available_buffer_space_ratios" : 100.0,
          "TaskID" : "0da43d9abf1d492dbae9bb14c5bdqazx",
          "ECS_CLUSTER" : "aggregator-service-ECSCluster",
          "@timestamp" : "2021-09-15T18:51:04.842"
        }
}

Additional context

No response

g3kr commented 3 years ago

@repeatedly any observation/thoughts on this?

cosmo0920 commented 3 years ago

These should be counters, i.e. cumulative counters, not gauges (resettable metrics). Not resetting them is the expected behavior.
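
Since a cumulative counter only ever increases, alerting is usually done on the increase between two successive samples rather than on the absolute value. A minimal sketch of that approach, assuming the default monitor_agent HTTP endpoint on port 24220 and a 60-second polling interval:

# Minimal sketch: alert on the per-interval increase of the cumulative
# retry_count instead of its absolute value. Endpoint and port are assumptions.
require 'net/http'
require 'json'

previous = {}
loop do
  plugins = JSON.parse(Net::HTTP.get(URI('http://localhost:24220/api/plugins.json')))['plugins']
  plugins.select { |p| p['output_plugin'] }.each do |p|
    delta = p['retry_count'] - previous.fetch(p['plugin_id'], p['retry_count'])
    puts "ALERT: #{p['plugin_id']} retried #{delta} time(s) in the last interval" if delta > 0
    previous[p['plugin_id']] = p['retry_count']
  end
  sleep 60
end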

g3kr commented 3 years ago

@cosmo0920 Thanks for getting back on this. In that case, is there a metric we can use for alerting on anomalies?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

kenhys commented 2 years ago

@g3kr

Maybe the count of retry steps will help.

https://docs.fluentd.org/input/monitor_agent#retry
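
The retry section there describes a per-plugin retry object that is only present while a retry is in progress, which behaves more like a gauge than the cumulative retry_count. A minimal sketch that checks for it, assuming the default HTTP endpoint and that the field names match the linked documentation:

# Minimal sketch: the per-plugin "retry" object (field names assumed from the
# linked docs) exists only while a retry is in progress and disappears again
# after a successful flush, unlike the cumulative retry_count.
require 'net/http'
require 'json'

plugins = JSON.parse(Net::HTTP.get(URI('http://localhost:24220/api/plugins.json')))['plugins']
plugins.select { |p| p['output_plugin'] }.each do |p|
  if p['retry']
    puts "#{p['plugin_id']}: retrying, steps=#{p['retry']['steps']}"
  else
    puts "#{p['plugin_id']}: no retry in progress"
  end
end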