corvus-ch / rabbitmq-cli-consumer

Consume RabbitMQ messages into any cli program
MIT License

Prometheus metrics #63

Closed estahn closed 5 years ago

estahn commented 5 years ago

We're currently scaling consumers based on the number of messages. Our scaling works as follows:

Example:

This assumes processing time averages around 500 ms, which may deviate depending on code changes or other external dependencies.
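
As a rough sketch (the names and numbers here are only hypothetical, not our actual rule), such message-count based scaling might compute the desired number of consumers like this:

    package main

    import (
        "fmt"
        "math"
    )

    // desiredConsumers is a hypothetical helper: total work (messages times
    // average processing time) divided by the time budget for draining the
    // queue gives the number of parallel consumers.
    func desiredConsumers(queueLength int, avgProcessSeconds, targetDrainSeconds float64) int {
        if queueLength == 0 {
            return 1 // keep at least one consumer running
        }
        return int(math.Ceil(float64(queueLength) * avgProcessSeconds / targetDrainSeconds))
    }

    func main() {
        // 1200 queued messages at ~500 ms each, to be drained within 60 s.
        fmt.Println(desiredConsumers(1200, 0.5, 60)) // prints 10
    }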

To monitor processing times, I would like to introduce metrics.

Metrics:

Are there any other metrics that would be useful?

corvus-ch commented 5 years ago

Adding metrics is something I wanted to do for quite some time. So yes, I am all for it.

Are there any other metrics that would be useful?

There certainly are but let's go agile here. Step one would be to add the foundation to collect and expose metrics. After that, metrics can be added one after the other (preferably with issues on their own to keep them small).
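
As a rough sketch, that foundation could be as small as exposing the default registry over HTTP; the package layout and listen address below are placeholders, not a design decision:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // Serve all collectors registered with the default registry.
        // The listen address is a placeholder and would come from configuration.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9090", nil))
    }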

rabbitmq_cli_consumer_errors_total

@estahn Would it be of interest to track exit codes instead of just a general error (I assume non-zero exit codes count as errors)?
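
For illustration, per-exit-code counting could look roughly like this; runAndCount is a hypothetical helper and the error handling is simplified:

    package metrics

    import (
        "os/exec"
        "strconv"

        "github.com/prometheus/client_golang/prometheus"
    )

    // runAndCount runs the consumer command once and counts the outcome in a
    // counter vec labelled by exit_code. Hypothetical helper, not project code.
    func runAndCount(cmd *exec.Cmd, counter *prometheus.CounterVec) error {
        err := cmd.Run()
        code := 0
        if exitErr, ok := err.(*exec.ExitError); ok {
            // The command ran but exited non-zero; record it as a labelled count.
            code = exitErr.ExitCode()
        } else if err != nil {
            return err // the command could not be started at all
        }
        counter.WithLabelValues(strconv.Itoa(code)).Inc()
        return nil
    }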

estahn commented 5 years ago

@corvus-ch I have the following 2 metrics so far:

    package metrics // illustrative package name

    import "github.com/prometheus/client_golang/prometheus"

    // namespace matches the rabbitmq_cli_consumer_ prefix in the output below.
    const namespace = "rabbitmq_cli_consumer"

    var (
        // ProcessCounter counts processed messages, labelled by the exit code
        // of the consumer command.
        ProcessCounter = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "process",
                Help:      "Number of messages processed by exit code.",
            },
            []string{"exit_code"},
        )

        // ProcessDuration observes how long processing a single message takes.
        ProcessDuration = prometheus.NewHistogram(
            prometheus.HistogramOpts{
                Namespace: namespace,
                Name:      "process_duration_seconds",
                Help:      "",
            },
        )
    )

This results in the following:

# HELP rabbitmq_cli_consumer_process Number of jobs processed by the workers
# TYPE rabbitmq_cli_consumer_process counter
rabbitmq_cli_consumer_process{exit_code="0"} 82
# HELP rabbitmq_cli_consumer_process_duration_seconds 
# TYPE rabbitmq_cli_consumer_process_duration_seconds histogram
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.005"} 0
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.01"} 0
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.025"} 0
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.05"} 37
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.1"} 76
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.25"} 80
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.5"} 80
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="1"} 80
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="2.5"} 82
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="5"} 82
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="10"} 82
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="+Inf"} 82
rabbitmq_cli_consumer_process_duration_seconds_sum 6.731983744
rabbitmq_cli_consumer_process_duration_seconds_count 82

I'm happy to rename metrics if other names make more sense.
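
For completeness, a rough sketch of how the two collectors above could be wired up; timeProcess is a hypothetical wrapper, while MustRegister and NewTimer are standard client_golang API:

    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    func init() {
        // Register both collectors with the default registry so they are
        // served on the /metrics endpoint.
        prometheus.MustRegister(ProcessCounter, ProcessDuration)
    }

    // timeProcess measures how long processing a single message takes.
    // process stands in for the actual call into the consumer script.
    func timeProcess(process func()) {
        timer := prometheus.NewTimer(ProcessDuration)
        defer timer.ObserveDuration()
        process()
    }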

estahn commented 5 years ago

@corvus-ch Please see #64 as the MVP. I feel the default histogram buckets might be quite restrictive, at least for rabbitmq_cli_consumer_message_duration_seconds: our service level objective allows a message to stay in the queue for up to 29.5s (given the 500ms processing time allowance), while the largest default bucket is 10s.

The buckets should be configurable IMO but I left it for the next iteration.
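
As a rough idea of what wider buckets could look like (reusing the namespace constant from the snippet above; the metric name follows rabbitmq_cli_consumer_message_duration_seconds and the bucket values are only illustrative):

    // Illustrative only: explicit, wider buckets instead of prometheus.DefBuckets,
    // so that durations up to the ~30 s allowance still land in a finite bucket.
    var MessageDuration = prometheus.NewHistogram(
        prometheus.HistogramOpts{
            Namespace: namespace,
            Name:      "message_duration_seconds",
            Buckets:   []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60},
        },
    )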

corvus-ch commented 5 years ago

There is a new version available containing the changes made in #64: https://github.com/corvus-ch/rabbitmq-cli-consumer/releases/tag/2.3.3.

Thanks to @estahn for the contribution. Additional metrics will be handled in issues of their own.