estahn closed this issue 5 years ago
Adding metrics is something I wanted to do for quite some time. So yes, I am all for it.
Are there any other metrics that would be useful?
There certainly are but let's go agile here. Step one would be to add the foundation to collect and expose metrics. After that, metrics can be added one after the other (preferably with issues on their own to keep them small).
rabbitmq_cli_consumer_errors_total
@estahn Would it be of interest to track exit codes instead of just a general error? (I assume non-zero exit codes count as errors.)
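For illustration, a minimal sketch of how an exit code could be turned into a label value when the consumer runs a command through Go's `os/exec`. The helper name `exitCodeLabel`, the `"unknown"` fallback, and `errNotStarted` are made up for this example and are not taken from the project.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strconv"
)

// errNotStarted is a hypothetical stand-in for a command that never ran.
var errNotStarted = errors.New("command could not be started")

// exitCodeLabel maps the error returned by (*exec.Cmd).Run to a string
// that could serve as the value of an "exit_code" label.
func exitCodeLabel(err error) string {
	if err == nil {
		return "0"
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// ExitCode returns -1 if the process was terminated by a signal.
		return strconv.Itoa(exitErr.ExitCode())
	}
	// Not an exit status at all, e.g. the binary was not found.
	return "unknown"
}

func main() {
	// Assumes a POSIX sh on PATH; without it Run fails before the
	// process starts and the label falls back to "unknown".
	err := exec.Command("sh", "-c", "exit 3").Run()
	fmt.Println("exit_code label:", exitCodeLabel(err))
	fmt.Println("exit_code label:", exitCodeLabel(nil))
	fmt.Println("exit_code label:", exitCodeLabel(errNotStarted))
}
```

Keeping the exit code as a label means a plain error count can still be derived by summing every series whose label is not `0`.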
@corvus-ch I have the following 2 metrics so far:
ProcessCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: namespace,
        Name:      "process",
        Help:      "Number of messages processed by exit code.",
    },
    []string{"exit_code"},
)
ProcessDuration = prometheus.NewHistogram(
    prometheus.HistogramOpts{
        Namespace: namespace,
        Name:      "process_duration_seconds",
        Help:      "",
    },
)
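To show how the two metrics might be fed, here is a dependency-free sketch. A map and a slice stand in for the counter vector and the histogram, and the `recorder` type and `track` method are hypothetical names. With the Prometheus Go client, the two marked lines would presumably become `ProcessCounter.WithLabelValues(code).Inc()` and `ProcessDuration.Observe(seconds)`.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// recorder is a hypothetical stand-in for the two Prometheus metrics.
type recorder struct {
	processed map[string]int // messages processed, keyed by exit code
	durations []float64      // processing time of each message in seconds
}

func newRecorder() *recorder {
	return &recorder{processed: map[string]int{}}
}

// track runs one unit of work, measures how long it took, and records
// both the exit code it returned and the elapsed time.
func (r *recorder) track(run func() int) int {
	start := time.Now()
	code := run()
	elapsed := time.Since(start).Seconds()

	r.processed[strconv.Itoa(code)]++          // counter increment goes here
	r.durations = append(r.durations, elapsed) // histogram observation goes here
	return code
}

func main() {
	r := newRecorder()
	r.track(func() int { time.Sleep(10 * time.Millisecond); return 0 })
	r.track(func() int { return 1 })
	fmt.Println("by exit code:", r.processed, "observations:", len(r.durations))
}
```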
This results in the following:
# HELP rabbitmq_cli_consumer_process Number of jobs processed by the workers
# TYPE rabbitmq_cli_consumer_process counter
rabbitmq_cli_consumer_process{exit_code="0"} 82
# HELP rabbitmq_cli_consumer_process_duration_seconds
# TYPE rabbitmq_cli_consumer_process_duration_seconds histogram
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.005"} 0
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.01"} 0
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.025"} 0
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.05"} 37
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.1"} 76
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.25"} 80
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="0.5"} 80
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="1"} 80
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="2.5"} 82
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="5"} 82
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="10"} 82
rabbitmq_cli_consumer_process_duration_seconds_bucket{le="+Inf"} 82
rabbitmq_cli_consumer_process_duration_seconds_sum 6.731983744
rabbitmq_cli_consumer_process_duration_seconds_count 82
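As a worked reading of this output, the sketch below computes the mean from `_sum` and `_count` and estimates quantiles by linear interpolation inside a bucket, which is roughly the approach PromQL's `histogram_quantile` takes. The bucket values are copied from the output above. The type and function names are made up for the example.

```go
package main

import "fmt"

// bucket mirrors one finite "_bucket" line: upper bound and cumulative count.
type bucket struct {
	le         float64
	cumulative float64
}

// Values copied from the exposition output (the +Inf bucket is the total).
var observed = []bucket{
	{0.005, 0}, {0.01, 0}, {0.025, 0}, {0.05, 37}, {0.1, 76}, {0.25, 80},
	{0.5, 80}, {1, 80}, {2.5, 82}, {5, 82}, {10, 82},
}

const (
	observedSum   = 6.731983744
	observedCount = 82.0
)

// mean is the average duration in seconds.
func mean(sum, count float64) float64 { return sum / count }

// quantile estimates the q-quantile (0 <= q <= 1), assuming observations
// are spread evenly within the bucket that contains the target rank.
func quantile(q float64, buckets []bucket, total float64) float64 {
	rank := q * total
	lower, prev := 0.0, 0.0
	for _, b := range buckets {
		if b.cumulative >= rank {
			if b.cumulative == prev {
				return lower // empty bucket, nothing to interpolate
			}
			return lower + (b.le-lower)*(rank-prev)/(b.cumulative-prev)
		}
		lower, prev = b.le, b.cumulative
	}
	// Rank lies in the +Inf bucket: only the largest finite bound is known.
	return lower
}

func main() {
	fmt.Printf("mean: %.4fs\n", mean(observedSum, observedCount))
	fmt.Printf("p50:  %.4fs\n", quantile(0.50, observed, observedCount))
	fmt.Printf("p95:  %.4fs\n", quantile(0.95, observed, observedCount))
}
```

For these 82 messages the mean comes out at about 82 ms, the estimated median at about 55 ms and the estimated 95th percentile at about 171 ms. The two observations in the 1s to 2.5s bucket pull the mean above the median.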
I'm happy to rename metrics if other names make more sense.
@corvus-ch Please see #64 as an MVP. I feel the default buckets for the histogram might be too small, at least for rabbitmq_cli_consumer_message_duration_seconds. For example, our service level objective allows a message to stay in the queue for up to 29.5s (given a 500ms processing time allowance), but the largest bucket is currently 10s.
The buckets should be configurable IMO but I left it for the next iteration.
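To make the bucket concern concrete, this sketch looks up which bucket a 29.5s observation would fall into. The default bounds are the ones visible in the histogram output earlier in this thread. The extended bounds (15, 30, 60) are purely hypothetical values picked for illustration. In the Prometheus Go client, custom bounds would presumably be passed through the `Buckets` field of `prometheus.HistogramOpts`.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Upper bounds in seconds, as seen in the exposition output.
var defaultBounds = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// Hypothetical extension so that a 30s objective sits on a bucket boundary.
var extendedBounds = append(append([]float64{}, defaultBounds...), 15, 30, 60)

// upperBound returns the "le" value of the first bucket that would count
// an observation of the given duration, or +Inf if none of the finite
// buckets is large enough. bounds must be sorted in ascending order.
func upperBound(seconds float64, bounds []float64) float64 {
	i := sort.SearchFloat64s(bounds, seconds)
	if i == len(bounds) {
		return math.Inf(1)
	}
	return bounds[i]
}

func main() {
	fmt.Println("default: ", upperBound(29.5, defaultBounds))
	fmt.Println("extended:", upperBound(29.5, extendedBounds))
}
```

With the default bounds, a 29.5s message and a 5 minute message both land in the `+Inf` bucket only, so the histogram cannot tell whether the objective was met.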
There is a new version available containing the changes made in #64: https://github.com/corvus-ch/rabbitmq-cli-consumer/releases/tag/2.3.3.
Thanks to @estahn for the contribution. Adding additional metrics shall be handled in issues of their own.
We're currently scaling consumers based on the number of messages. Our scaling works as follows:
Example:
minReplicas
The targetValue for the HPA should be set to minReplicas * 60 or below to account for some slower workers. This assumes the processing time averages around 500ms, which may deviate depending on code changes or external dependencies.
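The factor 60 is not derived in the text. One reading, offered here as an assumption, is that it follows from the 29.5s queue time plus 500ms processing allowance mentioned in this thread: within a 30s window a worker averaging 500ms per message gets through 60 messages. The sketch below shows that arithmetic and how the number shifts when processing time drifts. The function names and the value of `minReplicas` are hypothetical.

```go
package main

import "fmt"

// messagesPerReplica is how many messages one worker can finish within
// the objective window, given its average processing time (both in seconds).
func messagesPerReplica(sloSeconds, avgProcessingSeconds float64) int {
	if avgProcessingSeconds <= 0 {
		return 0
	}
	return int(sloSeconds / avgProcessingSeconds)
}

// targetValue is the queue length the minimum number of replicas can
// drain within the objective window.
func targetValue(minReplicas int, sloSeconds, avgProcessingSeconds float64) int {
	return minReplicas * messagesPerReplica(sloSeconds, avgProcessingSeconds)
}

func main() {
	const minReplicas = 2  // hypothetical value
	const sloSeconds = 30. // assumed: 29.5s in the queue + 0.5s processing

	for _, avg := range []float64{0.5, 0.75} {
		fmt.Printf("avg %.2fs: %d per replica, targetValue <= %d\n",
			avg, messagesPerReplica(sloSeconds, avg),
			targetValue(minReplicas, sloSeconds, avg))
	}
}
```

Under this reading, a drift from 500ms to 750ms lowers the per-replica figure from 60 to 40. That kind of change stays invisible without a processing-time metric.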
In order to monitor processing times I would like to introduce metrics.
Metrics:
rabbitmq_cli_consumer_processing_seconds_bucket{le="+Inf|..."}
rabbitmq_cli_consumer_errors_total
Are there any other metrics that would be useful?