camunda / camunda-docs

Camunda 8 Documentation, including all components and features
https://docs.camunda.io/

Document all the Zeebe prometheus metrics #4095

Open zalanszak opened 3 months ago

zalanszak commented 3 months ago

Document all the Zeebe Prometheus metrics with explanations: what they mean, what exactly is measured, etc.

zalanszak commented 3 months ago

Hi @akeller This is created from a customer ticket, and the customer would really need this documented: https://jira.camunda.com/browse/SUPPORT-23007

Would it be possible to get a timeline, please?

akeller commented 3 months ago

@conceptualshark do you have some capacity for another topic? You would work with @lenaschoenburg on an iterative approach here (basically what could be delivered now as-is, and what could be delivered in a future state, possibly paired with epic delivery).

conceptualshark commented 3 months ago

I can take this on, though I have a few comments/questions:

- Is Grafana the only way to consume this information (where the tooltip is always available)?
- What is the feasibility of updating the tooltips in Grafana to be more comprehensive and explanatory?
- This ticket could be updated to include cleaning up the current metrics page to explain this, perhaps highlighting some important metrics, etc.
- If we still think documenting every option is worthwhile, I could do it from a list that matches the descriptions in Grafana (`/grafana/zeebe.json`).

lenaschoenburg commented 3 months ago

I think we have to decide between documenting our Grafana visualizations and documenting the raw metrics. In the support case, the customer was screenshotting a specific visualization and asked questions about its meaning.

However, our Grafana dashboard isn't the only way to consume our metrics. In fact, we have customers that scrape the raw metrics and feed them into a completely different monitoring stack without Grafana. They would benefit from having the raw metrics documented.

If we decide to document the Grafana dashboard directly, I agree that better tooltips would be the easiest way to do this. If we decide to document the raw metrics, we can provide the documentation in code, where it is available right next to the actual metrics. Take the example from the docs page:

```
# HELP zeebe_stream_processor_records_total Number of events processed by stream processor
# TYPE zeebe_stream_processor_records_total counter
zeebe_stream_processor_records_total{action="written",partition="1",} 20320.0
zeebe_stream_processor_records_total{action="processed",partition="1",} 20320.0
zeebe_stream_processor_records_total{action="skipped",partition="1",} 2153.0
```

The help text is something we provided in code.

Regardless of which option we choose, we can then decide if we take these inline docs and also provide them on docs.camunda.io or not.
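To make the "documentation in code" idea concrete: the `# HELP` line in the scrape output above comes directly from the help string passed when the metric is registered. Zeebe does this in Java via its metrics library; the following is only an illustrative stdlib Python sketch of how a registered help text ends up in the Prometheus text exposition format (all names here are made up for illustration):

```python
# Minimal sketch of the Prometheus text exposition format: the help text
# declared alongside the metric in code is emitted as a "# HELP" line,
# right next to the samples themselves.

def render_counter(name, help_text, samples):
    """Render a counter and its labeled samples in Prometheus text format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        # Trailing comma inside the braces mirrors the Java client's output.
        lines.append(f"{name}{{{label_str},}} {value}")
    return "\n".join(lines)

print(render_counter(
    "zeebe_stream_processor_records_total",
    "Number of events processed by stream processor",
    [({"action": "written", "partition": "1"}, 20320.0),
     ({"action": "processed", "partition": "1"}, 20320.0),
     ({"action": "skipped", "partition": "1"}, 2153.0),
]))
```

Because the help text lives next to the metric definition, improving a description in code automatically improves every scrape of the endpoint.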

conceptualshark commented 2 months ago

@lenaschoenburg So here is the list of metrics I was able to find with any description. I'm not sure if this captures all of them, or if all of these are useful to list/expose in the documentation. Right now the docs separate metrics into groups related to processing, performance, and health.

If the descriptions already included in the code (which I believe is what eventually ends up in Grafana) don't convey enough meaning, it might be more useful to update them there first, and then reflect those changes in the docs.

| Name | Description |
| --- | --- |
| backup_operations_total | Total number of backup operations |
| backup_operations_in_progress | Number of backup operations that are in progress |
| backup_operations_latency | Latency of backup operations |
| checkpoint_records_total | Number of checkpoint records processed by the stream processor. Processing can result in either creating a new checkpoint or ignoring the record. This can be observed by filtering for the label 'result'. |
| checkpoint_position | Position of the last checkpoint |
| checkpoint_id | Id of the last checkpoint |
| process_instance_execution_time | The execution time of processing a complete process instance |
| job_life_time | The lifetime of a job |
| job_activation_time | The time until a job was activated |
| execution_latency_current_cached_instances | The current cached instances for counting their execution latency. If only short-lived instances are handled, this can be observed as the current active instance count. |
| cluster_topology_version | The version of the cluster topology |
| cluster_changes_id | The id of the cluster topology change plan |
| cluster_changes_status | The state of the current cluster topology |
| cluster_changes_version | The version of the cluster topology change plan |
| cluster_changes_operations_pending | Number of pending changes in the current change plan |
| cluster_changes_operations_completed | Number of completed changes in the current change plan |
| cluster_changes_operation_duration | Duration it takes to apply an operation |
| cluster_changes_operation_attempts | Number of attempts per operation type |
| banned_instances_total | Number of banned instances |
| buffered_messages_count | Current number of buffered messages |
| incident_events_total | Number of incident events |
| pending_incidents_total | Number of pending incidents |
| job_events_total | Number of job events |
| evaluated_dmn_elements_total | Number of evaluated DMN elements, including required decisions |
| executed_instances_total | Number of executed (root) process instances |
| element_instance_events_total | Number of process element instance events |
| process_instance_creations_total | Number of created (root) process instances |
| gateway_request_latency | Latency of the round-trip from gateway to broker |
| gateway_failed_requests | Number of failed requests |
| gateway_total_requests | Number of requests |
| long_polling_queued_current | Number of requests currently queued due to long polling |
| stream_processor_batch_processing_duration | Time spent in batch processing (in seconds) |
| stream_processor_batch_processing_commands | Records the distribution of commands in a batch over time |
| stream_processor_batch_processing_post_commit_tasks | Time spent executing post-commit tasks after batch processing (in seconds) |
| stream_processor_batch_processing_retry | Number of times batch processing failed due to reaching the batch limit and was retried |
| stream_processor_error_handling_phase | The phase of error handling |
| replay_events_total | Number of events replayed by the stream processor |
| replay_last_source_position | The last source position the stream processor has replayed |
| replay_event_batch_replay_duration | Time to replay a batch of events (in seconds) |
| stream_processor_records_total | Number of records processed by the stream processor |
| stream_processor_last_processed_position | The last position the stream processor has processed |
| stream_processor_latency | Time between a command being written and it being picked up for processing (in seconds) |
| stream_processor_processing_duration | Time to process a record (in seconds) |
| stream_processor_startup_recovery_time | Time taken for startup and recovery of the stream processor (in ms) |
| stream_processor_state | Describes the state of the stream processor, namely whether it is active or paused |
lenaschoenburg commented 2 months ago

> I'm not sure if this captures all of them, or if all of these are useful to list/expose in the documentation.

I'm pretty sure we have more metrics with help text than that. As for usefulness: I'm not sure either, and I don't know how we'd decide. I think the safest approach would be to just include all of them.

> If the descriptions already included in the code (and what I believe eventually ends up in Grafana)

You are right that these descriptions are coming from our code and we should make any improvements there. However, these are not the descriptions seen as tooltips in Grafana. This goes back to a point made earlier:

> I think we have to decide between documenting our Grafana visualizations and documenting the raw metrics. In the support case, the customer was screenshotting a specific visualization and asked questions about its meaning.

akeller commented 2 months ago

@lenaschoenburg @conceptualshark, do we need to move this to a Product Hub epic and involve PM? As I read through this thread, I see comments about the descriptions and tooltips in Grafana and maybe where we draw the line for what Camunda supports vs. Grafana (a third-party tool).

If not an epic, @conceptualshark can you break this down into iterable chunks?

conceptualshark commented 2 months ago

@akeller I think this likely needs PM involvement/direction. Where it stands for me:

Some guidance on whether these updates should be made, in what order, etc., would be helpful. If the short-term (or only) solution is to document everything, then I still need a list of all metrics with a reasonable description.

akeller commented 2 months ago

Posted on #ask-product-management - https://camunda.slack.com/archives/C02AWA0RF8A/p1725473481869199

lenaschoenburg commented 1 month ago

We got feedback from PM that we should first focus on documenting the raw metrics directly.

Instead of a one-off effort, we want to automatically generate these docs from the provided help text. Unfortunately, there is no single source file containing all of them. Instead, we'll have to do something like starting Zeebe, letting it run for a bit (probably at least until it becomes ready), and then scraping the metrics endpoint.

At that point it should be fairly easy to parse the Prometheus-specific format, filter out everything except the help texts, and put them in, say, a Markdown table.
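As a rough illustration of that last step, here is a stdlib-only Python sketch that takes raw scraped exposition text and keeps only the `# HELP` lines, emitting a Markdown table. The endpoint, scheduling, and output location are all left out; this only shows the parsing, and assumes single-line help texts:

```python
import re

# Match "# HELP <metric_name> <description>" lines from the Prometheus
# text exposition format; everything else (TYPE lines, samples) is skipped.
HELP_RE = re.compile(r"^# HELP (\S+) (.*)$")

def help_texts_to_markdown(exposition: str) -> str:
    """Turn the # HELP lines of a scrape into a Markdown name/description table."""
    rows = ["| Name | Description |", "| --- | --- |"]
    for line in exposition.splitlines():
        m = HELP_RE.match(line)
        if m:
            rows.append(f"| `{m.group(1)}` | {m.group(2)} |")
    return "\n".join(rows)

sample = """\
# HELP zeebe_stream_processor_records_total Number of events processed by stream processor
# TYPE zeebe_stream_processor_records_total counter
zeebe_stream_processor_records_total{action="written",partition="1",} 20320.0
"""
print(help_texts_to_markdown(sample))
```

A real generator would fetch the text from the metrics endpoint once the broker reports ready, but the filtering itself stays this simple.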