camunda / camunda-docs

Camunda 8 Documentation, including all components and features
https://docs.camunda.io/

Document all the Zeebe prometheus metrics #4095

Open zalanszak opened 3 months ago

zalanszak commented 3 months ago

Document all the Zeebe Prometheus metrics with explanations: what they mean, what exactly is measured, etc.

zalanszak commented 3 months ago

Hi @akeller This is created from a customer ticket, and the customer would really need this documented: https://jira.camunda.com/browse/SUPPORT-23007

Would it be possible to get a timeline, please?

akeller commented 3 months ago

@conceptualshark do you have some capacity for another topic? You would work with @lenaschoenburg on an iterative approach here (basically what could be delivered now as-is, and what could be delivered in a future state, possibly paired with epic delivery).

conceptualshark commented 3 months ago

I can take this on, though I have a few comments/questions:

- Is Grafana the only way to consume this information (where the tooltip is always available)?
- What is the feasibility of updating the tooltips in Grafana to be more comprehensive and explanatory?
- This ticket could be updated to include cleaning up the current metrics page to explain this, perhaps highlighting some important metrics, etc.
- If we still think documenting every option is worthwhile, I could do it from a list that matches the descriptions in Grafana (`/grafana/zeebe.json`).

lenaschoenburg commented 3 months ago

I think we have to decide between documenting our Grafana visualizations and documenting the raw metrics. In the support case, the customer was screenshotting a specific visualization and asked questions about its meaning.

However, our Grafana dashboard isn't the only way to consume our metrics. In fact, we have customers that scrape the raw metrics and feed them into a completely different monitoring stack without Grafana. They would benefit from having the raw metrics documented.

If we decide to document the Grafana dashboard directly, I agree that better tooltips would be the easiest way to do this. If we decide to document the raw metrics, we can provide the documentation in code, where it is available right next to the actual metrics. Take the example from the docs page:

```
# HELP zeebe_stream_processor_records_total Number of events processed by stream processor
# TYPE zeebe_stream_processor_records_total counter
zeebe_stream_processor_records_total{action="written",partition="1",} 20320.0
zeebe_stream_processor_records_total{action="processed",partition="1",} 20320.0
zeebe_stream_processor_records_total{action="skipped",partition="1",} 2153.0
```

The help text is something we provided in code.

Regardless of which option we choose, we can then decide if we take these inline docs and also provide them on docs.camunda.io or not.
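To make the "documentation in code" idea concrete: the `# HELP` line in the scrape output above comes directly from the help string passed when the metric is registered. Zeebe does this in Java via its metrics library; the following is only an illustrative stdlib Python sketch of how a registered help text ends up in the Prometheus text exposition format (all names here are made up for illustration):

```python
# Minimal sketch of the Prometheus text exposition format: the help text
# declared alongside the metric in code is emitted as a "# HELP" line,
# right next to the samples themselves.

def render_counter(name, help_text, samples):
    """Render a counter and its labeled samples in Prometheus text format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        # Trailing comma inside the braces mirrors the Java client's output.
        lines.append(f"{name}{{{label_str},}} {value}")
    return "\n".join(lines)

print(render_counter(
    "zeebe_stream_processor_records_total",
    "Number of events processed by stream processor",
    [({"action": "written", "partition": "1"}, 20320.0),
     ({"action": "processed", "partition": "1"}, 20320.0),
     ({"action": "skipped", "partition": "1"}, 2153.0),
]))
```

Because the help text lives next to the metric definition, improving a description in code automatically improves every scrape of the endpoint.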

conceptualshark commented 2 months ago

@lenaschoenburg So here is the list of metrics I was able to find with any description. I'm not sure if this captures all of them, or if all of these are useful to list/expose in the documentation. Right now the docs separate metrics into groups related to processing, performance, and health.

If the descriptions already included in the code (which I believe is what eventually ends up in Grafana) don't convey enough meaning, it might be more useful to update them there first, and then reflect those changes in the docs.

| Name | Description |
| --- | --- |
| backup_operations_total | Total number of backup operations |
| backup_operations_in_progress | Number of backup operations that are in progress |
| backup_operations_latency | Latency of backup operations |
| checkpoint_records_total | Number of checkpoint records processed by the stream processor. Processing can result in either creating a new checkpoint or ignoring the record. This can be observed by filtering for the label 'result'. |
| checkpoint_position | Position of the last checkpoint |
| checkpoint_id | Id of the last checkpoint |
| process_instance_execution_time | The execution time of processing a complete process instance |
| job_life_time | The lifetime of a job |
| job_activation_time | The time until a job was activated |
| execution_latency_current_cached_instances | The current cached instances for counting their execution latency. If only short-lived instances are handled, this can be observed as the current active instance count. |
| cluster_topology_version | The version of the cluster topology |
| cluster_changes_id | The id of the cluster topology change plan |
| cluster_changes_status | The state of the current cluster topology |
| cluster_changes_version | The version of the cluster topology change plan |
| cluster_changes_operations_pending | Number of pending changes in the current change plan |
| cluster_changes_operations_completed | Number of completed changes in the current change plan |
| cluster_changes_operation_duration | Duration it takes to apply an operation |
| cluster_changes_operation_attempts | Number of attempts per operation type |
| banned_instances_total | Number of banned instances |
| buffered_messages_count | Current number of buffered messages |
| incident_events_total | Number of incident events |
| pending_incidents_total | Number of pending incidents |
| job_events_total | Number of job events |
| evaluated_dmn_elements_total | Number of evaluated DMN elements, including required decisions |
| executed_instances_total | Number of executed (root) process instances |
| element_instance_events_total | Number of process element instance events |
| process_instance_creations_total | Number of created (root) process instances |
| gateway_request_latency | Latency of the round-trip from gateway to broker |
| gateway_failed_requests | Number of failed requests |
| gateway_total_requests | Number of requests |
| long_polling_queued_current | Number of requests currently queued due to long polling |
| stream_processor_batch_processing_duration | Time spent in batch processing (in seconds) |
| stream_processor_batch_processing_commands | Records the distribution of commands in a batch over time |
| stream_processor_batch_processing_post_commit_tasks | Time spent executing post-commit tasks after batch processing (in seconds) |
| stream_processor_batch_processing_retry | Number of times batch processing failed due to reaching the batch limit and was retried |
| stream_processor_error_handling_phase | The phase of error handling |
| replay_events_total | Number of events replayed by the stream processor |
| replay_last_source_position | The last source position the stream processor has replayed |
| replay_event_batch_replay_duration | Time to replay a batch of events (in seconds) |
| stream_processor_records_total | Number of records processed by the stream processor |
| stream_processor_last_processed_position | The last position the stream processor has processed |
| stream_processor_latency | Time between a command being written and it being picked up for processing (in seconds) |
| stream_processor_processing_duration | Time to process a record (in seconds) |
| stream_processor_startup_recovery_time | Time taken for startup and recovery of the stream processor (in ms) |
| stream_processor_state | Describes the state of the stream processor, namely whether it is active or paused |
lenaschoenburg commented 2 months ago

> I'm not sure if this captures all of them, or if all of these are useful to list/expose in the documentation.

I'm pretty sure we have more metrics with help text than that. As for usefulness: I'm not sure either, and I don't know how we'd decide. I think the safest approach would be to just include all of them.

> If the descriptions already included in the code (and what I believe eventually ends up in Grafana)

You are right that these descriptions are coming from our code and we should make any improvements there. However, these are not the descriptions seen as tooltips in Grafana. This goes back to a point made earlier:

> I think we have to decide between documenting our Grafana visualizations and documenting the raw metrics. In the support case, the customer was screenshotting a specific visualization and asked questions about its meaning.

akeller commented 2 months ago

@lenaschoenburg @conceptualshark, do we need to move this to a Product Hub epic and involve PM? As I read through this thread, I see comments about the descriptions and tooltips in Grafana and maybe where we draw the line for what Camunda supports vs. Grafana (a third-party tool).

If not an epic, @conceptualshark can you break this down into iterable chunks?

conceptualshark commented 2 months ago

@akeller I think this likely needs PM involvement/direction. Where it stands for me:

Some guidance on whether these updates should be made, in what order, etc., would be helpful. If the short-term (or only) solution is to document everything, then I still need a list of all metrics with a reasonable description.

akeller commented 2 months ago

Posted on #ask-product-management - https://camunda.slack.com/archives/C02AWA0RF8A/p1725473481869199

lenaschoenburg commented 1 month ago

We got feedback from PM that we should first focus on documenting the raw metrics directly.

Instead of a one-off effort, we want to automatically generate these docs from the provided help text. Unfortunately, there is no single source file containing all of them. Instead, we'll have to do something like starting Zeebe, letting it run for a bit (probably at least until it becomes ready), and then scraping the metrics endpoint.

At that point it should be fairly easy to parse the Prometheus-specific format, filter out everything except the help texts, and put them in, say, a Markdown table.
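As a rough illustration of that last step, here is a stdlib-only Python sketch that takes raw scraped exposition text and keeps only the `# HELP` lines, emitting a Markdown table. The endpoint, scheduling, and output location are all left out; this only shows the parsing, and assumes single-line help texts:

```python
import re

# Match "# HELP <metric_name> <description>" lines from the Prometheus
# text exposition format; everything else (TYPE lines, samples) is skipped.
HELP_RE = re.compile(r"^# HELP (\S+) (.*)$")

def help_texts_to_markdown(exposition: str) -> str:
    """Turn the # HELP lines of a scrape into a Markdown name/description table."""
    rows = ["| Name | Description |", "| --- | --- |"]
    for line in exposition.splitlines():
        m = HELP_RE.match(line)
        if m:
            rows.append(f"| `{m.group(1)}` | {m.group(2)} |")
    return "\n".join(rows)

sample = """\
# HELP zeebe_stream_processor_records_total Number of events processed by stream processor
# TYPE zeebe_stream_processor_records_total counter
zeebe_stream_processor_records_total{action="written",partition="1",} 20320.0
"""
print(help_texts_to_markdown(sample))
```

A real generator would fetch the text from the metrics endpoint once the broker reports ready, but the filtering itself stays this simple.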