zalanszak opened 3 months ago
Hi @akeller, this was created from a customer ticket, and the customer would really need this documented: https://jira.camunda.com/browse/SUPPORT-23007
Would it be possible to get a timeline, please?
@conceptualshark do you have some capacity for another topic? You would work with @lenaschoenburg on an iterative approach here (basically what could be delivered now as-is, and what could be delivered in a future state, possibly paired with epic delivery).
I can take this on, though I have a few comments/questions:
- Is Grafana the only way to consume this information (where the tooltip is always available)?
- What is the feasibility of updating the tooltips in Grafana to be more comprehensive/explanatory? This ticket could be updated to include cleaning up the current metrics page to explain this, perhaps highlighting some important metrics, etc.
- If we still think documenting every option is worthwhile, I could do it from a list that matches the descriptions in `grafana/zeebe.json`.
I think we have to decide between documenting our Grafana visualizations and documenting the raw metrics. In the support case, the customer was screenshotting a specific visualization and asked questions about their meaning.
However, our Grafana dashboard isn't the only way to consume our metrics. In fact, we have customers that scrape the raw metrics and feed them into a completely different monitoring stack without Grafana. They would benefit from documentation of the raw metrics.
If we decide to document the Grafana dashboard directly, I agree that better tooltips would be the easiest way to do this. If we decide to document the raw metrics, we can provide documentation in code, which is then available right next to the actual metrics. Take the example from the docs page:
```
# HELP zeebe_stream_processor_records_total Number of events processed by stream processor
# TYPE zeebe_stream_processor_records_total counter
zeebe_stream_processor_records_total{action="written",partition="1",} 20320.0
zeebe_stream_processor_records_total{action="processed",partition="1",} 20320.0
zeebe_stream_processor_records_total{action="skipped",partition="1",} 2153.0
```
The help text is something we provided in code.
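As a rough illustration of what "documentation in code" means here, this is a minimal sketch of how a help text is attached to a metric with the Prometheus Java simpleclient. The class name, builder chain, and method are illustrative assumptions, not Zeebe's actual definition:

```java
import io.prometheus.client.Counter;

public final class StreamProcessorMetrics {
  // Sketch only: attaching a help text with the Prometheus Java simpleclient.
  // Zeebe's real metric definitions may be structured differently.
  static final Counter RECORDS =
      Counter.build()
          .namespace("zeebe")
          .name("stream_processor_records_total")
          .help("Number of events processed by stream processor")
          .labelNames("action", "partition")
          .register();

  void recordProcessed(final int partition) {
    // The help() string above is exactly what the scraped endpoint reports
    // in the "# HELP zeebe_stream_processor_records_total ..." line.
    RECORDS.labels("processed", String.valueOf(partition)).inc();
  }
}
```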
Regardless of which option we choose, we can then decide if we take these inline docs and also provide them on docs.camunda.io or not.
@lenaschoenburg So here is the list of metrics I was able to find with any description. I'm not sure if this captures all of them, or if all of these are useful to list/expose in the documentation. Right now the docs separate out metrics related to processing, performance, and health.
If the descriptions already included in the code (and what I believe eventually ends up in Grafana) don't convey enough meaning, it might be more useful to update them there first, and then reflect those changes in the docs.
| Name | Description |
|---|---|
| backup_operations_total | Total number of backup operations |
| backup_operations_in_progress | Number of backup operations that are in progress |
| backup_operations_latency | Latency of backup operations |
| checkpoint_records_total | Number of checkpoint records processed by stream processor. Processing can result in either creating a new checkpoint or ignoring the record. This can be observed by filtering for the label 'result'. |
| checkpoint_position | Position of the last checkpoint |
| checkpoint_id | Id of the last checkpoint |
| process_instance_execution_time | The execution time of processing a complete process instance |
| job_life_time | The life time of a job |
| job_activation_time | The time until a job was activated |
| execution_latency_current_cached_instances | The current cached instances for counting their execution latency. If only short-lived instances are handled, this can be observed as the current active instance count. |
| cluster_topology_version | The version of the cluster topology |
| cluster_changes_id | The id of the cluster topology change plan |
| cluster_changes_status | The state of the current cluster topology |
| cluster_changes_version | The version of the cluster topology change plan |
| cluster_changes_operations_pending | Number of pending changes in the current change plan |
| cluster_changes_operations_completed | Number of completed changes in the current change plan |
| cluster_changes_operation_duration | Duration it takes to apply an operation |
| cluster_changes_operation_attempts | Number of attempts per operation type |
| banned_instances_total | Number of banned instances |
| buffered_messages_count | Current number of buffered messages |
| incident_events_total | Number of incident events |
| pending_incidents_total | Number of pending incidents |
| job_events_total | Number of job events |
| evaluated_dmn_elements_total | Number of evaluated DMN elements, including required decisions |
| executed_instances_total | Number of executed (root) process instances |
| element_instance_events_total | Number of process element instance events |
| process_instance_creations_total | Number of created (root) process instances |
| gateway_request_latency | Latency of round-trip from gateway to broker |
| gateway_failed_requests | Number of failed requests |
| gateway_total_requests | Number of requests |
| long_polling_queued_current | Number of requests currently queued due to long polling |
| stream_processor_batch_processing_duration | Time spent in batch processing (in seconds) |
| stream_processor_batch_processing_commands | Records the distribution of commands in a batch over time |
| stream_processor_batch_processing_post_commit_tasks | Time spent executing post-commit tasks after batch processing (in seconds) |
| stream_processor_batch_processing_retry | Number of times batch processing failed due to reaching the batch limit and was retried |
| stream_processor_error_handling_phase | The phase of error handling |
| replay_events_total | Number of events replayed by the stream processor |
| replay_last_source_position | The last source position the stream processor has replayed |
| replay_event_batch_replay_duration | Time to replay a batch of events (in seconds) |
| stream_processor_records_total | Number of records processed by stream processor |
| stream_processor_last_processed_position | The last position the stream processor has processed |
| stream_processor_latency | Time between when a command is written and when it is picked up for processing (in seconds) |
| stream_processor_processing_duration | Time for processing a record (in seconds) |
| stream_processor_startup_recovery_time | Time taken for startup and recovery of stream processor (in ms) |
| stream_processor_state | Describes the state of the stream processor, namely whether it is active or paused |
> I'm not sure if this captures all of them, or if all of these are useful to list/expose in the documentation.
I'm pretty sure we have more metrics with help text than that. As for usefulness: I'm not sure either and I don't know how we'd decide. I think the safest approach would be to just include all of them.
> If the descriptions already included in the code (and what I believe eventually ends up in Grafana)
You are right that these descriptions are coming from our code and we should make any improvements there. However, these are not the descriptions seen as tooltips in Grafana. This goes back to a point made earlier:
> I think we have to decide between documenting our Grafana visualizations and documenting the raw metrics. In the support case, the customer was screenshotting a specific visualization and asked questions about their meaning.
@lenaschoenburg @conceptualshark, do we need to move this to a Product Hub epic and involve PM? As I read through this thread, I see comments about the descriptions and tooltips in Grafana and maybe where we draw the line for what Camunda supports vs. Grafana (a third-party tool).
If not an epic, @conceptualshark can you break this down into iterable chunks?
@akeller I think this likely needs PM involvement/direction. Where it stands to me is:
Some guidance on whether these updates should be made, in what order, etc., would be helpful. If the short-term (or only) solution is to document everything, then I still need a list of all metrics with a reasonable description.
Posted on #ask-product-management - https://camunda.slack.com/archives/C02AWA0RF8A/p1725473481869199
We got feedback from PM that we should first focus on documenting the raw metrics directly.
Instead of a one-off effort, we want to automatically generate these docs from the provided help text. Unfortunately, there's no single source file containing all of these. Instead, we'll have to do something like starting Zeebe, letting it run for a bit (probably at least until it becomes ready), and then scraping the metrics endpoint.
At that point it should be fairly easy to parse the Prometheus-specific format, filtering out everything except the help texts and putting them in, say, a markdown table.
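A minimal sketch of such a generator, assuming a local broker exposes its metrics at `http://localhost:9600/actuator/prometheus` (the URL, class name, and output shape are illustrative, not a fixed design):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

// Sketch of the proposed doc generation: scrape a running Zeebe's metrics
// endpoint and turn each "# HELP <name> <text>" line into a markdown row.
// The default URL below is an assumption about the local setup.
public class MetricsDocGenerator {
  public static void main(final String[] args) throws Exception {
    final var endpoint =
        args.length > 0 ? args[0] : "http://localhost:9600/actuator/prometheus";
    System.out.println("| Name | Description |");
    System.out.println("|---|---|");
    try (var reader =
        new BufferedReader(
            new InputStreamReader(URI.create(endpoint).toURL().openStream()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("# HELP ")) {
          // Exposition format: "# HELP <metric_name> <help text>"
          final var rest = line.substring("# HELP ".length());
          final var space = rest.indexOf(' ');
          final var name = space >= 0 ? rest.substring(0, space) : rest;
          final var help = space >= 0 ? rest.substring(space + 1) : "";
          System.out.println("| " + name + " | " + help + " |");
        }
      }
    }
  }
}
```

Running this against a broker that has been up long enough to register all its collectors would produce a table like the one earlier in this thread, kept in sync with the help texts in code.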
Document all the Zeebe Prometheus metrics with an explanation: what they mean, what exactly is measured, etc.