Open JackyWoo opened 10 months ago
@lzydmxy please take a look at this issue
ZooKeeper 3.9 has more metrics than 3.5; we can use it as a reference.
@JackyWoo These are all the monitoring items for zk 3.9. We can start by adding latency metrics for the core Raft path, equivalent to zk's sync_processor_queue_time_ms and sync_processor_queue_flush_time_ms.
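As a rough illustration of what such a latency metric could look like, here is a minimal sketch of a summary that tracks avg/min/max/cnt/sum and renders them in the same `zk_<stat>_<name> <value>` layout as the zk 3.9 output. It is written in Python for brevity, not in RaftKeeper's own language, and `SummaryMetric`, `elapsed_ms` and the `zk` prefix default are hypothetical names, not existing RaftKeeper or ZooKeeper APIs.

```python
from dataclasses import dataclass
from typing import List


def elapsed_ms(enqueued_ns: int, dequeued_ns: int) -> int:
    """Whole milliseconds a request spent queued, from two monotonic ns timestamps."""
    return (dequeued_ns - enqueued_ns) // 1_000_000


@dataclass
class SummaryMetric:
    """Running avg/min/max/cnt/sum for one named metric (hypothetical helper)."""
    name: str
    count: int = 0
    total: int = 0
    minimum: int = 0
    maximum: int = 0

    def add(self, value: int) -> None:
        # The first sample initialises min and max; later samples widen them.
        if self.count == 0:
            self.minimum = self.maximum = value
        else:
            self.minimum = min(self.minimum, value)
            self.maximum = max(self.maximum, value)
        self.count += 1
        self.total += value

    def render(self, prefix: str = "zk") -> List[str]:
        # With no samples, every field reports zero.
        avg = self.total / self.count if self.count else 0.0
        return [
            f"{prefix}_avg_{self.name} {avg}",
            f"{prefix}_min_{self.name} {self.minimum}",
            f"{prefix}_max_{self.name} {self.maximum}",
            f"{prefix}_cnt_{self.name} {self.count}",
            f"{prefix}_sum_{self.name} {self.total}",
        ]
```

In a real server the two timestamps would typically come from a monotonic clock (for example `time.monotonic_ns()` in Python) taken when a request enters and leaves the queue. Percentiles such as p50/p99 would need a histogram or sampling structure on top of this and are left out of the sketch.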
zk_version 3.9.0-1674a5e97f43bc38e9bf56b04f83a7ae34d68249, built on 2023-07-19 09:09 UTC
zk_server_state standalone
zk_ephemerals_count 0
zk_num_alive_connections 1
zk_avg_latency 0.0
zk_outstanding_requests 0
zk_znode_count 5
zk_global_sessions 0
zk_non_mtls_remote_conn_count 0
zk_last_client_response_size -1
zk_packets_sent 1
zk_packets_received 2
zk_max_client_response_size -1
zk_connection_drop_probability 0.0
zk_watch_count 0
zk_auth_failed_count 0
zk_min_latency 0
zk_max_file_descriptor_count 204800
zk_approximate_data_size 44
zk_open_file_descriptor_count 94
zk_local_sessions 0
zk_uptime 29500
zk_max_latency 0
zk_outstanding_tls_handshake 0
zk_min_client_response_size -1
zk_non_mtls_local_conn_count 0
zk_watch_bytes 0
zk_stale_requests_dropped 0
zk_throttled_ops 0
zk_insecure_admin_count 0
zk_connection_rejected 0
zk_sessionless_connections_expired 0
zk_dead_watchers_queued 0
zk_stale_requests 0
zk_connection_drop_count 0
zk_response_packet_cache_hits 0
zk_bytes_received_count 8
zk_add_dead_watcher_stall_time 0
zk_request_throttle_wait_count 0
zk_requests_not_forwarded_to_commit_processor 0
zk_response_packet_cache_misses 0
zk_prep_processor_request_queued 0
zk_stale_replies 0
zk_response_bytes 0
zk_ensemble_auth_fail 0
zk_diff_count 0
zk_connection_revalidate_count 0
zk_quit_leading_due_to_disloyal_voter 0
zk_unrecoverable_error_count 0
zk_unsuccessful_handshake 0
zk_commit_count 0
zk_outstanding_changes_queued 0
zk_request_commit_queued 0
zk_ensemble_auth_skip 0
zk_skip_learner_request_to_next_processor_count 0
zk_proposal_count 0
zk_large_requests_rejected 0
zk_outstanding_changes_removed 0
zk_restore_error_count 0
zk_cnxn_closed_without_zk_server_running 0
zk_looking_count 0
zk_snapshot_rate_limited_count 0
zk_learner_proposal_received_count 0
zk_digest_mismatches_count 0
zk_dead_watchers_cleared 0
zk_ensemble_auth_success 0
zk_learner_commit_received_count 0
zk_snapshot_error_count 0
zk_connection_request_count 0
zk_response_packet_get_children_cache_misses 0
zk_snap_count 0
zk_stale_sessions_expired 0
zk_restore_rate_limited_count 0
zk_response_packet_get_children_cache_hits 0
zk_sync_processor_request_queued 0
zk_tls_handshake_exceeded 0
zk_revalidate_count 0
zk_avg_socket_closing_time 0.0
zk_min_socket_closing_time 0
zk_max_socket_closing_time 0
zk_cnt_socket_closing_time 0
zk_sum_socket_closing_time 0
zk_avg_proposal_process_time 0.0
zk_min_proposal_process_time 0
zk_max_proposal_process_time 0
zk_cnt_proposal_process_time 0
zk_sum_proposal_process_time 0
zk_avg_leader_unavailable_time 0.0
zk_min_leader_unavailable_time 0
zk_max_leader_unavailable_time 0
zk_cnt_leader_unavailable_time 0
zk_sum_leader_unavailable_time 0
zk_avg_node_created_watch_count 0.0
zk_min_node_created_watch_count 0
zk_max_node_created_watch_count 0
zk_cnt_node_created_watch_count 0
zk_sum_node_created_watch_count 0
zk_avg_session_queues_drained 0.0
zk_min_session_queues_drained 0
zk_max_session_queues_drained 0
zk_cnt_session_queues_drained 0
zk_sum_session_queues_drained 0
zk_avg_write_commit_proc_req_queued 0.0
zk_min_write_commit_proc_req_queued 0
zk_max_write_commit_proc_req_queued 0
zk_cnt_write_commit_proc_req_queued 0
zk_sum_write_commit_proc_req_queued 0
zk_avg_connection_token_deficit 0.0
zk_min_connection_token_deficit 0
zk_max_connection_token_deficit 0
zk_cnt_connection_token_deficit 0
zk_sum_connection_token_deficit 0
zk_avg_read_commit_proc_req_queued 0.0
zk_min_read_commit_proc_req_queued 0
zk_max_read_commit_proc_req_queued 0
zk_cnt_read_commit_proc_req_queued 0
zk_sum_read_commit_proc_req_queued 0
zk_avg_node_deleted_watch_count 0.0
zk_min_node_deleted_watch_count 0
zk_max_node_deleted_watch_count 0
zk_cnt_node_deleted_watch_count 0
zk_sum_node_deleted_watch_count 0
zk_avg_startup_txns_load_time 0.0
zk_min_startup_txns_load_time 0
zk_max_startup_txns_load_time 0
zk_cnt_startup_txns_load_time 0
zk_sum_startup_txns_load_time 0
zk_avg_sync_processor_queue_size 0.0
zk_min_sync_processor_queue_size 0
zk_max_sync_processor_queue_size 0
zk_cnt_sync_processor_queue_size 1
zk_sum_sync_processor_queue_size 0
zk_avg_follower_sync_time 0.0
zk_min_follower_sync_time 0
zk_max_follower_sync_time 0
zk_cnt_follower_sync_time 0
zk_sum_follower_sync_time 0
zk_avg_prep_processor_queue_size 0.0
zk_min_prep_processor_queue_size 0
zk_max_prep_processor_queue_size 0
zk_cnt_prep_processor_queue_size 1
zk_sum_prep_processor_queue_size 0
zk_avg_fsynctime 0.0
zk_min_fsynctime 0
zk_max_fsynctime 0
zk_cnt_fsynctime 0
zk_sum_fsynctime 0
zk_avg_inflight_snap_count 0.0
zk_min_inflight_snap_count 0
zk_max_inflight_snap_count 0
zk_cnt_inflight_snap_count 0
zk_sum_inflight_snap_count 0
zk_avg_reads_issued_from_session_queue 0.0
zk_min_reads_issued_from_session_queue 0
zk_max_reads_issued_from_session_queue 0
zk_cnt_reads_issued_from_session_queue 0
zk_sum_reads_issued_from_session_queue 0
zk_avg_restore_time 0.0
zk_min_restore_time 0
zk_max_restore_time 0
zk_cnt_restore_time 0
zk_sum_restore_time 0
zk_avg_learner_request_processor_queue_size 0.0
zk_min_learner_request_processor_queue_size 0
zk_max_learner_request_processor_queue_size 0
zk_cnt_learner_request_processor_queue_size 0
zk_sum_learner_request_processor_queue_size 0
zk_avg_snapshottime 1.0
zk_min_snapshottime 1
zk_max_snapshottime 1
zk_cnt_snapshottime 1
zk_sum_snapshottime 1
zk_avg_unavailable_time 0.0
zk_min_unavailable_time 0
zk_max_unavailable_time 0
zk_cnt_unavailable_time 0
zk_sum_unavailable_time 0
zk_avg_startup_txns_loaded 0.0
zk_min_startup_txns_loaded 0
zk_max_startup_txns_loaded 0
zk_cnt_startup_txns_loaded 0
zk_sum_startup_txns_loaded 0
zk_avg_reads_after_write_in_session_queue 0.0
zk_min_reads_after_write_in_session_queue 0
zk_max_reads_after_write_in_session_queue 0
zk_cnt_reads_after_write_in_session_queue 0
zk_sum_reads_after_write_in_session_queue 0
zk_avg_requests_in_session_queue 0.0
zk_min_requests_in_session_queue 0
zk_max_requests_in_session_queue 0
zk_cnt_requests_in_session_queue 0
zk_sum_requests_in_session_queue 0
zk_avg_write_commit_proc_issued 0.0
zk_min_write_commit_proc_issued 0
zk_max_write_commit_proc_issued 0
zk_cnt_write_commit_proc_issued 0
zk_sum_write_commit_proc_issued 0
zk_avg_prep_process_time 0.0
zk_min_prep_process_time 0
zk_max_prep_process_time 0
zk_cnt_prep_process_time 0
zk_sum_prep_process_time 0
zk_avg_pending_session_queue_size 0.0
zk_min_pending_session_queue_size 0
zk_max_pending_session_queue_size 0
zk_cnt_pending_session_queue_size 0
zk_sum_pending_session_queue_size 0
zk_avg_time_waiting_empty_pool_in_commit_processor_read_ms 0.0
zk_min_time_waiting_empty_pool_in_commit_processor_read_ms 0
zk_max_time_waiting_empty_pool_in_commit_processor_read_ms 0
zk_cnt_time_waiting_empty_pool_in_commit_processor_read_ms 0
zk_sum_time_waiting_empty_pool_in_commit_processor_read_ms 0
zk_avg_commit_process_time 0.0
zk_min_commit_process_time 0
zk_max_commit_process_time 0
zk_cnt_commit_process_time 0
zk_sum_commit_process_time 0
zk_avg_dbinittime 6.0
zk_min_dbinittime 6
zk_max_dbinittime 6
zk_cnt_dbinittime 1
zk_sum_dbinittime 6
zk_avg_inflight_diff_count 0.0
zk_min_inflight_diff_count 0
zk_max_inflight_diff_count 0
zk_cnt_inflight_diff_count 0
zk_sum_inflight_diff_count 0
zk_avg_netty_queued_buffer_capacity 0.0
zk_min_netty_queued_buffer_capacity 0
zk_max_netty_queued_buffer_capacity 0
zk_cnt_netty_queued_buffer_capacity 0
zk_sum_netty_queued_buffer_capacity 0
zk_avg_election_time 0.0
zk_min_election_time 0
zk_max_election_time 0
zk_cnt_election_time 0
zk_sum_election_time 0
zk_avg_commit_commit_proc_req_queued 0.0
zk_min_commit_commit_proc_req_queued 0
zk_max_commit_commit_proc_req_queued 0
zk_cnt_commit_commit_proc_req_queued 0
zk_sum_commit_commit_proc_req_queued 0
zk_avg_sync_processor_batch_size 0.0
zk_min_sync_processor_batch_size 0
zk_max_sync_processor_batch_size 0
zk_cnt_sync_processor_batch_size 0
zk_sum_sync_processor_batch_size 0
zk_avg_node_children_watch_count 0.0
zk_min_node_children_watch_count 0
zk_max_node_children_watch_count 0
zk_cnt_node_children_watch_count 0
zk_sum_node_children_watch_count 0
zk_avg_write_batch_time_in_commit_processor 0.0
zk_min_write_batch_time_in_commit_processor 0
zk_max_write_batch_time_in_commit_processor 0
zk_cnt_write_batch_time_in_commit_processor 0
zk_sum_write_batch_time_in_commit_processor 0
zk_avg_read_commit_proc_issued 0.0
zk_min_read_commit_proc_issued 0
zk_max_read_commit_proc_issued 0
zk_cnt_read_commit_proc_issued 0
zk_sum_read_commit_proc_issued 0
zk_avg_concurrent_request_processing_in_commit_processor 0.0
zk_min_concurrent_request_processing_in_commit_processor 0
zk_max_concurrent_request_processing_in_commit_processor 0
zk_cnt_concurrent_request_processing_in_commit_processor 0
zk_sum_concurrent_request_processing_in_commit_processor 0
zk_avg_observer_sync_time 0.0
zk_min_observer_sync_time 0
zk_max_observer_sync_time 0
zk_cnt_observer_sync_time 0
zk_sum_observer_sync_time 0
zk_avg_node_changed_watch_count 0.0
zk_min_node_changed_watch_count 0
zk_max_node_changed_watch_count 0
zk_cnt_node_changed_watch_count 0
zk_sum_node_changed_watch_count 0
zk_avg_sync_process_time 0.0
zk_min_sync_process_time 0
zk_max_sync_process_time 0
zk_cnt_sync_process_time 0
zk_sum_sync_process_time 0
zk_avg_startup_snap_load_time 1.0
zk_min_startup_snap_load_time 1
zk_max_startup_snap_load_time 1
zk_cnt_startup_snap_load_time 1
zk_sum_startup_snap_load_time 1
zk_avg_prep_processor_queue_time_ms 0.0
zk_min_prep_processor_queue_time_ms 0
zk_max_prep_processor_queue_time_ms 0
zk_cnt_prep_processor_queue_time_ms 0
zk_sum_prep_processor_queue_time_ms 0
zk_p50_prep_processor_queue_time_ms 0
zk_p95_prep_processor_queue_time_ms 0
zk_p99_prep_processor_queue_time_ms 0
zk_p999_prep_processor_queue_time_ms 0
zk_avg_jvm_pause_time_ms 0.0
zk_min_jvm_pause_time_ms 0
zk_max_jvm_pause_time_ms 0
zk_cnt_jvm_pause_time_ms 0
zk_sum_jvm_pause_time_ms 0
zk_p50_jvm_pause_time_ms 0
zk_p95_jvm_pause_time_ms 0
zk_p99_jvm_pause_time_ms 0
zk_p999_jvm_pause_time_ms 0
zk_avg_close_session_prep_time 0.0
zk_min_close_session_prep_time 0
zk_max_close_session_prep_time 0
zk_cnt_close_session_prep_time 0
zk_sum_close_session_prep_time 0
zk_p50_close_session_prep_time 0
zk_p95_close_session_prep_time 0
zk_p99_close_session_prep_time 0
zk_p999_close_session_prep_time 0
zk_avg_read_commitproc_time_ms 0.0
zk_min_read_commitproc_time_ms 0
zk_max_read_commitproc_time_ms 0
zk_cnt_read_commitproc_time_ms 0
zk_sum_read_commitproc_time_ms 0
zk_p50_read_commitproc_time_ms 0
zk_p95_read_commitproc_time_ms 0
zk_p99_read_commitproc_time_ms 0
zk_p999_read_commitproc_time_ms 0
zk_avg_updatelatency 0.0
zk_min_updatelatency 0
zk_max_updatelatency 0
zk_cnt_updatelatency 0
zk_sum_updatelatency 0
zk_p50_updatelatency 0
zk_p95_updatelatency 0
zk_p99_updatelatency 0
zk_p999_updatelatency 0
zk_avg_local_write_committed_time_ms 0.0
zk_min_local_write_committed_time_ms 0
zk_max_local_write_committed_time_ms 0
zk_cnt_local_write_committed_time_ms 0
zk_sum_local_write_committed_time_ms 0
zk_p50_local_write_committed_time_ms 0
zk_p95_local_write_committed_time_ms 0
zk_p99_local_write_committed_time_ms 0
zk_p999_local_write_committed_time_ms 0
zk_avg_request_throttle_queue_time_ms 0.0
zk_min_request_throttle_queue_time_ms 0
zk_max_request_throttle_queue_time_ms 0
zk_cnt_request_throttle_queue_time_ms 0
zk_sum_request_throttle_queue_time_ms 0
zk_p50_request_throttle_queue_time_ms 0
zk_p95_request_throttle_queue_time_ms 0
zk_p99_request_throttle_queue_time_ms 0
zk_p999_request_throttle_queue_time_ms 0
zk_avg_readlatency 0.0
zk_min_readlatency 0
zk_max_readlatency 0
zk_cnt_readlatency 0
zk_sum_readlatency 0
zk_p50_readlatency 0
zk_p95_readlatency 0
zk_p99_readlatency 0
zk_p999_readlatency 0
zk_avg_quorum_ack_latency 0.0
zk_min_quorum_ack_latency 0
zk_max_quorum_ack_latency 0
zk_cnt_quorum_ack_latency 0
zk_sum_quorum_ack_latency 0
zk_p50_quorum_ack_latency 0
zk_p95_quorum_ack_latency 0
zk_p99_quorum_ack_latency 0
zk_p999_quorum_ack_latency 0
zk_avg_om_commit_process_time_ms 0.0
zk_min_om_commit_process_time_ms 0
zk_max_om_commit_process_time_ms 0
zk_cnt_om_commit_process_time_ms 0
zk_sum_om_commit_process_time_ms 0
zk_p50_om_commit_process_time_ms 0
zk_p95_om_commit_process_time_ms 0
zk_p99_om_commit_process_time_ms 0
zk_p999_om_commit_process_time_ms 0
zk_avg_read_final_proc_time_ms 0.0
zk_min_read_final_proc_time_ms 0
zk_max_read_final_proc_time_ms 0
zk_cnt_read_final_proc_time_ms 0
zk_sum_read_final_proc_time_ms 0
zk_p50_read_final_proc_time_ms 0
zk_p95_read_final_proc_time_ms 0
zk_p99_read_final_proc_time_ms 0
zk_p999_read_final_proc_time_ms 0
zk_avg_commit_propagation_latency 0.0
zk_min_commit_propagation_latency 0
zk_max_commit_propagation_latency 0
zk_cnt_commit_propagation_latency 0
zk_sum_commit_propagation_latency 0
zk_p50_commit_propagation_latency 0
zk_p95_commit_propagation_latency 0
zk_p99_commit_propagation_latency 0
zk_p999_commit_propagation_latency 0
zk_avg_dead_watchers_cleaner_latency 0.0
zk_min_dead_watchers_cleaner_latency 0
zk_max_dead_watchers_cleaner_latency 0
zk_cnt_dead_watchers_cleaner_latency 0
zk_sum_dead_watchers_cleaner_latency 0
zk_p50_dead_watchers_cleaner_latency 0
zk_p95_dead_watchers_cleaner_latency 0
zk_p99_dead_watchers_cleaner_latency 0
zk_p999_dead_watchers_cleaner_latency 0
zk_avg_write_final_proc_time_ms 0.0
zk_min_write_final_proc_time_ms 0
zk_max_write_final_proc_time_ms 0
zk_cnt_write_final_proc_time_ms 0
zk_sum_write_final_proc_time_ms 0
zk_p50_write_final_proc_time_ms 0
zk_p95_write_final_proc_time_ms 0
zk_p99_write_final_proc_time_ms 0
zk_p999_write_final_proc_time_ms 0
zk_avg_proposal_ack_creation_latency 0.0
zk_min_proposal_ack_creation_latency 0
zk_max_proposal_ack_creation_latency 0
zk_cnt_proposal_ack_creation_latency 0
zk_sum_proposal_ack_creation_latency 0
zk_p50_proposal_ack_creation_latency 0
zk_p95_proposal_ack_creation_latency 0
zk_p99_proposal_ack_creation_latency 0
zk_p999_proposal_ack_creation_latency 0
zk_avg_proposal_latency 0.0
zk_min_proposal_latency 0
zk_max_proposal_latency 0
zk_cnt_proposal_latency 0
zk_sum_proposal_latency 0
zk_p50_proposal_latency 0
zk_p95_proposal_latency 0
zk_p99_proposal_latency 0
zk_p999_proposal_latency 0
zk_avg_om_proposal_process_time_ms 0.0
zk_min_om_proposal_process_time_ms 0
zk_max_om_proposal_process_time_ms 0
zk_cnt_om_proposal_process_time_ms 0
zk_sum_om_proposal_process_time_ms 0
zk_p50_om_proposal_process_time_ms 0
zk_p95_om_proposal_process_time_ms 0
zk_p99_om_proposal_process_time_ms 0
zk_p999_om_proposal_process_time_ms 0
zk_avg_sync_processor_queue_and_flush_time_ms 0.0
zk_min_sync_processor_queue_and_flush_time_ms 0
zk_max_sync_processor_queue_and_flush_time_ms 0
zk_cnt_sync_processor_queue_and_flush_time_ms 0
zk_sum_sync_processor_queue_and_flush_time_ms 0
zk_p50_sync_processor_queue_and_flush_time_ms 0
zk_p95_sync_processor_queue_and_flush_time_ms 0
zk_p99_sync_processor_queue_and_flush_time_ms 0
zk_p999_sync_processor_queue_and_flush_time_ms 0
zk_avg_propagation_latency 0.0
zk_min_propagation_latency 0
zk_max_propagation_latency 0
zk_cnt_propagation_latency 0
zk_sum_propagation_latency 0
zk_p50_propagation_latency 0
zk_p95_propagation_latency 0
zk_p99_propagation_latency 0
zk_p999_propagation_latency 0
zk_avg_server_write_committed_time_ms 0.0
zk_min_server_write_committed_time_ms 0
zk_max_server_write_committed_time_ms 0
zk_cnt_server_write_committed_time_ms 0
zk_sum_server_write_committed_time_ms 0
zk_p50_server_write_committed_time_ms 0
zk_p95_server_write_committed_time_ms 0
zk_p99_server_write_committed_time_ms 0
zk_p999_server_write_committed_time_ms 0
zk_avg_sync_processor_queue_time_ms 0.0
zk_min_sync_processor_queue_time_ms 0
zk_max_sync_processor_queue_time_ms 0
zk_cnt_sync_processor_queue_time_ms 0
zk_sum_sync_processor_queue_time_ms 0
zk_p50_sync_processor_queue_time_ms 0
zk_p95_sync_processor_queue_time_ms 0
zk_p99_sync_processor_queue_time_ms 0
zk_p999_sync_processor_queue_time_ms 0
zk_avg_sync_processor_queue_flush_time_ms 0.0
zk_min_sync_processor_queue_flush_time_ms 0
zk_max_sync_processor_queue_flush_time_ms 0
zk_cnt_sync_processor_queue_flush_time_ms 0
zk_sum_sync_processor_queue_flush_time_ms 0
zk_p50_sync_processor_queue_flush_time_ms 0
zk_p95_sync_processor_queue_flush_time_ms 0
zk_p99_sync_processor_queue_flush_time_ms 0
zk_p999_sync_processor_queue_flush_time_ms 0
zk_avg_write_commitproc_time_ms 0.0
zk_min_write_commitproc_time_ms 0
zk_max_write_commitproc_time_ms 0
zk_cnt_write_commitproc_time_ms 0
zk_sum_write_commitproc_time_ms 0
zk_p50_write_commitproc_time_ms 0
zk_p95_write_commitproc_time_ms 0
zk_p99_write_commitproc_time_ms 0
zk_p999_write_commitproc_time_ms 0
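To make a listing like the one above easier to compare against RaftKeeper's current output, the following sketch parses the key/value lines and groups the summary statistics by metric name. `parse_mntr` and `group_summaries` are hypothetical helpers. Treating a name as a summary only when it has a `cnt` entry is a heuristic assumed here, so that plain gauges such as `zk_max_file_descriptor_count` are not mistaken for summaries.

```python
import re
from collections import defaultdict
from typing import Dict, Union

Value = Union[int, float, str]

# Statistic prefixes that appear in the listing; p999 is tried before p99.
_STAT = re.compile(r"^zk_(avg|min|max|cnt|sum|p50|p95|p999|p99)_(.+)$")


def _convert(raw: str) -> Value:
    """Return an int or float when the text parses as one, else the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_mntr(text: str) -> Dict[str, Value]:
    """Map each metric key to its value; lines without a value are skipped."""
    values: Dict[str, Value] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # the key, then the rest of the line
        if len(parts) == 2:
            values[parts[0]] = _convert(parts[1].strip())
    return values


def group_summaries(values: Dict[str, Value]) -> Dict[str, Dict[str, Value]]:
    """Group avg/min/max/cnt/sum/pXX entries by their metric name."""
    groups: Dict[str, Dict[str, Value]] = defaultdict(dict)
    for key, value in values.items():
        match = _STAT.match(key)
        if match:
            stat, name = match.groups()
            groups[name][stat] = value
    # Heuristic: only names that report a sample count are real summaries.
    return {name: stats for name, stats in groups.items() if "cnt" in stats}
```

For example, `group_summaries(parse_mntr(output))["dbinittime"]` collects the five `dbinittime` lines into one dictionary keyed by `avg`, `min`, `max`, `cnt` and `sum`.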
Description
Right now the only way to introspect RaftKeeper is the 4lw commands, which are based on ZooKeeper 3.5. We found the metrics too simple to tell what is happening internally.
So we should enhance the monitoring system. The basic idea is to extend the 4lw commands rather than add Prometheus, because that is simpler and does not introduce anything extra for users.
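For context, a 4lw command is just four ASCII bytes sent over a TCP connection to the client port; the server writes its reply and closes the connection. A minimal client sketch follows, assuming RaftKeeper keeps this ZooKeeper-style exchange. `send_four_letter_word` is a hypothetical helper, and the host and port are whatever your deployment uses.

```python
import socket


def send_four_letter_word(host: str, port: int, command: str = "mntr",
                          timeout: float = 5.0) -> str:
    """Send one four-letter-word command and return the full text reply."""
    if len(command) != 4:
        raise ValueError("a 4lw command must be exactly four characters")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        # Read until the server closes the connection.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Because the reply is plain text, any new metric added to the command output is visible to existing tooling without deploying an extra monitoring component.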
The following are some metrics we need:
Are you willing to submit a PR?