Closed jonathon2nd closed 1 month ago
Hmm, I was able to manually get it working, but not that useful right now.
looking at the logs
INFO setup starting manager
INFO controller-runtime.metrics Starting metrics server
INFO starting server {"name": "health probe", "addr": "[::]:8081"}
INFO controller-runtime.metrics Serving metrics server {"bindAddress": "127.0.0.1:8080", "secure": false}
It should be up on port 8080 on the manager container. But even when I manually add it, it is not returning anything.
root@test-5cd46f97b-fpr6w:/# curl http://10.43.90.184:8989/metrics
curl: (7) Failed to connect to 10.43.90.184 port 8989 after 0 ms: Connection refused
root@test-5cd46f97b-fpr6w:/# curl https://10.43.90.184:8443 -k
Unauthorized
I had to modify the metrics server bindAddress from "127.0.0.1:8080" to ":8080" in order to confirm that it was actually running.
hmm, I am not seeing anything particularly useful in the metrics anyway at this time. I was hoping that I would write some prometheus rules to monitor for issues with either the operator or any of the deployed dragonflys, but does not look possible at this time.
# HELP certwatcher_read_certificate_errors_total Total number of certificate read errors
# TYPE certwatcher_read_certificate_errors_total counter
certwatcher_read_certificate_errors_total 0
# HELP certwatcher_read_certificate_total Total number of certificate reads
# TYPE certwatcher_read_certificate_total counter
certwatcher_read_certificate_total 0
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000107512
go_gc_duration_seconds{quantile="0.25"} 0.000107512
go_gc_duration_seconds{quantile="0.5"} 0.000549011
go_gc_duration_seconds{quantile="0.75"} 0.000549011
go_gc_duration_seconds{quantile="1"} 0.000549011
go_gc_duration_seconds_sum 0.000656523
go_gc_duration_seconds_count 2
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 23
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.22.6"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.953184e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 4.706192e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.450269e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 14552
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.96432e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.953184e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.138112e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 5.36576e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 12901
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.138112e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 7.503872e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.7254775598517356e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 27453
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 19200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 31200
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 136800
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 146880
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 5.294592e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.786091e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 884736
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 884736
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.4767368e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 11
# HELP leader_election_master_status Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
leader_election_master_status{name="31079dea.dragonflydb.io"} 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.12
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 3.2641024e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72547755893e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.297506304e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.43.0.1:443",method="GET"} 7
Here are some Prometheus rules that I wrote for monitoring the deployed Dragonfly. Hope it is useful for someone else who searches for it.
Just do a find-replace for api-cache
and update it with whatever name your Dragonfly is deployed as. For me, I just made the name and namespace the same.
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dragonfly-metrics-api-cache
  labels:
    app: api-cache
  namespace: api-cache
spec:
  groups:
    - name: dragonfly-metrics-rules
      rules:
        - alert: DragonflyMasterDown
          expr: count(dragonfly_master{app="api-cache"} == 1) != 1
          for: 5m
          labels:
            severity: critical
            app: api-cache
          annotations:
            summary: "Dragonfly Master is down"
            description: "The Dragonfly master instance for app=api-cache is not functioning as master."
        - alert: DragonflyReplicaCountLow
          expr: count(dragonfly_master{app="api-cache"} == 0) != 2
          for: 5m
          labels:
            severity: critical
            app: api-cache
          annotations:
            summary: "Dragonfly Replica count is not 2"
            description: "There should be exactly 2 replicas for app=api-cache at all times."
        - alert: DragonflyReplicaLagging
          expr: dragonfly_connected_replica_lag_records{app="api-cache"} > 0
          for: 5m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Dragonfly Replica lag detected"
            description: "One or more Dragonfly replicas for app=api-cache are lagging behind the master."
        - alert: DragonflyUptimeLow
          expr: dragonfly_uptime_in_seconds{app="api-cache"} < 600
          for: 10m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Dragonfly uptime is low"
            description: "Dragonfly uptime for app=api-cache is less than 10 minutes."
        - alert: DragonflyMaxClientsReached
          expr: dragonfly_connected_clients{app="api-cache"} / dragonfly_max_clients{app="api-cache"} > 0.9
          for: 1m
          labels:
            severity: critical
            app: api-cache
          annotations:
            summary: "Dragonfly max clients nearly reached"
            description: "Dragonfly connected clients for app=api-cache exceed 90% of the max limit."
        - alert: DragonflyBlockedClients
          expr: dragonfly_blocked_clients{app="api-cache"} > 0
          for: 5m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Blocked clients detected"
            description: "There are blocked clients in the Dragonfly instance for app=api-cache."
        - alert: DragonflyMemoryUsageHigh
          expr: dragonfly_memory_used_bytes{app="api-cache"} / dragonfly_memory_max_bytes{app="api-cache"} > 0.8
          for: 5m
          labels:
            severity: critical
            app: api-cache
          annotations:
            summary: "Dragonfly memory usage is high"
            description: "Memory usage for app=api-cache exceeds 80% of the maximum limit."
        - alert: DragonflyPipelineCacheHigh
          expr: dragonfly_pipeline_cache_bytes{app="api-cache"} > 1000000
          for: 10m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Dragonfly pipeline cache usage is high"
            description: "Pipeline cache for app=api-cache is higher than 1MB."
        - alert: DragonflyConnectionsReceivedHigh
          expr: rate(dragonfly_connections_received_total{app="api-cache"}[5m]) > 100
          for: 5m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Dragonfly connections received rate high"
            description: "Dragonfly for app=api-cache is receiving connections at a high rate."
        - alert: DragonflyNetInputHigh
          expr: rate(dragonfly_net_input_bytes_total{app="api-cache"}[5m]) > 1000000
          for: 5m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Dragonfly net input bytes rate is high"
            description: "Net input bytes for app=api-cache exceed 1MB per second."
        - alert: DragonflyEvictedKeys
          expr: dragonfly_evicted_keys_total{app="api-cache"} > 0
          for: 10m
          labels:
            severity: warning
            app: api-cache
          annotations:
            summary: "Dragonfly keys evicted"
            description: "Dragonfly for app=api-cache has evicted keys due to memory pressure."
Thanks @jonathon2nd for the prometheus rules!
Thanks @jonathon2nd for the prometheus rules!
The metrics endpoint seems to point to a protected endpoint, and unsure of what steps are needed if any. Endpoint is https, but monitor is set to http.
Helm install. REPO URL https://github.com/dragonflydb/dragonfly-operator.git TARGET REVISION v1.1.7
Values: