caas-team / sparrow

A monitoring tool to gather infrastructure network information
Apache License 2.0

refactor: check reconciliation #102

Closed by lvlcn-t 4 months ago

lvlcn-t commented 4 months ago

Motivation

This PR simplifies the check (config) reconciliation logic. The previous implementation used a map of check names to checks, which added unnecessary complexity to registering, deleting, and updating checks.

Closes #98

Changes

A new struct, ChecksController, has been introduced to handle the reconciliation logic of the checks. This component can start, act, and shut down autonomously, which makes registering, updating, and removing checks simpler and more straightforward.
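As a rough illustration of what such a reconciliation step could look like, here is a minimal sketch. It is not the code from this PR: the Check interface, the fakeCheck type, the string-valued config, and the Reconcile and Names methods are all hypothetical stand-ins chosen to keep the example self-contained. It assumes the controller receives the desired set of checks by name, updates the ones that still exist, shuts down the ones that were removed, and registers the new ones.

```go
package main

import (
	"fmt"
	"sort"
)

// Check is a hypothetical, minimal stand-in for sparrow's check interface.
type Check interface {
	Name() string
	SetConfig(cfg string)
	Shutdown()
}

// fakeCheck is an illustrative check that only records what happened to it.
type fakeCheck struct {
	name, cfg string
	stopped   bool
}

func (c *fakeCheck) Name() string         { return c.name }
func (c *fakeCheck) SetConfig(cfg string) { c.cfg = cfg }
func (c *fakeCheck) Shutdown()            { c.stopped = true }

// ChecksController keeps the running checks in a slice (assumed layout).
type ChecksController struct {
	checks []Check
}

// Reconcile aligns the running checks with the desired config (name -> config).
func (cc *ChecksController) Reconcile(desired map[string]string) {
	kept := make([]Check, 0, len(desired))
	seen := make(map[string]bool, len(cc.checks))

	// Update checks that are still configured, shut down the rest.
	for _, c := range cc.checks {
		cfg, ok := desired[c.Name()]
		if !ok {
			c.Shutdown()
			continue
		}
		c.SetConfig(cfg)
		seen[c.Name()] = true
		kept = append(kept, c)
	}

	// Register checks that are configured but not yet running.
	// Names are sorted only to make the resulting order deterministic.
	var added []string
	for name := range desired {
		if !seen[name] {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	for _, name := range added {
		kept = append(kept, &fakeCheck{name: name, cfg: desired[name]})
	}
	cc.checks = kept
}

// Names lists the names of the currently registered checks.
func (cc *ChecksController) Names() []string {
	out := make([]string, 0, len(cc.checks))
	for _, c := range cc.checks {
		out = append(out, c.Name())
	}
	return out
}

func main() {
	cc := &ChecksController{}
	cc.Reconcile(map[string]string{"dns": "20s", "health": "10s", "latency": "20s"})
	fmt.Println(cc.Names()) // [dns health latency]
	cc.Reconcile(map[string]string{"health": "10s"})
	fmt.Println(cc.Names()) // [health]
}
```

The two Reconcile calls loosely mirror the manual test in this PR, where the first interval runs the dns, health, and latency checks and the second one only exposes health metrics.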

In addition, a runtime.Checks struct has been introduced to hold the checks in a slice and provide thread-safe methods to add, delete, and iterate over the checks. This struct complements the runtime.Config struct, which holds the configuration for the checks. The checks are now directly held in a dynamic slice, further simplifying the logic and making the code easier to understand and maintain.
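The idea of a slice guarded by thread-safe methods can be sketched as follows. This is an assumption-laden illustration, not the actual runtime.Checks type: the method names Add, Delete, and Iter, the namedCheck helper, and the choice of sync.RWMutex are all hypothetical. Iter returns a copy of the slice so that callers can range over it without holding the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Check is a hypothetical, minimal stand-in for sparrow's check interface.
type Check interface {
	Name() string
}

// namedCheck is an illustrative check identified only by its name.
type namedCheck string

func (n namedCheck) Name() string { return string(n) }

// Checks holds checks in a slice and guards access with a mutex.
type Checks struct {
	mu     sync.RWMutex
	checks []Check
}

// Add appends a check to the slice.
func (c *Checks) Add(check Check) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.checks = append(c.checks, check)
}

// Delete removes the first check with the same name, if there is one.
func (c *Checks) Delete(check Check) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for i, existing := range c.checks {
		if existing.Name() == check.Name() {
			c.checks = append(c.checks[:i], c.checks[i+1:]...)
			return
		}
	}
}

// Iter returns a snapshot copy that is safe to range over without the lock.
func (c *Checks) Iter() []Check {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]Check, len(c.checks))
	copy(out, c.checks)
	return out
}

func main() {
	var cs Checks
	var wg sync.WaitGroup

	// Register checks concurrently to exercise the locking.
	for _, name := range []string{"dns", "health", "latency"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			cs.Add(namedCheck(n))
		}(name)
	}
	wg.Wait()

	cs.Delete(namedCheck("dns"))
	fmt.Println(len(cs.Iter())) // 2
}
```

Returning a copy from Iter trades a small allocation for simpler call sites: a caller can start or stop checks while iterating without risking a deadlock or a data race on the underlying slice.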

The changes include:

For additional information, see the commits.

Tests done

I've provided several new tests.

Manual e2e tests

Logs:

$ go run main.go run --config .tmp/config/start-config.yaml 
Using config file: .tmp/config/start-config.yaml
{"time":"2024-02-08T12:57:08.155606092+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/installadm/dev/github/sparrow/cmd/run.go","line":81},"msg":"Running sparrow"}
{"time":"2024-02-08T12:57:08.15570043+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/targets/gitlab.go","line":81},"msg":"Starting global gitlabTargetManager reconciler"}
{"time":"2024-02-08T12:57:08.15594968+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/api.(*api).Run.func1","file":"/home/installadm/dev/github/sparrow/pkg/api/api.go","line":76},"msg":"Serving Api","addr":":8080"}
{"time":"2024-02-08T12:57:38.179964595+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*FileLoader).Run","file":"/home/installadm/dev/github/sparrow/pkg/config/file.go","line":79},"msg":"Successfully got local runtime configuration"}
{"time":"2024-02-08T12:57:38.180136378+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/checks/dns.(*DNS).Run","file":"/home/installadm/dev/github/sparrow/pkg/checks/dns/dns.go","line":99},"msg":"Starting dns check","interval":"20s"}
{"time":"2024-02-08T12:57:38.180178287+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/checks/health.(*Health).Run","file":"/home/installadm/dev/github/sparrow/pkg/checks/health/health.go","line":91},"msg":"Starting healthcheck","interval":"10s"}
{"time":"2024-02-08T12:57:38.18027275+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/checks/latency.(*Latency).Run","file":"/home/installadm/dev/github/sparrow/pkg/checks/latency/latency.go","line":97},"msg":"Starting latency check","interval":"20s"}
{"time":"2024-02-08T12:58:08.181125735+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*FileLoader).Run","file":"/home/installadm/dev/github/sparrow/pkg/config/file.go","line":79},"msg":"Successfully got local runtime configuration"}
{"time":"2024-02-08T12:58:08.392343533+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/installadm/dev/github/sparrow/pkg/sparrow/gitlab/gitlab.go","line":149},"msg":"Successfully fetched all target files","files":2}
{"time":"2024-02-08T12:58:38.182087954+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*FileLoader).Run","file":"/home/installadm/dev/github/sparrow/pkg/config/file.go","line":79},"msg":"Successfully got local runtime configuration"}
{"time":"2024-02-08T12:59:08.182573733+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*FileLoader).Run","file":"/home/installadm/dev/github/sparrow/pkg/config/file.go","line":79},"msg":"Successfully got local runtime configuration"}

First reconciliation interval:

# HELP sparrow_dns_check_count Total number of DNS checks performed on the target and if they were successful.
# TYPE sparrow_dns_check_count counter
sparrow_dns_check_count{target="10.x.x.x"} 5
sparrow_dns_check_count{target="www.t-systems.com"} 5
sparrow_dns_check_count{target="www.telekom.de"} 5
# HELP sparrow_dns_duration Histogram of response times for DNS checks in seconds.
# TYPE sparrow_dns_duration histogram
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.005"} 4
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.01"} 4
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.025"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.05"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.1"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.25"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="0.5"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="1"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="2.5"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="5"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="10"} 5
sparrow_dns_duration_bucket{target="10.x.x.x",le="+Inf"} 5
sparrow_dns_duration_sum{target="10.x.x.x"} 0.018343466000000003
sparrow_dns_duration_count{target="10.x.x.x"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.005"} 4
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.01"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.025"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.05"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.1"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.25"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="0.5"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="1"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="2.5"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="5"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="10"} 5
sparrow_dns_duration_bucket{target="www.t-systems.com",le="+Inf"} 5
sparrow_dns_duration_sum{target="www.t-systems.com"} 0.018483263
sparrow_dns_duration_count{target="www.t-systems.com"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.005"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.01"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.025"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.05"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.1"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.25"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="0.5"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="1"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="2.5"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="5"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="10"} 5
sparrow_dns_duration_bucket{target="www.telekom.de",le="+Inf"} 5
sparrow_dns_duration_sum{target="www.telekom.de"} 0.016302045
sparrow_dns_duration_count{target="www.telekom.de"} 5
# HELP sparrow_dns_duration_seconds Duration of DNS resolution attempts in seconds.
# TYPE sparrow_dns_duration_seconds gauge
sparrow_dns_duration_seconds{target="10.x.x.x"} 0.000708375
sparrow_dns_duration_seconds{target="www.t-systems.com"} 0.003439707
sparrow_dns_duration_seconds{target="www.telekom.de"} 0.003415548
# HELP sparrow_dns_status Specifies if the target can be resolved.
# TYPE sparrow_dns_status gauge
sparrow_dns_status{target="10.x.x.x"} 1
sparrow_dns_status{target="www.t-systems.com"} 1
sparrow_dns_status{target="www.telekom.de"} 1
# HELP sparrow_health_up Health of targets
# TYPE sparrow_health_up gauge
sparrow_health_up{target="https://gitlab.devops.telekom.de"} 1
sparrow_health_up{target="https://www.example.com"} 1
sparrow_health_up{target="https://www.google.com"} 1
sparrow_health_up{target="https://www.telekom.de"} 1
# HELP sparrow_latency_count Count of latency checks done
# TYPE sparrow_latency_count counter
sparrow_latency_count{target="https://example.com"} 5
sparrow_latency_count{target="https://google.com"} 5
# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://example.com",le="0.005"} 0
sparrow_latency_duration_bucket{target="https://example.com",le="0.01"} 0
sparrow_latency_duration_bucket{target="https://example.com",le="0.025"} 0
sparrow_latency_duration_bucket{target="https://example.com",le="0.05"} 0
sparrow_latency_duration_bucket{target="https://example.com",le="0.1"} 3
sparrow_latency_duration_bucket{target="https://example.com",le="0.25"} 4
sparrow_latency_duration_bucket{target="https://example.com",le="0.5"} 5
sparrow_latency_duration_bucket{target="https://example.com",le="1"} 5
sparrow_latency_duration_bucket{target="https://example.com",le="2.5"} 5
sparrow_latency_duration_bucket{target="https://example.com",le="5"} 5
sparrow_latency_duration_bucket{target="https://example.com",le="10"} 5
sparrow_latency_duration_bucket{target="https://example.com",le="+Inf"} 5
sparrow_latency_duration_sum{target="https://example.com"} 0.800343934
sparrow_latency_duration_count{target="https://example.com"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="0.005"} 0
sparrow_latency_duration_bucket{target="https://google.com",le="0.01"} 0
sparrow_latency_duration_bucket{target="https://google.com",le="0.025"} 0
sparrow_latency_duration_bucket{target="https://google.com",le="0.05"} 0
sparrow_latency_duration_bucket{target="https://google.com",le="0.1"} 3
sparrow_latency_duration_bucket{target="https://google.com",le="0.25"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="0.5"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="1"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="2.5"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="5"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="10"} 5
sparrow_latency_duration_bucket{target="https://google.com",le="+Inf"} 5
sparrow_latency_duration_sum{target="https://google.com"} 0.544606929
sparrow_latency_duration_count{target="https://google.com"} 5
# HELP sparrow_latency_duration_seconds Latency with status information of targets
# TYPE sparrow_latency_duration_seconds gauge
sparrow_latency_duration_seconds{status="200",target="https://example.com"} 0.100032163
sparrow_latency_duration_seconds{status="200",target="https://google.com"} 0.100295391

Second reconciliation interval:

# HELP sparrow_health_up Health of targets
# TYPE sparrow_health_up gauge
sparrow_health_up{target="https://gitlab.devops.telekom.de"} 1
sparrow_health_up{target="https://www.example.com"} 1
sparrow_health_up{target="https://www.google.com"} 1
sparrow_health_up{target="https://www.telekom.de"} 1

TODO

lvlcn-t commented 4 months ago

Additionally, this PR should close #36 and #43.

lvlcn-t commented 4 months ago

I'll resolve merge conflicts after #106 is merged into this branch.