caas-team / sparrow

A monitoring tool to gather infrastructure network information
Apache License 2.0

Error handling; don't panic. #61

Closed puffitos closed 8 months ago

puffitos commented 8 months ago

Motivation

Addresses #53

Still a WIP. I just wanted to put everything together as documentation so we can discuss it next week.

Changes

A friendly guide for the changes:

Additionally, I couldn't resist the urge and did the following (sorry!):

Tests done

Normal run, metrics from remote config

# HELP sparrow_health_up Health of targets
# TYPE sparrow_health_up gauge
sparrow_health_up{target="https://caas-max-sparrow.caas-t02.telekom.de"} 1
sparrow_health_up{target="https://caas-max-sparrow.caas-t02.telekom.de/checks/health"} 1
sparrow_health_up{target="https://caas-max-sparrow.caas-t21.telekom.de"} 1
sparrow_health_up{target="https://gitlab.devops.telekom.de"} 1
sparrow_health_up{target="https://www.google.com/"} 1
# HELP sparrow_latency_count Count of latency checks done
# TYPE sparrow_latency_count counter
sparrow_latency_count{target="https://caas-max-sparrow.caas-t02.telekom.de"} 15
sparrow_latency_count{target="https://caas-max-sparrow.caas-t21.telekom.de"} 15
sparrow_latency_count{target="https://example.com/"} 14
sparrow_latency_count{target="https://gitlab.devops.telekom.de"} 14
sparrow_latency_count{target="https://google.com/"} 14
sparrow_latency_count{target="https://yam.telekom.de"} 14
# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.005"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.01"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.025"} 11
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.05"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.1"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.25"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="0.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="1"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="2.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="10"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t02.telekom.de",le="+Inf"} 15
sparrow_latency_duration_sum{target="https://caas-max-sparrow.caas-t02.telekom.de"} 0.5367250109999999
sparrow_latency_duration_count{target="https://caas-max-sparrow.caas-t02.telekom.de"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.005"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.01"} 1
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.05"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.1"} 13
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.25"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="0.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="1"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="2.5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="5"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="10"} 15
sparrow_latency_duration_bucket{target="https://caas-max-sparrow.caas-t21.telekom.de",le="+Inf"} 15
sparrow_latency_duration_sum{target="https://caas-max-sparrow.caas-t21.telekom.de"} 0.54715294
sparrow_latency_duration_count{target="https://caas-max-sparrow.caas-t21.telekom.de"} 15
sparrow_latency_duration_bucket{target="https://example.com/",le="0.005"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.01"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.025"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.05"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.1"} 1
sparrow_latency_duration_bucket{target="https://example.com/",le="0.25"} 12
sparrow_latency_duration_bucket{target="https://example.com/",le="0.5"} 12
sparrow_latency_duration_bucket{target="https://example.com/",le="1"} 13
sparrow_latency_duration_bucket{target="https://example.com/",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://example.com/",le="5"} 14
sparrow_latency_duration_bucket{target="https://example.com/",le="10"} 14
sparrow_latency_duration_bucket{target="https://example.com/",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://example.com/"} 2.9955032790000002
sparrow_latency_duration_count{target="https://example.com/"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 3
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 14
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 3.6277407059999995
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://google.com/",le="0.25"} 13
sparrow_latency_duration_bucket{target="https://google.com/",le="0.5"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="1"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="5"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="10"} 14
sparrow_latency_duration_bucket{target="https://google.com/",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://google.com/"} 2.066476341
sparrow_latency_duration_count{target="https://google.com/"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.25"} 11
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="0.5"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="1"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="2.5"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="5"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="10"} 14
sparrow_latency_duration_bucket{target="https://yam.telekom.de",le="+Inf"} 14
sparrow_latency_duration_sum{target="https://yam.telekom.de"} 2.8014992370000003
sparrow_latency_duration_count{target="https://yam.telekom.de"} 14
# HELP sparrow_latency_duration_seconds Latency with status information of targets
# TYPE sparrow_latency_duration_seconds gauge
sparrow_latency_duration_seconds{status="0",target="https://caas-max-sparrow.caas-t02.telekom.de"} 0
sparrow_latency_duration_seconds{status="0",target="https://caas-max-sparrow.caas-t21.telekom.de"} 0
sparrow_latency_duration_seconds{status="0",target="https://example.com/"} 0
sparrow_latency_duration_seconds{status="0",target="https://gitlab.devops.telekom.de"} 0
sparrow_latency_duration_seconds{status="0",target="https://google.com/"} 0
sparrow_latency_duration_seconds{status="0",target="https://yam.telekom.de"} 0
sparrow_latency_duration_seconds{status="200",target="https://caas-max-sparrow.caas-t02.telekom.de"} 0.0246382
sparrow_latency_duration_seconds{status="200",target="https://caas-max-sparrow.caas-t21.telekom.de"} 0.031552157
sparrow_latency_duration_seconds{status="200",target="https://example.com/"} 0.12981956
sparrow_latency_duration_seconds{status="200",target="https://gitlab.devops.telekom.de"} 0.319594274
sparrow_latency_duration_seconds{status="200",target="https://google.com/"} 0.153979783
sparrow_latency_duration_seconds{status="418",target="https://yam.telekom.de"} 0.212991898

Cancel context after 30 secs

A timeout context was passed to the top-level sparrow.Run(ctx) call. Once the deadline passes, we get the following logs. Note the errors in the various checks, the API shutdown and the target manager shutdown. Finally, sparrow exits with code 1, as it should :)

{"time":"2024-01-05T18:54:15.235186838+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":147},"msg":"Successfully fetched all target files","files":8}
{"time":"2024-01-05T18:54:16.622747226+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/bbressi/dev/repos/sparrow/pkg/config/http.go","line":71},"msg":"Successfully got remote runtime configuration"}
{"time":"2024-01-05T18:54:20.957935964+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).handleErrors","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":330},"msg":"Context done, shutting down sparrow"}
{"time":"2024-01-05T18:54:20.957941044+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://google.com/","error":"Get \"https://www.google.com/\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959002788+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":257},"msg":"Failed to run check","name":"health","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959042803+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://example.com/","error":"Get \"https://example.com/\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959082609+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).fetchFile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":171},"msg":"Failed to fetch file","file":"sparrow-dev-cool.de.json","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959101665+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":142},"msg":"Failed fetching files","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959088671+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://yam.telekom.de","error":"Get \"https://yam-united.telekom.com/pages\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959117124+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).refreshTargets","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":203},"msg":"Failed to update global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959044988+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://gitlab.devops.telekom.de","error":"Get \"https://gitlab.devops.telekom.de/users/sign_in\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959173922+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Shutdown","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":149},"msg":"Stopping gitlab reconciliation routine"}
{"time":"2024-01-05T18:54:20.959185023+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.(*Latency).Run","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":90},"msg":"Context canceled","err":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959235329+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":257},"msg":"Failed to run check","name":"latency","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959233375+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/api.go","line":86},"msg":"Failed to serve api","error":"http: Server closed"}
{"time":"2024-01-05T18:54:20.959139457+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":99},"msg":"Failed to get global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959275775+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).handleErrors","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":335},"msg":"Error in sparrow component, shutting down","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959285975+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":94},"msg":"Gitlab target reconciliation ended"}
{"time":"2024-01-05T18:54:20.95933082+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/bbressi/dev/repos/sparrow/cmd/run.go","line":123},"msg":"Error while running sparrow","error":"sparrow was shut down"}

TODO

puffitos commented 8 months ago

Ready for review.