Still a WIP. I just wanted to put everything together as documentation so we can discuss this together next week.
Changes
A friendly guide to the changes:
No panic, only errors
Two new channels for sparrow: one to handle errors and one to handle end of life
All sparrow components start as goroutines in Run and write their errors into the error channel (when something irrecoverable happens)
The errors are handled by a separate goroutine (handleErrors), which may be extended to handle errors in other ways (currently only fatal errors are expected in the error channel)
handleErrors gracefully shuts down all sparrow components (if possible) and returns the error(s) that led to the shutdown, along with any other errors that occurred during shutdown.
Each component should/must have a shutdown function so that it can be terminated gracefully
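To make the discussion easier, here is a minimal, self-contained sketch of the pattern described in the list above. It is not the code from this PR: the `component` interface, the `worker` type, `run`, `demo` and the 5 second shutdown timeout are all made up for illustration, and only the error channel is modelled (the end-of-life channel is left out).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// component is a hypothetical interface: something that runs as a
// goroutine and can be shut down gracefully.
type component interface {
	Run(ctx context.Context) error
	Shutdown(ctx context.Context) error
}

// worker is a toy component. It idles until the context ends, or fails
// after failAfter if that is set. shutdownErr simulates a failing shutdown.
type worker struct {
	name        string
	failAfter   time.Duration
	shutdownErr error
}

func (w worker) Run(ctx context.Context) error {
	var fail <-chan time.Time // a nil channel never fires
	if w.failAfter > 0 {
		fail = time.After(w.failAfter)
	}
	select {
	case <-ctx.Done():
		return nil
	case <-fail:
		return fmt.Errorf("%s: irrecoverable failure", w.name)
	}
}

func (w worker) Shutdown(context.Context) error { return w.shutdownErr }

// run starts every component as a goroutine and blocks in handleErrors.
func run(ctx context.Context, comps ...component) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops components that are still running when we return

	cErr := make(chan error, len(comps)) // buffered so late writers never block
	for _, c := range comps {
		go func(c component) {
			if err := c.Run(ctx); err != nil {
				cErr <- err
			}
		}(c)
	}
	return handleErrors(ctx, cErr, comps)
}

// handleErrors waits for the context to end or for the first fatal error,
// shuts all components down and returns the cause joined with any
// errors that happened during shutdown.
func handleErrors(ctx context.Context, cErr <-chan error, comps []component) error {
	var cause error
	select {
	case <-ctx.Done():
		cause = ctx.Err()
	case cause = <-cErr:
	}

	// Use a fresh context: the original one may already be done.
	sctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	errs := []error{cause}
	for _, c := range comps {
		errs = append(errs, c.Shutdown(sctx)) // nil results are dropped by errors.Join
	}
	return errors.Join(errs...)
}

// demo runs one idle component whose shutdown fails and one component
// that fails at runtime, and returns the combined error text.
func demo() string {
	err := run(context.Background(),
		worker{name: "api", shutdownErr: errors.New("api: failed to close listener")},
		worker{name: "checks", failAfter: 20 * time.Millisecond},
	)
	return err.Error()
}

func main() {
	fmt.Println(demo())
}
```

The error channel is buffered with one slot per component so that a goroutine reporting an error after the shutdown has started does not block forever. `errors.Join` needs Go 1.20 or newer.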
Additionally, I couldn't resist the urge and did the following (sorry!):
Added the source to the logger, so we know where the log is produced
Removed the bulky and unhelpful withGroup from all loggers
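For reference, a small sketch of how the source can be added, assuming the logger is built on the standard `log/slog` package (the shape of the `source` object in the logs below suggests this, but it is an assumption). The function name `logWithSource` and the in-memory buffer are only there for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// logWithSource writes one JSON record to an in-memory buffer and reports
// whether the record contains a "source" key, plus the logged message.
func logWithSource() (hasSource bool, msg string) {
	var buf bytes.Buffer

	// AddSource makes the handler attach function, file and line of the
	// call site to every record.
	h := slog.NewJSONHandler(&buf, &slog.HandlerOptions{AddSource: true})
	slog.New(h).Info("Successfully fetched all target files", "files", 8)

	var rec map[string]any
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return false, ""
	}
	_, hasSource = rec["source"]
	msg, _ = rec["msg"].(string)
	return hasSource, msg
}

func main() {
	hasSource, msg := logWithSource()
	fmt.Println("has source:", hasSource, "msg:", msg)
}
```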
Tests done
[x] Normal run without interruptions with local config
[x] Normal run without interruptions with remote config
[x] Running with a context that will only last 30 seconds -> routines shut down and sparrow exits with 1
[x] Running with misconfigured sparrow config (wrong file path) -> API starts and shuts down immediately -> sparrow exits with 1
A timeout context was set in the top-level sparrow.Run(ctx) call. After running for a bit, we get the following logs. Note the many errors in the various checks, the api shutdown and the target manager shutdown. Finally, sparrow exits with 1 as it should :)
{"time":"2024-01-05T18:54:15.235186838+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":147},"msg":"Successfully fetched all target files","files":8}
{"time":"2024-01-05T18:54:16.622747226+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/config.(*HttpLoader).Run","file":"/home/bbressi/dev/repos/sparrow/pkg/config/http.go","line":71},"msg":"Successfully got remote runtime configuration"}
{"time":"2024-01-05T18:54:20.957935964+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).handleErrors","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":330},"msg":"Context done, shutting down sparrow"}
{"time":"2024-01-05T18:54:20.957941044+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://google.com/","error":"Get \"https://www.google.com/\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959002788+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":257},"msg":"Failed to run check","name":"health","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959042803+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://example.com/","error":"Get \"https://example.com/\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959082609+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).fetchFile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":171},"msg":"Failed to fetch file","file":"sparrow-dev-cool.de.json","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959101665+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/gitlab.(*Client).FetchFiles","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/gitlab/gitlab.go","line":142},"msg":"Failed fetching files","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959088671+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://yam.telekom.de","error":"Get \"https://yam-united.telekom.com/pages\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959117124+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).refreshTargets","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":203},"msg":"Failed to update global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959044988+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.getLatency","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":274},"msg":"Error while checking latency","url":"https://gitlab.devops.telekom.de","error":"Get \"https://gitlab.devops.telekom.de/users/sign_in\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959173922+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Shutdown","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":149},"msg":"Stopping gitlab reconciliation routine"}
{"time":"2024-01-05T18:54:20.959185023+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/checks.(*Latency).Run","file":"/home/bbressi/dev/repos/sparrow/pkg/checks/latency.go","line":90},"msg":"Context canceled","err":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959235329+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).registerCheck.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":257},"msg":"Failed to run check","name":"latency","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959233375+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).api.func1","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/api.go","line":86},"msg":"Failed to serve api","error":"http: Server closed"}
{"time":"2024-01-05T18:54:20.959139457+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":99},"msg":"Failed to get global targets","error":"Get \"https://gitlab.devops.telekom.de/api/v4/projects/237078/repository/files/sparrow-dev-cool.de.json/raw?ref=main\": context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959275775+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow.(*Sparrow).handleErrors","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/run.go","line":335},"msg":"Error in sparrow component, shutting down","error":"context deadline exceeded"}
{"time":"2024-01-05T18:54:20.959285975+01:00","level":"INFO","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*gitlabTargetManager).Reconcile","file":"/home/bbressi/dev/repos/sparrow/pkg/sparrow/targets/gitlab.go","line":94},"msg":"Gitlab target reconciliation ended"}
{"time":"2024-01-05T18:54:20.95933082+01:00","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/cmd.NewCmdRun.run.func1","file":"/home/bbressi/dev/repos/sparrow/cmd/run.go","line":123},"msg":"Error while running sparrow","error":"sparrow was shut down"}
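The timeout test above can be reproduced roughly as follows. This is only a sketch: `runUntilDone` is a stand-in for the real `sparrow.Run(ctx)`, and `exitCode` and `runWithTimeout` are hypothetical helpers. A real entry point would pass the result to `os.Exit` instead of printing it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runUntilDone stands in for sparrow.Run(ctx): it blocks until the
// context ends and then reports why it stopped.
func runUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return fmt.Errorf("sparrow was shut down: %w", ctx.Err())
}

// exitCode maps the result of a run to a process exit code.
func exitCode(err error) int {
	if err != nil {
		return 1
	}
	return 0
}

// runWithTimeout runs the stand-in with a context that lasts ms
// milliseconds. It returns the exit code and whether the run ended
// because the deadline was exceeded.
func runWithTimeout(ms int) (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()

	err := runUntilDone(ctx)
	return exitCode(err), errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// The test in this PR used 30 seconds; 50 ms keeps the sketch quick.
	code, timedOut := runWithTimeout(50)
	fmt.Println("exit code:", code, "deadline exceeded:", timedOut)
}
```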
Motivation
Addresses #53
TODO