getsentry / uptime-checker

Service responsible for powering Sentry's uptime detection features

Implement configuration consumer #58

Open evanpurkhiser opened 2 weeks ago

evanpurkhiser commented 2 weeks ago

> [!NOTE]
> Original design document section

For uptime monitoring of multiple domains we need some way to tell the checker about each of the checks it needs to be making. We will do this through a CheckConfiguration; each check configuration will map back to a subscription_id in Sentry (similar to a monitor_environment in crons).
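To make the message shape concrete, here is a minimal sketch of what a CheckConfiguration could look like as a Rust struct. Only subscription_id comes from the text above; the other fields and their types are illustrative assumptions, not the actual schema.

```rust
use serde::{Deserialize, Serialize};
use uuid::Uuid; // requires the uuid crate with its "serde" feature

/// Hypothetical shape of a single check configuration message.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CheckConfig {
    /// Maps back to a subscription in Sentry (similar to a
    /// monitor_environment in crons).
    pub subscription_id: Uuid,
    /// URL the checker should probe (assumed field).
    pub url: String,
    /// How often the check runs, in seconds (assumed field).
    pub interval_seconds: u64,
    /// Request timeout in milliseconds (assumed field).
    pub timeout_ms: u64,
}
```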

We need a way to propagate these configurations from Sentry to each of the checkers (there may be N checkers).

Approach

We will produce configuration messages into a Kafka topic, uptime-configurations. This topic will be used as persistent storage for configurations. See Is it ok to store data in Kafka?

This topic will receive configurations as they are created / re-configured via Sentry. The topic will have an indefinite retention window and will use Log Compaction to clear out older configuration messages in favor of the most recently produced configuration message for each key. Tombstones for configurations that have been removed will allow configurations that are no longer needed to be deleted.
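As a sketch of what that topic setup might look like (the broker address, partition count, and replication factor below are placeholders), the topic can be created as a compacted topic with unlimited retention via rdkafka's admin client:

```rust
use rdkafka::admin::{AdminClient, AdminOptions, NewTopic, TopicReplication};
use rdkafka::client::DefaultClientContext;
use rdkafka::ClientConfig;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let admin: AdminClient<DefaultClientContext> = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092") // placeholder
        .create()?;

    // Compaction keeps at least the latest message per key and eventually
    // drops a key entirely once a tombstone (null payload) for it is compacted.
    let topic = NewTopic::new("uptime-configurations", 8, TopicReplication::Fixed(3))
        .set("cleanup.policy", "compact")
        // No time-based deletion: compaction alone decides what is retained.
        .set("retention.ms", "-1");

    admin.create_topics(&[topic], &AdminOptions::new()).await?;
    Ok(())
}
```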

The consumer that is part of each uptime-checker will work by reading all configuration messages at boot. Importantly, consumers will never commit offsets. Each time a checker boots, it will read the configurations for the partitions it is assigned to. This does mean a checker will need some time to boot.
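A rough sketch of that boot-time read using the rdkafka crate: the assigned partitions are read from Offset::Beginning, offsets are never committed, and tombstones delete entries from the in-memory map. The topic name comes from the design above; the broker address, group id, and the "stop when a poll window is quiet" heuristic are assumptions (a real implementation would more likely compare its position against the partition high watermarks).

```rust
use std::collections::HashMap;
use std::time::Duration;

use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::{ClientConfig, Message, Offset, TopicPartitionList};

/// Replay every configuration message for the given partitions at boot.
fn load_configs(partitions: &[i32]) -> HashMap<String, Vec<u8>> {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092") // assumption
        .set("group.id", "uptime-checker")          // assumption
        .set("enable.auto.commit", "false")         // never commit offsets
        .create()
        .expect("consumer creation failed");

    // Assign partitions explicitly at the beginning of the log rather than
    // subscribing, so every boot replays the full (compacted) history.
    let mut tpl = TopicPartitionList::new();
    for &partition in partitions {
        tpl.add_partition_offset("uptime-configurations", partition, Offset::Beginning)
            .unwrap();
    }
    consumer.assign(&tpl).expect("partition assignment failed");

    let mut configs: HashMap<String, Vec<u8>> = HashMap::new();
    // Keep polling until a window passes with no messages; a production
    // implementation would instead fetch the high watermarks to know when
    // it has caught up.
    while let Some(result) = consumer.poll(Duration::from_secs(5)) {
        let msg = result.expect("kafka error while reading configurations");
        let key = String::from_utf8_lossy(msg.key().unwrap_or_default()).into_owned();
        match msg.payload() {
            // A tombstone (null payload) means the configuration was deleted.
            None => {
                configs.remove(&key);
            }
            Some(payload) => {
                configs.insert(key, payload.to_vec());
            }
        }
    }
    configs
}
```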

> [!NOTE]
> We profiled this: we can read 1M configs (that look similar to what production configs may look like) from a single partition in a Kafka topic in 13.5s.

Advantages to this approach

Disadvantages

Producing configurations from sentry

We will use an Outbox approach to ensure uptime configurations are written to Kafka. This is necessary to ensure eventual consistency of configurations. The outbox will also be used for deletions, producing tombstones that will eventually be compacted away.
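The actual producer lives in Sentry behind the outbox, but the message shapes the checker-side consumer expects can be sketched independently: upserts are keyed by the subscription_id with the serialized configuration as the payload, and deletions reuse the same key with a null payload (a tombstone). Everything below except the topic name is an illustrative assumption.

```rust
use std::time::Duration;

use rdkafka::producer::{FutureProducer, FutureRecord};
use rdkafka::ClientConfig;

/// Publish an upsert (Some(payload)) or a tombstone (None) for a subscription.
async fn publish(subscription_id: &str, config_json: Option<&[u8]>) {
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092") // assumption
        .create()
        .expect("producer creation failed");

    let mut record =
        FutureRecord::<str, [u8]>::to("uptime-configurations").key(subscription_id);
    if let Some(payload) = config_json {
        record = record.payload(payload);
    }
    // Leaving the payload unset produces a tombstone; log compaction will
    // eventually drop the key (and thus the configuration) entirely.
    producer
        .send(record, Duration::from_secs(5))
        .await
        .expect("failed to produce configuration message");
}
```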

fpacifici commented 1 week ago

A trade-off to keep in mind regarding restarts (which happen at every deployment): when we restart a Kubernetes deployment we do a rolling restart. There are good reasons to avoid a scenario where we stop the previous version and only then start the new version: if the new version does not start, we are down. Blue-green is hard to achieve with Kafka, as the new consumers trigger a rebalance as soon as they come up.

In a nutshell, in Kubernetes this means the new pods are started first. When a pod's readiness check passes (this is controllable by the application), K8s starts directing traffic to it, provided the pod responds to requests; Kafka is different, as it is a poll-based system. Once enough new pods are ready, K8s sends a SIGTERM to the old ones, which terminate cleanly.

The Kafka approach:

The K8s deployment approach behind a Service:

The statefulset approach behind a Service: