A trade-off to keep in mind regarding restarts (which happen at every deployment): when we restart a Kubernetes Deployment we do a rolling restart. There are good reasons to avoid a scenario where we stop the previous version and only then start the new one: if the new version does not start, we are down. Blue-green is hard to achieve with Kafka, as the new consumers trigger a rebalance as soon as they come up.

In a nutshell, in Kubernetes this means the new pods are started first. When a pod's readiness check passes (this is controllable by the application), K8s starts directing traffic to it, provided the pod responds to requests; Kafka is different, as it is a poll system. Once enough pods are ready, K8s sends a SIGTERM to the old ones, which terminate cleanly.
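For the clean-termination side of this, a checker's consumer loop can catch SIGTERM and close its consumer, leaving the group promptly instead of waiting for a session timeout. A minimal sketch, assuming the confluent_kafka client (the topic and group names are hypothetical):

```python
import signal

from confluent_kafka import Consumer

running = True

def handle_sigterm(signum, frame):
    # K8s sends SIGTERM to old pods once enough new pods are ready.
    global running
    running = False

signal.signal(signal.SIGTERM, handle_sigterm)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "uptime-checker",  # assumed group name
})
consumer.subscribe(["uptime-configurations"])

while running:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # ... handle msg ...

# close() leaves the consumer group cleanly, triggering one prompt
# rebalance rather than waiting for the broker's session timeout.
consumer.close()
```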
[Diagrams: the Kafka approach; the K8s Deployment approach behind a Service; the StatefulSet approach behind a Service]
For uptime monitoring of multiple domains, we need some way to tell the uptime checker about each of the checks it needs to make. We will do this through a `CheckConfiguration`; each check configuration will map back to a `subscription_id` in Sentry (similar to a `monitor_environment` in Crons). We need a way to propagate these configurations from Sentry to each of the checkers (there may be N checkers).
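As a rough sketch, a `CheckConfiguration` might look something like the following. Only `subscription_id` comes from the description above; the other fields are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class CheckConfiguration:
    # Maps back to a subscription_id in Sentry (similar to a
    # monitor_environment in Crons).
    subscription_id: str
    # Hypothetical fields describing the check itself.
    url: str
    interval_seconds: int
    timeout_ms: int
```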
## Approach
We will produce configuration messages into a Kafka topic, `uptime-configurations`. This topic will be used as persistent storage for configurations (see "Is it ok to store data in Kafka?"). It will receive configurations as they are created or re-configured via Sentry. The topic will have an indefinite retention window and will use Log Compaction to clear out older configuration messages in favor of the most recently produced message for each key. Tombstones for configurations that have been removed allow deletion of configurations that are no longer needed.
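For illustration, producing an upsert and a tombstone might look like this. A minimal sketch: the JSON payload shape and the confluent_kafka client are assumptions, while keying by `subscription_id` follows from the design above:

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Upsert: after log compaction, the latest message for a key wins,
# so every message is keyed by the subscription_id.
producer.produce(
    "uptime-configurations",
    key=b"subscription-1234",
    value=json.dumps({"url": "https://example.com", "interval_seconds": 60}).encode(),
)

# Delete: a tombstone (null value) for the same key tells compaction
# to eventually drop all messages for this configuration.
producer.produce("uptime-configurations", key=b"subscription-1234", value=None)

producer.flush()
```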
The consumer that is part of each `uptime-checker` will work by reading all configuration messages at boot. Importantly, consumers will never commit offsets. Each time a checker boots, it reads the configurations for the partitions it is assigned to, which does mean a checker will need some time to boot.
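A rough sketch of that boot-time read, assuming JSON config payloads and the confluent_kafka client (the group id is hypothetical):

```python
import json

from confluent_kafka import Consumer, OFFSET_BEGINNING

def on_assign(consumer, partitions):
    # On every (re)assignment, start from the beginning of each assigned
    # partition: offsets are never committed, so we always re-read.
    for partition in partitions:
        partition.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "uptime-checker",   # assumed group name
    "enable.auto.commit": False,    # never commit offsets
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["uptime-configurations"], on_assign=on_assign)

# In-memory view of the configurations for our assigned partitions.
configs = {}

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    key = msg.key().decode()
    if msg.value() is None:
        # Tombstone: the configuration was deleted.
        configs.pop(key, None)
    else:
        configs[key] = json.loads(msg.value())
```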
## Advantages to this approach

- It's easy to have multiple schedulers. We would partition the config topic via the key to guarantee that all updates for a particular uptime monitor go to the same scheduler.
- Updates are received in real time (no polling), so we can update checks more quickly.
- There is no need to fetch huge numbers of configs from an API in Sentry, and no need to worry about TTLs, stale configs, and refetching.
- Rebalancing is useful. We will have multiple schedulers, and if one scheduler goes down Kafka will rebalance its partitions to the remaining consumers. If we used database storage we'd need to figure out how to partition the work ourselves, and what to do when a worker goes down.
## Disadvantages
## Producing configurations from Sentry
We will use an Outbox approach to ensure uptime configurations are written to Kafka. This is necessary to ensure eventual consistency of configurations. The outbox will also handle deletions, producing the tombstones that will later be compacted away.
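A minimal sketch of the outbox pattern as it might apply here. The table schema and relay task are illustrative assumptions, and sqlite3 stands in for Sentry's actual database:

```python
import json
import sqlite3

from confluent_kafka import Producer

# Hypothetical outbox table: a config change and its outbox row are
# written in the same transaction, then relayed to Kafka afterwards.
db = sqlite3.connect("sentry.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        key TEXT NOT NULL,  -- subscription_id, used as the Kafka key
        payload TEXT        -- JSON config, or NULL for a deletion
    )"""
)

def save_config(subscription_id, config):
    # The config write and the outbox row commit (or roll back) together,
    # which is what gives us eventual consistency.
    with db:
        db.execute(
            "INSERT INTO outbox (key, payload) VALUES (?, ?)",
            (subscription_id, json.dumps(config) if config else None),
        )

def drain_outbox(producer):
    # A background task relays outbox rows to Kafka, deleting each row
    # only after the broker has acknowledged the message.
    rows = db.execute("SELECT id, key, payload FROM outbox ORDER BY id").fetchall()
    for row_id, key, payload in rows:
        producer.produce(
            "uptime-configurations",
            key=key.encode(),
            # A NULL payload becomes a tombstone for the compacted topic.
            value=payload.encode() if payload is not None else None,
        )
        producer.flush()
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
```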