JeffersonLab / jaws

An alarm system built on Kafka that supports pluggable sources
https://ace.jlab.org/jaws
MIT License

Topic tuning #43

Open slominskir opened 2 years ago

slominskir commented 2 years ago

A few tuning parameters to consider:

__consumer_offsets topic: Kafka stores consumer offsets in the __consumer_offsets topic. We always rewind all JAWS topics to the beginning and ignore consumer offsets. Further, we currently create a random groupId each time a script is run. This means we're adding lots of consumer offset data and using none of it. See the broker config offsets.retention.minutes, which defaults to 7 days. We may want to revisit our strategy of using random groupIds, or aggressively clean older offset data, or both. We also probably want to disable auto-commit of offsets (no need to waste resources committing and storing offsets). See: https://github.com/confluentinc/confluent-kafka-python/issues/250
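A minimal sketch of the consumer settings discussed above, assuming the confluent-kafka-python client from the linked issue. The keys `group.id`, `auto.offset.reset`, and `enable.auto.commit` are standard librdkafka consumer properties; the helper name `build_consumer_config`, the `jaws-script-` prefix, and the broker address are hypothetical.

```python
import uuid


def build_consumer_config(bootstrap_servers, group_id=None):
    """Build a config dict for a consumer that replays topics from the start.

    If no group_id is given, a random one is generated, mirroring the
    current behaviour described above (the 'jaws-script-' prefix is made up).
    """
    if group_id is None:
        group_id = "jaws-script-" + uuid.uuid4().hex
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Start from the earliest offset when the group has no committed offset.
        "auto.offset.reset": "earliest",
        # Do not commit offsets in the background; they are never read back.
        "enable.auto.commit": False,
    }


config = build_consumer_config("localhost:9092")
# With confluent-kafka-python installed, the dict would be passed as:
#   from confluent_kafka import Consumer
#   consumer = Consumer(config)
```

If nothing is ever committed, a fixed groupId should also start from the beginning on each run, since no committed offset exists for it. That may make random groupIds unnecessary, but this is untested here.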

alarm_activations topic: Unlike many of our registration related topics, our notification related topics contain frequently updating data. The activations topic uses tombstones to indicate not activated. This is a problem. The computed (effective) topics notifications and alarms instead have a state field, so they do not suffer the tombstone problem. There is a difference between a deleted record, which is likely not coming back, and a temporarily unset record. We're using tombstones for both, and it means clients must potentially replay a large number of tombstones to get the latest state. Compaction eventually removes all tombstones, so we can't say "aggressively compact a topic, but always leave the latest value, even if it's a tombstone". Kafka doesn't work that way, and if it did, deleted records would never truly be deleted, as everything you've ever deleted would be stored as a tombstone forever. We do want somewhat aggressive compaction for these frequently updating topics since old alarm data isn't worth much (except for an audit/archiving process) - most clients really want the latest alarm state and want it fast without replaying a ton of old alarm data.

But much of the "noise" data are tombstones, and this is a problem because we require hanging onto the latest value, even if a tombstone, for at least a reasonable cache period (maybe the 1 day default is fine for delete.retention.ms). For example, a materialized view, a cache, is bounded by the amount of time that tombstones are retained and becomes stale afterwards (see: jaws-admin-gui IndexDB). We may want to switch these quickly updating topics to use a different "NO_ALARM" marker instead of tombstones such that they can be compacted aggressively (TTL just long enough to support reboot of audit/archiver, maybe 5 minutes?) - tombstones cannot be compacted that aggressively.
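The staleness problem described above can be illustrated with a simplified, pure-Python model of log compaction. This is a sketch and not how Kafka implements compaction. The `NO_ALARM` value, the record shapes, and the function names are hypothetical.

```python
NO_ALARM = {"state": "NO_ALARM"}  # hypothetical marker value
ACTIVE = {"state": "ACTIVE"}      # hypothetical activation value


def replay(log, view=None):
    """Apply a log of (key, value) records to a view (key -> latest value).

    A value of None is a tombstone and removes the key from the view.
    """
    view = {} if view is None else dict(view)
    for key, value in log:
        if value is None:
            view.pop(key, None)
        else:
            view[key] = value
    return view


def compact(log, drop_tombstones):
    """Simplified compaction model: keep only the last record per key.

    drop_tombstones=True models the log after delete.retention.ms has
    elapsed, when the tombstones themselves have been removed.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [
        (key, value)
        for key, (_, value) in survivors
        if not (drop_tombstones and value is None)
    ]


tombstone_log = [("alarm1", ACTIVE), ("alarm2", ACTIVE), ("alarm1", None)]
marker_log = [("alarm1", ACTIVE), ("alarm2", ACTIVE), ("alarm1", NO_ALARM)]

# A cache that last saw both alarms active, then refreshes from the compacted log.
stale_cache = {"alarm1": ACTIVE, "alarm2": ACTIVE}

after_tombstones = replay(compact(tombstone_log, drop_tombstones=True), stale_cache)
after_markers = replay(compact(marker_log, drop_tombstones=True), stale_cache)

print(after_tombstones["alarm1"])  # {'state': 'ACTIVE'}   (stale: the tombstone is gone)
print(after_markers["alarm1"])     # {'state': 'NO_ALARM'} (the marker survives compaction)
```

In this model the marker record survives compaction as the latest value for its key, so a cache refreshing after aggressive compaction still learns that alarm1 cleared.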

slominskir commented 2 years ago

Note: effective topics are not subject to frequent tombstones, so if clients are primarily intended to monitor those topics then the other topics may be considered semi-internal and we can be less concerned with tombstone cleanup behavior. It may be reasonable to indicate that the alarm-activations and alarm-overrides topics are not intended to be monitored by end-users; instead, monitor the effective-activations topic (possibly renamed effective-notification). Currently, monitoring those two topics incurs a heavy "catch-up" processing cost due to frequent use of tombstones, which cannot be compacted aggressively. Using a "NO_ALARM" marker for alarm-activations is a reasonable fix for that topic, but a similar "NO_OVERRIDE" marker doesn't work so well for the alarm-overrides topic: there are roughly 25,000 alarms and 7 overrides at the moment, meaning the overrides topic would balloon to a roughly fixed 175,000 messages, versus the empirically observed number, which starts around 2,000 and grows to about 8,000 by the end of the day, at which time the tombstones are cleared.
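The arithmetic behind that estimate, as a quick check. It assumes one record per combination of alarm and override, which is how the 175,000 figure appears to be derived; the variable names are illustrative only.

```python
registered_alarms = 25_000  # approximate count quoted above
overrides = 7               # number of overrides quoted above

# With a "NO_OVERRIDE" marker, every (alarm, override) pair always holds a record.
marker_floor = registered_alarms * overrides

# Observed message counts with tombstones, as quoted above.
observed_start, observed_end_of_day = 2_000, 8_000

print(marker_floor)                        # 175000
print(marker_floor / observed_start)       # 87.5
print(marker_floor / observed_end_of_day)  # 21.875
```

Under that assumption, the marker approach would hold roughly 22 to 88 times as many messages on the overrides topic as the tombstone approach does today.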

slominskir commented 2 years ago

Whether to use NoActivation vs tombstone may need to be configurable as it depends on workload. If you have a well-behaved alarm system (i.e. few nuisance alarms), then the tombstone approach is likely better. Specifically, if most of your alarms annunciate infrequently, then the alarm-activations topic will have few messages and few tombstones, and the overall number of messages will be low. If using NoActivation messages instead, then the number of messages will be at LEAST equal to the number of registered alarms at all times, BUT due to aggressive compaction it won't grow much beyond that.

slominskir commented 5 months ago

Closing as the NoActivation approach appears ideal in practice. We can create separate tuning issues for future tuning needs.

slominskir commented 5 months ago

Actually, we still need to dig into this, especially with regard to the consumer offsets generated by the Active tab of the admin GUI. Not completed.