Closed: brharrington closed this issue 6 years ago
If you look at the breakdown of operations we would need:

Core Types

- increment $amount
- max $value

Pattern Types

Mapping to AtlasRegistry
If we send the basic events to a service, could we load them into an AtlasRegistry (#514)?
It would receive `increment $amount` and simply apply it to the counter just like a user would; similarly for `max $value`.
Are we talking about an intermediate service that deserializes values into normal Spectator types which are pushed to Atlas at a preconfigured interval like normal?
Yes it would be an intermediate service. I suppose for simple small scale use-cases it might be possible to run them colocated in the same process though.
This architecture, as far as I understand, still requires the business services to maintain some form of Spectator Atlas metrics. Wouldn't it be better to devise a format where the events are even more raw and type-agnostic? Metrics are maintained as a result of an event that happens in the system. All the required data can be mapped to a set of attributes, that is, key-value pairs. Some of them are used in the metric value calculation and others are used as dimensions. For example, if I want to maintain an incoming request latency metric, response.latency is used for the metric value calculation, while request.method, response.status, deployment.instanceId, registration.service.name, etc. may be the dimensions. If we can somehow define a schema, couldn't we just use it to send this attribute set and then use it on the other side to maintain a metric?
Sure, you could have a registry that produces events. I have an internal implementation of one that just logs when something increments a counter or records a latency for a gauge. Similar idea, but it would need to have more flexibility to send somewhere other than a logger. If that is generally something that would be useful it is worth considering as an option.
I don't think it would fit our use-cases here though. Event systems have the problem that the processing has to scale with the amount of activity or use sampling to reduce the volume of data. We have many use-cases where counters and timers are updated frequently and we want the updates to be cheap enough we can process them without sampling so we have an accurate overall view. So we typically aggregate the values and send a single datapoint that potentially represents a lot of activity.
In the proposal I'm making, rather than discrete events there are essentially two aggregate event types:
```
{type: increment, id, amount}
{type: max, id, amount}
```
The registry would flush at some rate, suppose it is every second. If locally I have 1 million updates per second across 10 distinct basic timers, then that would result in 40 events (10 timers * 4 stats per timer) being sent each second.
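To make that arithmetic concrete, here is a minimal sketch of client-side aggregation in plain Java (illustrative only, not the actual Spectator implementation; class and method names are hypothetical). Only the count statistic is modeled; a real timer would flush four stats per flush interval, which is where the 40 events above come from.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative aggregator: many increments to the same id collapse into a
// single {type: increment, id, amount} event per flush interval.
class AggregatingCounters {
    private final Map<String, Long> sums = new HashMap<>();

    void increment(String id, long amount) {
        sums.merge(id, amount, Long::sum);
    }

    // Flush returns one aggregate event per distinct id and resets state.
    Map<String, Long> flush() {
        Map<String, Long> events = new HashMap<>(sums);
        sums.clear();
        return events;
    }
}

public class Demo {
    public static void main(String[] args) {
        AggregatingCounters counters = new AggregatingCounters();
        // 1 million updates spread across 10 distinct ids...
        for (int i = 0; i < 1_000_000; i++) {
            counters.increment("timer." + (i % 10) + ".count", 1);
        }
        // ...still produce only 10 events at flush time.
        Map<String, Long> events = counters.flush();
        System.out.println(events.size());               // 10
        System.out.println(events.get("timer.0.count")); // 100000
    }
}
```

The cost of a hot update path is a map merge, independent of how many events eventually go over the wire.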
In a pure event system, it would look something like:
```
{type: counter, id, amount}
{type: timer, id, duration}
{type: distribution-summary, id, amount}
{type: gauge, id, value}
```
For the same example this would result in 1 million events being sent each second. Or more likely, it would start getting sampled.
The problem with the aggregate events is the dimensions. The use of aggregate events assumes that:

- the aggregator service knows all the dimensions beforehand, and
- the dimensions are fixed rather than set per update (e.g. per request).
Neither of these assumptions is valid, I believe. The aggregator service cannot know all the dimensions beforehand, and there are dimensions that are set per request, for example.
That is why I was talking about a pure event system where events also contain dimension key-value pairs. The granularity of event generation should match the granularity of metric updates; in web services, this is customarily the request level.
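A raw, type-agnostic event of the kind described above could be sketched as a flat attribute set (plain Java, illustrative only; the attribute keys are the ones from the earlier example):

```java
import java.util.Map;

public class AttributeEventDemo {
    public static void main(String[] args) {
        // One event per request: a flat set of key-value attributes.
        // A schema on the receiving side decides which keys feed the
        // metric value and which become dimensions.
        Map<String, String> event = Map.of(
            "response.latency", "42",                  // metric value
            "request.method", "GET",                   // dimension
            "response.status", "200",                  // dimension
            "deployment.instanceId", "i-123",          // dimension
            "registration.service.name", "checkout");  // dimension

        // A consumer applying the schema reads the value attribute...
        long latencyMs = Long.parseLong(event.get("response.latency"));
        System.out.println(latencyMs); // 42
        // ...and would record it against the remaining dimension keys.
    }
}
```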
I understand that for Netflix load this is probably out of the question. For us, and I think for most other use cases, handling up to 10,000 events per second is not unmanageable. What do you think?
Neither proposal would force the service to have full knowledge of all dimensions in advance or require them to be fully static. In either scheme, the simplest processor for an event on the service would just read it and update the local registry. For example, processing an event would look something like:
```java
void processEvent(Event event) {
  switch (event.type()) {
    // for aggregate events
    case INCREMENT: registry.counter(event.id()).increment(event.value()); break;
    case MAX: registry.maxGauge(event.id()).set(event.value()); break;
    // for pure event per update
    case COUNTER: registry.counter(event.id()).increment(event.value()); break;
    case TIMER: registry.timer(event.id()).record(event.value(), TimeUnit.NANOSECONDS); break;
    case DIST: registry.distributionSummary(event.id()).record(event.value()); break;
    case GAUGE: registry.gauge(event.id()).set(event.value()); break;
  }
}
```
The id is the name plus a set of tags (dimensions). Just like updating the existing AtlasRegistry locally, nothing forces a fixed set of dimensions. One useful aspect for us, though, is that if the ids happen to be the same across instances reporting to the service, then we can just leave those dimensions off, either client side or via a server side transform on the id, and it will automatically aggregate when applied.
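The automatic aggregation can be sketched like this in plain Java (illustrative only; the `instanceId` tag key and string-encoded id are hypothetical, not Spectator's actual representation). Once the per-instance dimension is dropped, ids from different instances collide and their values merge:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ServerSideAggregator {
    private final Map<String, Long> sums = new HashMap<>();

    // Rebuild the id without the per-instance tag so events from
    // different instances map to the same key and aggregate.
    static String stripInstance(String name, Map<String, String> tags) {
        String kept = new TreeMap<>(tags).entrySet().stream()
            .filter(e -> !e.getKey().equals("instanceId"))
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(","));
        return name + "{" + kept + "}";
    }

    void increment(String name, Map<String, String> tags, long amount) {
        sums.merge(stripInstance(name, tags), amount, Long::sum);
    }

    Map<String, Long> snapshot() { return new HashMap<>(sums); }
}

public class AggDemo {
    public static void main(String[] args) {
        ServerSideAggregator agg = new ServerSideAggregator();
        agg.increment("requests", Map.of("status", "200", "instanceId", "i-1"), 3);
        agg.increment("requests", Map.of("status", "200", "instanceId", "i-2"), 4);
        // Both instances collapse onto the same id once instanceId is dropped.
        System.out.println(agg.snapshot().get("requests{status=200}")); // 7
    }
}
```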
Ah yes. I forgot that each unique dimension set is essentially what constitutes a metric. The aggregate events should work well for our use-case then. The step size, though, should be configurable.
@brharrington Anything new about this? Is there a version of the Spectator client + Atlas aggregator that we can test? We are really interested in this one, to incorporate it into a stream-based alerting system.
With things like quick scripts or FaaS, the lifetime is short. The current registries maintain state over a period of time and flush for that time, which does not fit many short-lived use-cases. For these it would likely need to just flush the deltas to something like the internal sidecar.
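One possible shape for that, sketched in plain Java (illustrative only, not an existing Spectator API), is a registry that flushes deltas on demand rather than on a timer, so a short-lived process can flush once before exiting:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative delta-flushing counter store for short-lived processes:
// flushDeltas() returns only what changed since the previous flush, so a
// quick script or FaaS invocation can flush once at shutdown to a sidecar.
class DeltaRegistry {
    private final Map<String, Long> current = new HashMap<>();
    private final Map<String, Long> flushed = new HashMap<>();

    void increment(String id, long amount) {
        current.merge(id, amount, Long::sum);
    }

    Map<String, Long> flushDeltas() {
        Map<String, Long> deltas = new HashMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            long delta = e.getValue() - flushed.getOrDefault(e.getKey(), 0L);
            if (delta != 0) deltas.put(e.getKey(), delta);
        }
        flushed.putAll(current);
        return deltas;
    }
}

public class ShortLivedDemo {
    public static void main(String[] args) {
        DeltaRegistry r = new DeltaRegistry();
        r.increment("invocations", 1);
        System.out.println(r.flushDeltas()); // {invocations=1}
        r.increment("invocations", 2);
        System.out.println(r.flushDeltas()); // {invocations=2}
    }
}
```

Since only deltas are shipped, the receiving side can simply add them into its own counters, the same way the aggregate `increment` events are applied above.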
/cc @jkschneider