Closed: brharrington closed this issue 6 years ago
If you look at the breakdown of operations we would need:

Core Types

- increment $amount
- max $value

Pattern Types

Mapping to AtlasRegistry
If we send the basic events to a service, could we load them into an AtlasRegistry (#514)?
It would receive `increment $amount` and simply apply it to the counter just like a user would; similarly for `max $value`.
Are we talking about an intermediate service that deserializes values into normal Spectator types which are pushed to Atlas at a preconfigured interval like normal?
Yes it would be an intermediate service. I suppose for simple small scale use-cases it might be possible to run them colocated in the same process though.
This architecture, as far as I understand, still requires the business services to maintain some form of Spectator Atlas metrics. Wouldn't it be better to devise a format where the events are even more raw and type-agnostic? Metrics are maintained as a result of an event that happens in the system. All the required data can be mapped to a set of attributes, that is, key-value pairs. Some of them are used in the metric value calculation and others are used as dimensions. For example, if I want to maintain an incoming request latency metric, response.latency is used for the metric value calculation, while request.method, response.status, deployment.instanceId, registration.service.name, etc. may be the dimensions. If we can somehow define a schema, couldn't we just use it to send this attribute set and then use it on the other side to maintain a metric?
Sure, you could have a registry that produces events. I have an internal implementation of one that just logs when something increments a counter or records a latency for a gauge. Similar idea, but it would need to have more flexibility to send somewhere other than a logger. If that is generally something that would be useful it is worth considering as an option.
I don't think it would fit our use-cases here though. Event systems have the problem that the processing has to scale with the amount of activity or use sampling to reduce the volume of data. We have many use-cases where counters and timers are updated frequently and we want the updates to be cheap enough we can process them without sampling so we have an accurate overall view. So we typically aggregate the values and send a single datapoint that potentially represents a lot of activity.
In the proposal I'm making, rather than discrete events there are essentially two aggregate event types:
```
{type: increment, id, amount}
{type: max, id, amount}
```
The registry would flush at some rate, suppose it is every second. If locally I have 1 million updates per second across 10 distinct basic timers, then that would result in 40 events (10 timers * 4 stats per timer) being sent each second.
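To make that arithmetic concrete, here is a minimal sketch of client-side aggregation in plain Java (illustrative only, not the actual Spectator implementation; class and method names are hypothetical). Only the count statistic is modeled; a real timer would flush four stats per flush interval, which is where the 40 events above come from.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative aggregator: many increments to the same id collapse into a
// single {type: increment, id, amount} event per flush interval.
class AggregatingCounters {
    private final Map<String, Long> sums = new HashMap<>();

    void increment(String id, long amount) {
        sums.merge(id, amount, Long::sum);
    }

    // Flush returns one aggregate event per distinct id and resets state.
    Map<String, Long> flush() {
        Map<String, Long> events = new HashMap<>(sums);
        sums.clear();
        return events;
    }
}

public class Demo {
    public static void main(String[] args) {
        AggregatingCounters counters = new AggregatingCounters();
        // 1 million updates spread across 10 distinct ids...
        for (int i = 0; i < 1_000_000; i++) {
            counters.increment("timer." + (i % 10) + ".count", 1);
        }
        // ...still produce only 10 events at flush time.
        Map<String, Long> events = counters.flush();
        System.out.println(events.size());               // 10
        System.out.println(events.get("timer.0.count")); // 100000
    }
}
```

The cost of a hot update path is a map merge, independent of how many events eventually go over the wire.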
In a pure event system, it would look something like:
```
{type: counter, id, amount}
{type: timer, id, duration}
{type: distribution-summary, id, amount}
{type: gauge, id, value}
```
For the same example this would result in 1 million events being sent each second. Or more likely, it would start getting sampled.
The problem with the aggregate events is the dimensions. The use of aggregate events assumes that:

- the aggregator service knows all the dimensions beforehand, and
- the dimensions are fixed rather than set per update (e.g. per request).
Neither of these assumptions is valid, I believe. The aggregator service cannot know all the dimensions beforehand, and there are dimensions that are set per request, for example.
That is why I was talking about a pure event system where events also contain dimension key-value pairs. The granularity of event generation should match the granularity of metric updates; in web services, this is customarily the request level.
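A raw, type-agnostic event of the kind described above could be sketched as a flat attribute set (plain Java, illustrative only; the attribute keys are the ones from the earlier example):

```java
import java.util.Map;

public class AttributeEventDemo {
    public static void main(String[] args) {
        // One event per request: a flat set of key-value attributes.
        // A schema on the receiving side decides which keys feed the
        // metric value and which become dimensions.
        Map<String, String> event = Map.of(
            "response.latency", "42",                  // metric value
            "request.method", "GET",                   // dimension
            "response.status", "200",                  // dimension
            "deployment.instanceId", "i-123",          // dimension
            "registration.service.name", "checkout");  // dimension

        // A consumer applying the schema reads the value attribute...
        long latencyMs = Long.parseLong(event.get("response.latency"));
        System.out.println(latencyMs); // 42
        // ...and would record it against the remaining dimension keys.
    }
}
```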
I understand that for Netflix load this is probably out of the question. For us, and I think for most other use cases, handling up to 10,000 events per second is not unmanageable. What do you think?
Neither proposal would force the service to have full knowledge of all dimensions in advance or require them to be fully static. In either scheme, the simplest processor for an event on the service would just read it and update the local registry. For example, processing an event would look something like:
```java
void processEvent(Event event) {
  switch (event.type()) {
    // for aggregate events
    case INCREMENT: registry.counter(event.id()).increment(event.value()); break;
    case MAX: registry.maxGauge(event.id()).set(event.value()); break;
    // for pure event per update
    case COUNTER: registry.counter(event.id()).increment(event.value()); break;
    case TIMER: registry.timer(event.id()).record(event.value(), TimeUnit.NANOSECONDS); break;
    case DIST: registry.distributionSummary(event.id()).record(event.value()); break;
    case GAUGE: registry.gauge(event.id()).set(event.value()); break;
  }
}
```
The id is the name plus a set of tags (dimensions). Just like updating the existing AtlasRegistry locally, nothing forces a fixed set of dimensions. One useful aspect for us, though, is that if the ids happen to be the same across instances reporting to the service, then we can just leave those dimensions off, either client side or via a server side transform on the id, and it will automatically aggregate when applied.
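The automatic aggregation can be sketched like this in plain Java (illustrative only; the `instanceId` tag key and string-encoded id are hypothetical, not Spectator's actual representation). Once the per-instance dimension is dropped, ids from different instances collide and their values merge:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ServerSideAggregator {
    private final Map<String, Long> sums = new HashMap<>();

    // Rebuild the id without the per-instance tag so events from
    // different instances map to the same key and aggregate.
    static String stripInstance(String name, Map<String, String> tags) {
        String kept = new TreeMap<>(tags).entrySet().stream()
            .filter(e -> !e.getKey().equals("instanceId"))
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(","));
        return name + "{" + kept + "}";
    }

    void increment(String name, Map<String, String> tags, long amount) {
        sums.merge(stripInstance(name, tags), amount, Long::sum);
    }

    Map<String, Long> snapshot() { return new HashMap<>(sums); }
}

public class AggDemo {
    public static void main(String[] args) {
        ServerSideAggregator agg = new ServerSideAggregator();
        agg.increment("requests", Map.of("status", "200", "instanceId", "i-1"), 3);
        agg.increment("requests", Map.of("status", "200", "instanceId", "i-2"), 4);
        // Both instances collapse onto the same id once instanceId is dropped.
        System.out.println(agg.snapshot().get("requests{status=200}")); // 7
    }
}
```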
Ah yes. I forgot that each unique dimension set is essentially what constitutes a metric. The aggregate events should work well for our use-case then. The step size, though, should be configurable.
@brharrington Anything new about this? Is there a version of the Spectator client + Atlas aggregator that we can test? We are really interested in this one, to incorporate it into a stream-based alerting system.
With things like quick scripts or FaaS, the lifetime is short. The current registries maintain state over a period of time and flush for that time, which does not fit many short-lived use-cases. For these it would likely need to just flush the deltas to something like the internal sidecar.
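One possible shape for that, sketched in plain Java (illustrative only, not an existing Spectator API), is a registry that flushes deltas on demand rather than on a timer, so a short-lived process can flush once before exiting:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative delta-flushing counter store for short-lived processes:
// flushDeltas() returns only what changed since the previous flush, so a
// quick script or FaaS invocation can flush once at shutdown to a sidecar.
class DeltaRegistry {
    private final Map<String, Long> current = new HashMap<>();
    private final Map<String, Long> flushed = new HashMap<>();

    void increment(String id, long amount) {
        current.merge(id, amount, Long::sum);
    }

    Map<String, Long> flushDeltas() {
        Map<String, Long> deltas = new HashMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            long delta = e.getValue() - flushed.getOrDefault(e.getKey(), 0L);
            if (delta != 0) deltas.put(e.getKey(), delta);
        }
        flushed.putAll(current);
        return deltas;
    }
}

public class ShortLivedDemo {
    public static void main(String[] args) {
        DeltaRegistry r = new DeltaRegistry();
        r.increment("invocations", 1);
        System.out.println(r.flushDeltas()); // {invocations=1}
        r.increment("invocations", 2);
        System.out.println(r.flushDeltas()); // {invocations=2}
    }
}
```

Since only deltas are shipped, the receiving side can simply add them into its own counters, the same way the aggregate `increment` events are applied above.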
/cc @jkschneider