TritonDataCenter / node-artedi

a library for measuring fish

Aggregated metrics #4

Open KodyKantor opened 7 years ago

KodyKantor commented 7 years ago

Currently, node-artedi serializes and exports metrics in the same form that the user specifies. Take this example:

// Increment this counter with three labels.
counter.increment({
    one: 'value',
    two: 'value',
    three: 'value'
});

Every time we call collector.collect(), the three labels ('one', 'two', 'three') will be serialized for this counter. This is great for simple applications. For more complex applications, and to reduce the possibility of accidentally creating one-off metrics, we've thought of another solution: the user could specify, at collector-creation time, the labels they would like collected. Any labels used in an operation that were not statically defined (dynamic labels) would simply be ignored. For example, the user could define the labels 'one' and 'two' as labels to keep; label 'three' would be aggregated away. This helps to resolve a couple of problems: metric cardinality and composition.
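To make this concrete, here is a minimal sketch of what 'aggregating away' dynamic labels could look like; this is not the node-artedi implementation, and the filterLabels helper is a made-up name:

// Hypothetical helper: keep only labels that were statically declared.
// Everything else is dropped before the observation is recorded.
function filterLabels(staticLabels, labels) {
    var kept = {};
    staticLabels.forEach(function (name) {
        if (labels.hasOwnProperty(name)) {
            kept[name] = labels[name];
        }
    });
    return (kept);
}

// With staticLabels = [ 'one', 'two' ]:
// filterLabels([ 'one', 'two' ], { one: 'a', two: 'b', three: 'c' })
// returns { one: 'a', two: 'b' } -- label 'three' is aggregated away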

Metric cardinality is the hallmark issue in metric collection systems of our size. The more labels we use, the more data we maintain in memory and serialize. The more data we serialize, the slower our queries are, and the more data we store on disk at the server.

Composition is a problem that arises when we pass instances of an object between libraries. There are multiple problems that come up with composition, but for this topic we are focusing on the metrics that libraries collect on behalf of the greater application. Let's say that we instrument the node-cueball library. The node-cueball library may create labels for all sorts of internal data that is only important to developers. For example, maybe it creates a label for 'remoteIP'. remoteIP may be a reasonable thing to collect in Muskie, but not in CloudAPI due to the number of possible values for remoteIP. By declaring static labels in CloudAPI, we can ensure that 'remoteIP' is not tracked, which will help us avoid the metric cardinality problem.

There are a couple options for what we do if only a subset of the static labels are assigned values.

Option 1) Drop the labels. The unspecified labels would simply be dropped.

Option 2) Assign default values. The user could define default values up front, which will be filled in if they are not given specific values later.
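For illustration, both options could be handled in the same place where dynamic labels are dropped; the applyStaticLabels helper and its defaults argument below are hypothetical, not proposed API:

// Hypothetical: resolve the static labels for one observation.
// Option 1 behavior: a static label with no value is simply omitted.
// Option 2 behavior: a static label with no value falls back to a default.
function applyStaticLabels(staticLabels, defaults, labels) {
    var out = {};
    staticLabels.forEach(function (name) {
        if (labels.hasOwnProperty(name)) {
            out[name] = labels[name];
        } else if (defaults && defaults.hasOwnProperty(name)) {
            out[name] = defaults[name];
        }
        // With no value and no default, the label is dropped (option 1).
    });
    return (out);
}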

We also need to decide where we define static labels. I can think of three options for this.

1) Assign static labels when the parent Collector object is created. This is the only option that lets us reliably control which metrics are serialized by libraries that we are instrumenting. By defining which labels to track in the parent, all children are required to comply. The difficulty with this is that if we want to get a new set of labels out of a library, the wrapping library/application also requires a code change (to allow the new label to be serialized). This makes upgrading difficult and slow. On the other hand, we avoid serializing (and maintaining in memory) possibly useless labels.

2) Assign static labels when the child collectors (gauge, counter, histogram) are created. This option gives us more flexibility and avoids the 'upgrade' problem from option 1), but it hands control of labels to each library, which is exactly what we are trying to avoid, so it doesn't solve the problem.

3) Assign static labels when the child collectors have their observation functions called (increment(), observe(), etc). This would be identical to what is currently done, and would be the same as what we call 'dynamic labels.' This doesn't solve the problem, so we will not take this option.

After considering this, I propose that we assign static labels when the parent Collector object is created, and drop static labels that are not used. Those are option 1) from both sets of questions. The exact syntax for this has to be worked out.

After this is implemented, our result will look something like this (syntax will change):

var collector = mod_artedi.createCollector({
  staticLabels: [ 'method', 'statusCode' ]
});

var counter = collector.counter({
  name: 'some_counter',
  help: 'some help'
});

counter.increment({
  method: 'putobject',
  statusCode: 203,
  user: 'kkantor',
  remoteIP: '1.2.3.4'
});

counter.increment({
  method: 'getobject'
});

collector.collect(function (err, str) {
  console.log(str);
  // Prints:
  // # HELP some_counter some help
  // # TYPE some_counter counter
  // some_counter{method="putobject",statusCode="203"} 1
  // some_counter{method="getobject"} 1
});

As suggested by Dave Pacheco, we can introduce a DTrace probe to node-artedi to observe raw metric actions as they occur, before we aggregate labels away. This will soften the blow of trading granularity (dropping labels) for performance/scalability (lower metric cardinality).
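For reference, a rough sketch of what such a probe could look like using the dtrace-provider module; the provider name, probe name, and arguments here are assumptions rather than an agreed-upon interface:

var mod_dtrace = require('dtrace-provider');

// Hypothetical probe, fired with the raw labels before aggregation.
var provider = mod_dtrace.createDTraceProvider('artedi');
var probe = provider.addProbe('counter-increment', 'char *', 'char *');
provider.enable();

// Inside increment(), before dynamic labels are dropped:
probe.fire(function () {
    // 'rawLabels' stands in for the full, unaggregated label set.
    return ([ 'some_counter', JSON.stringify(rawLabels) ]);
});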

KodyKantor commented 7 years ago

After looking into this for a while, I think that we can consume node-skinner. I believe by refactoring to include node-skinner, we can completely get rid of the metric and metric_vector objects. metric_vector essentially did data point aggregation, albeit with less flexibility, and was built on top of the data point primitive of the metric class.

By implementing this, the API will be expanded to include the 'staticLabels' argument in the createCollector constructor, which will act as node-skinner's streaming 'decomps' argument.
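To illustrate what using staticLabels as the decompositions means in practice, here is a small sketch of the aggregation itself, written out by hand rather than via node-skinner's actual interface (the aggregate function name is made up; the data-point shape matches the datapoint format shown in the next comment):

// Hypothetical: collapse raw data points down to the decomp fields,
// summing the values of points that end up with the same label set.
function aggregate(decomps, points) {
    var results = {};
    points.forEach(function (pt) {
        var kept = {};
        decomps.forEach(function (name) {
            if (pt.fields.hasOwnProperty(name)) {
                kept[name] = pt.fields[name];
            }
        });
        var key = JSON.stringify(kept);
        if (!results.hasOwnProperty(key)) {
            results[key] = { fields: kept, value: 0 };
        }
        results[key].value += pt.value;
    });
    return (Object.keys(results).map(function (k) {
        return (results[k]);
    }));
}

// With decomps = [ 'method' ], two putobject points that differ only in
// remoteIP collapse into a single { method: 'putobject' } point.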

KodyKantor commented 7 years ago

I've created (and started on) a number of tickets in node-skinner to facilitate this work.

node-skinner provides two exposition formats: datapoints and arrays.

Aggregated datapoints look like this:

{"fields":{"lat":[0,9],"host":"host2","path":"putobject","method":"PUT"},"value":2}
{"fields":{"lat":[10,19],"host":"host1","path":"putobject","method":"PUT"},"value":1}
{"fields":{"lat":[30,39],"host":"host1","path":"getobject","method":"GET"},"value":1}
{"fields":{"lat":[50,59],"host":"host1","path":"putobject","method":"PUT"},"value":1}
{"fields":{"lat":[80,89],"host":"host1","path":"putobject","method":"PUT"},"value":1}
{"fields":{"lat":[80,89],"host":"host2","path":"putjobsobject","method":"PUT"},"value":1}
{"fields":{"lat":[90,99],"host":"host2","path":"getobject","method":"GET"},"value":1}

and arrays look like this:

[[0,9],"host2","putobject","PUT",2]
[[10,19],"host1","putobject","PUT",1]
[[30,39],"host1","getobject","GET",1]
[[50,59],"host1","putobject","PUT",1]
[[80,89],"host1","putobject","PUT",1]
[[80,89],"host2","putjobsobject","PUT",1]
[[90,99],"host2","getobject","GET",1]

I'm still unsure which exposition format node-artedi should use. The array-based output is much more terse, but node-artedi needs to transform the output again into Prometheus format, JSON, or whatever the user specifies. For that reason it may be useful to have the field names alongside the data, which is the case for datapoints.
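As a rough illustration of why keeping the field names alongside the data helps, converting one datapoint into a Prometheus exposition line is a direct mapping; the toPrometheusLine function below is illustrative only, not proposed API, and ignores histogram-style bucket fields like 'lat':

// Hypothetical: turn one skinner-style datapoint into a Prometheus line.
function toPrometheusLine(metricName, point) {
    var labels = Object.keys(point.fields).map(function (name) {
        return (name + '="' + point.fields[name] + '"');
    }).join(',');
    return (metricName + '{' + labels + '} ' + point.value);
}

// toPrometheusLine('http_requests',
//     { fields: { host: 'host2', method: 'PUT' }, value: 2 })
// returns: http_requests{host="host2",method="PUT"} 2
//
// With the array format, the same conversion would also need a separate
// record of which column corresponds to which field name.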