TritonDataCenter / node-artedi

a library for measuring fish

Design Doc Question #2

Closed. cburroughs closed this issue 7 years ago.

cburroughs commented 7 years ago

For background, I have done a fair amount of work with metric systems that other people wrote, and with wrangling data out of them, but I can't offer much on the details of writing node libraries.

KodyKantor commented 7 years ago

Hi Chris,

Thanks for the feedback! This is very helpful. You brought up some things that I haven't thought of, so I'll be sure to think on them and update the document as appropriate.

To scope it a little bit, this design doc is focused on the metric client library itself. With that in mind, the library doesn't care if metrics are exposed over http (restify), tcp (fast), or something else. That would be an application-level decision. W.r.t. Manta, the longer-term vision is that this would funnel through CMON (how CMON does discovery of applications exposing metrics is another big question to be answered).
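
To make that concrete, here is a rough sketch of the application-level wiring I have in mind, assuming an artedi-style collector with a collect() routine and a restify server; the endpoint path and port are illustrative, not a committed interface:

```js
var artedi = require('artedi');
var restify = require('restify');

var collector = artedi.createCollector();

// The application, not this library, decides how metrics get exposed.
// Here a restify (HTTP) endpoint serializes the collector's current
// state in the Prometheus text format; a fast (TCP) server, or CMON,
// could pull from the same collector instead.
var server = restify.createServer({ name: 'metrics' });

server.get('/metrics', function (req, res, next) {
    collector.collect(artedi.FMT_PROM, function (err, metrics) {
        if (err) {
            next(err);
            return;
        }
        res.setHeader('content-type', 'text/plain');
        res.end(metrics);
        next();
    });
});

server.listen(8881, function () {
    console.log('metrics exposed at %s/metrics', server.url);
});
```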

Yes, there will be overlap between the distributed tracing effort and adding metrics to applications. Some of the data will be collected for both distributed tracing and metrics. If I understand correctly, is distributed tracing primarily targeting debugging? I'll have to read more on the distributed tracing project.

Prometheus naming conventions are detailed here: naming conventions. You're absolutely correct! Since we haven't gotten to the level of actually integrating this into an application, I haven't thought much about naming conventions. I'll change the design doc to make it more conventional. For the histogram example, the metric should be named http_request_latency rather than muskie_audit_request_latency. On the other hand, the marlin example (marlin_agent_jobs_running) seems fine, since it is an application-specific metric and not a general one, so I'll leave that as is. I can't immediately think of a reason not to follow convention.
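
As a rough sketch of that naming direction (the labels and help text are just illustrative, and the calls assume artedi-style counter/gauge/histogram constructors):

```js
var artedi = require('artedi');
var collector = artedi.createCollector();

// General metric: conventional, application-neutral name. Which
// application it came from is carried in a label rather than being
// baked into the metric name.
var latency = collector.histogram({
    name: 'http_request_latency',
    help: 'end-to-end latency of http requests',
    labels: { service: 'muskie' }
});
latency.observe(27, { method: 'getobject', code: '200' });

// Application-specific metric: the marlin_ prefix stays, since this
// metric has no meaning outside of the marlin agent.
var jobsRunning = collector.gauge({
    name: 'marlin_agent_jobs_running',
    help: 'number of jobs currently running on this agent'
});
jobsRunning.add(1);
```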

The multiple process/zone issue is an interesting problem. We do recognize this as an issue, and it's part of the broader 'discovery' problem when we get into exposing metrics from applications. What applications are providing metrics? What zones are they in? How do we determine which processes in a given zone are providing metrics? How can I contact that application (tcp, http, etc)? Those are all going to be tricky questions to answer. If you have any ideas, definitely let Richard Kiene or me know. :) I don't know that discovery is something this metric client library should handle.

If a library is instrumented and no collector is passed in, it should be a NOP. I'll have to think about how this would work. At first thought, this is something the instrumented application library should handle, and not this metric client library, but I'll have to think more on it. My concern is that if no parent collector is passed in, the metric library won't even get a hook on which to perform a NOP (it would just be an undefined reference).
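
One way around that, sketched very roughly here (the noop stubs and the options.collector name are invented for illustration), is for the instrumented library itself to substitute no-op children when no collector is supplied:

```js
// Hypothetical: an instrumented library accepts an optional parent
// collector. When none is supplied, it falls back to stub objects so
// that instrumentation calls become no-ops instead of throwing on an
// undefined reference.
var noopChild = {
    increment: function () {},
    observe: function () {},
    add: function () {}
};
var noopCollector = {
    counter: function () { return (noopChild); },
    gauge: function () { return (noopChild); },
    histogram: function () { return (noopChild); }
};

function MyLibrary(options) {
    options = options || {};
    this.collector = options.collector || noopCollector;
    this.requests = this.collector.counter({
        name: 'http_requests_completed',
        help: 'count of completed http requests'
    });
}

MyLibrary.prototype.handleRequest = function handleRequest() {
    // Safe whether or not a real collector was passed in.
    this.requests.increment();
};
```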

Ah! YES! Histograms are problematic! Here's a link to some thoughts that I collected earlier: gist. Histograms are given static buckets. If we create our buckets improperly, we will lose some information. However, that only matters for some types of operations on histograms. If all we care about is tracking latency over time, we don't really need any buckets at all. Latency over time can be calculated with a query like rate(http_latency_sum[5m]) / rate(http_latency_count[5m]). In that case we're just using the built-in buckets (which this metric client library supports). If we wanted to run a query to satisfy your example, we'd have to be very thoughtful about our bucket configuration when we implement metric collection in our applications.
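
For the simple latency-over-time case, the instrumentation side only needs something like this (the metric name, units, and labels are assumptions):

```js
var artedi = require('artedi');
var collector = artedi.createCollector();

var latency = collector.histogram({
    name: 'http_latency',
    help: 'time-to-first-byte latency of http requests, in milliseconds'
});

// Record one request that took 153ms. The histogram maintains
// http_latency_sum and http_latency_count alongside its buckets, so
// average latency can be computed on the Prometheus side with
//   rate(http_latency_sum[5m]) / rate(http_latency_count[5m])
// regardless of how the buckets were chosen.
latency.observe(153, { method: 'getobject' });
```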

We'll have to investigate what queries we want to run, and then figure out what to do about buckets. That's an application-level thing at this point unless we decide to implement something from the gist I linked. At this point, I don't believe other metric client libraries support dynamic bucket configuration.

Using DTrace to collect metrics is probably not something we want to do. There was some discussion about it internally (before my time), but I think there were some performance problems with it. DTrace may be something we could use to fix the histogram bucket problem. I don't have the DTrace knowledge to make a call on that.

I think it would be cool to (like Bunyan) have 'metric levels.' For example, maybe in prod the only metric we want to collect is muskie request latency in milliseconds, with labels for request type, http response code, and the zone name. Maybe in dev or staging we also care about the number of http requests (as a metric), and the particular filenames (as labels). To accomplish this, we could set a 'metric level' in prod and dev/staging to different values. This metric client library could NOP the metrics that we don't care about in production, while using DTrace to give us the ability to view NOP'd metrics ad-hoc if needed (like bunyan -p). That's a very intriguing idea. I'm not sure how useful it would be... Something to think about, anyway.
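
Purely as a strawman (nothing like this exists today; the level names, environment variable, and gate are all invented), it might look like:

```js
var artedi = require('artedi');
var collector = artedi.createCollector();

// Hypothetical bunyan-style 'metric levels'.
var LEVELS = { prod: 10, staging: 20, dev: 30 };
var currentLevel = LEVELS[process.env.METRIC_LEVEL || 'prod'];

// Stub that swallows instrumentation calls for metrics above the
// configured level, analogous to bunyan dropping records below the
// configured log level.
var noop = {
    increment: function () {},
    observe: function () {},
    add: function () {}
};

function maybeCollect(level, makeMetric) {
    return (LEVELS[level] <= currentLevel ? makeMetric() : noop);
}

// Always collected, even in prod:
var requestLatency = maybeCollect('prod', function () {
    return (collector.histogram({
        name: 'http_request_latency',
        help: 'latency of muskie http requests'
    }));
});

// Only collected in dev/staging:
var requestCount = maybeCollect('dev', function () {
    return (collector.counter({
        name: 'http_requests_completed',
        help: 'count of http requests completed'
    }));
});

requestLatency.observe(12, { method: 'getobject', code: '200' });
requestCount.increment({ method: 'getobject', code: '200' });
```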

Thanks again for the comments! Sorry that my reply got lengthy. This was very helpful for me to think through things. I have lots to think about, and I'll update the design doc as I progress. Definitely let me know if you have any comments about what I've written here. I'm sure you have more experience with metrics than I do, so I am very interested to hear your comments.

(edited to fix dtrace wording/explanation)

cburroughs commented 7 years ago

The initial use case for distributed tracing is to figure out what architectural changes we need to make provisioning faster, and to measure whatever changes we make to verify they actually work. A manta job might identify interesting traces and we'd look at them in the morning to choose the next lowest-hanging fruit, or something like that. Eventually we want to tighten that up so the tracing data can be used to understand issues that are occurring Right Now. All of that rich tracing data might also reveal things that are useful to expose as either a "metric" or KPI that would otherwise be difficult to calculate, but that's getting more speculative.

I don't know that discovery is something this metric client library should handle.

Fair enough on the bleeding-into-discovery point. But some convention, either in this library (label with pid or listening port) or external to it (something else in the stack is expected to generate a unique id for each metric-generating "thing"), needs to produce unique metrics. (Maybe I'm thinking of some other part of our stack that uses multiple nodejs instances per zone, but my recollection was that it was a pattern that occurred in manta.)

A related note that is probably more Best Practices than artedi per se: should metric-generating things be labeled by logical labels ("7 of 9"), unique instance IDs (zoneid 123, which will change on each redeploy), or both? This might tie into how useful historical metrics are expected to be and how long they would be retained.

Regarding DTrace, I didn't mean to suggest that we would collect metrics via DTrace (like CloudAnalytics), but rather that each metric could maybe be a probe. For example, if an instance were behaving poorly, and in a way that the pre-chosen histogram buckets were not telling us as much as we would like, someone could log in to that one zone and use DTrace to determine what is "really" happening right now by doing its own histogram calculation on the same input.
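
To sketch what I mean (this assumes the dtrace-provider module that bunyan already uses; the provider and probe names are made up):

```js
var artedi = require('artedi');
var dtrace = require('dtrace-provider');

var collector = artedi.createCollector();
var latency = collector.histogram({
    name: 'http_latency',
    help: 'latency of http requests, in milliseconds'
});

// Fire a USDT probe alongside every observation. If the static
// buckets turn out to hide what's going on, an operator could log in
// to that one zone and recompute a histogram live, e.g.:
//   dtrace -n 'artedi*:::latency-observe { @ = quantize(arg0); }'
var dtp = dtrace.createDTraceProvider('artedi');
var probe = dtp.addProbe('latency-observe', 'int');
dtp.enable();

function observeLatency(ms, labels) {
    latency.observe(ms, labels);
    probe.fire(function () {
        return ([ ms ]);
    });
}

observeLatency(153, { method: 'getobject' });
```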

KodyKantor commented 7 years ago

Ah, okay. This clears up a lot of things. Thanks for the reply!

Distributed tracing continues to sound awesome. I'll continue to do research, but they do sound like separate (but overlapping) use cases.

Yes, Manta does have multiple node processes per application per zone. The number per zone seems to vary from dev to prod. To this point I've thought that discovery would be totally external to this library. I'll think more on this decision, because discovery is probably going to be the most difficult part of bringing this into prime time at scale. We need to make sure we get that right. I would consider a label that contains a pid or listening port to be external to this library. I think it is also possible to automatically append labels at the time of scraping (on the server side). I thought I saw something like that in the prometheus or telegraf docs.
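
For example, the application could stamp its own metrics at creation time; the label names here are illustrative, and this assumes artedi simply carries application-supplied labels through:

```js
var artedi = require('artedi');
var os = require('os');

var collector = artedi.createCollector();

// The application disambiguates this particular process by adding its
// own labels; the metric library just passes them along. Scrape-time
// relabeling on the Prometheus/CMON side would be the other option.
var requests = collector.counter({
    name: 'http_requests_completed',
    help: 'count of completed http requests',
    labels: {
        hostname: os.hostname(),
        pid: String(process.pid)
    }
});

requests.increment({ method: 'getobject', code: '200' });
```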

A related note that is probably more Best Practices than artedi per se: should metric-generating things be labeled by logical labels ("7 of 9"), unique instance IDs (zoneid 123, which will change on each redeploy), or both?

We're currently thinking that we will use zoneids. If you have some thoughts on logical vs instance IDs, that would be good to hear. I don't think we've thought much about that.

Regarding DTrace

Ok, awesome. Yeah, this is a good idea. I would say this isn't something we'll do for v1, but something to put on the backlog. DTrace visibility is always good to have, especially if we essentially have to 'guess' at bucket values when we check in code.

rmustacc commented 7 years ago

I would be careful about using a zone's id. It will change and increment each time that the zone (or the broader system) reboots. If we're looking to figure out how to identify components, we should probably sync on that in a broader conversation, as there'll be some intersections between tracing, the monitoring stuff here, fault management, etc.

KodyKantor commented 7 years ago

Thanks for the feedback! I'm going to close this, since we've talked quite a bit since this discussion. RFD 99 has gone through a number of revisions in this time.