appcelerator-archive / ampnext-discussion

This is a placeholder repo for discussions and prototyping around the next version of amp.

metrics #8

Open subfuzion opened 7 years ago

subfuzion commented 7 years ago

Proposal

Support Prometheus monitoring only. All of our own code should be instrumented as appropriate to support time-series reporting using the Go client. For short-lived or batch jobs whose data isn't scraped, use the Prometheus Pushgateway Push and PushAdd functions where necessary.
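To make the instrumentation idea concrete, here is a minimal sketch of what a scrape of one counter looks like in the Prometheus text exposition format. It deliberately uses only the Go standard library so it runs without dependencies; real amp code would use the counter types and HTTP handler from the official Go client (`prometheus/client_golang`) instead. The `counter` type, the metric name `amp_requests_total`, and the handler are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// counter is a hypothetical stand-in for a client-library counter.
type counter struct {
	name  string
	help  string
	value uint64
}

// Inc adds one to the counter; safe for concurrent use.
func (c *counter) Inc() { atomic.AddUint64(&c.value, 1) }

// render returns the counter in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
func (c *counter) render() string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n%s %d\n",
		c.name, c.help, c.name, c.name, atomic.LoadUint64(&c.value))
}

// requestsTotal is an assumed metric name, not one defined by amp.
var requestsTotal = &counter{name: "amp_requests_total", help: "Total requests handled."}

// metricsHandler is what a Prometheus server would scrape.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, requestsTotal.render())
}

func main() {
	requestsTotal.Inc()
	// Print what a scrape would return. To serve it instead, register
	// metricsHandler on "/metrics" and call http.ListenAndServe.
	fmt.Print(requestsTotal.render())
}
```

A short-lived job cannot wait around to be scraped, which is why the proposal falls back to the Pushgateway for that case: the job would send the same kind of payload to the gateway before exiting.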

By default, the Prometheus dashboard should not start; it should run only on demand or when explicitly configured to start. There should be a command to start running the Prometheus service. Also consider support for exposing data or potentially registering an external metrics destination (perhaps by leveraging Pushgateway) for external consumers.

Rationale

amp recently added support for Prometheus, but it also supports metrics distribution, retention, and queries that require NATS, Elasticsearch, and maintenance of our custom query support. Prometheus has become the preferred solution in the Docker community, and the project is part of the Cloud Native Computing Foundation (CNCF).

Our implementation requires that we depend on and start a number of fairly heavy external dependencies for message queuing and storage, and it requires maintaining a per-node agent, backend query service, query library, and query command. Eliminating these reduces system overhead and the engineering maintenance burden.

amp currently waits for the monitoring stack to be ready (although we recently added a --no-monitoring option). The default should be no monitoring unless configured for a deployment. The default for the backend should be to start as quickly as possible. It should be easy to start running the dashboard with a command, when desired. Even when configured to start running with an amp deployment, it should be started last and should not delay the system from accepting requests.
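The startup ordering described here can be sketched roughly as follows. This is only an illustration of the intent (monitoring off by default, started last, and never blocking request handling); the `config` type, the `boot` function, and the event strings are hypothetical and do not reflect amp's actual startup code.

```go
package main

import (
	"fmt"
	"sync"
)

// config is a hypothetical deployment config; monitoring is off by default
// because the zero value of bool is false.
type config struct {
	monitoring bool
}

// bootLog records startup events in order; safe for concurrent use.
type bootLog struct {
	mu     sync.Mutex
	events []string
}

func (b *bootLog) add(e string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.events = append(b.events, e)
}

// boot starts the core services and marks the system ready before it
// launches monitoring in the background, so monitoring never delays
// request handling. The returned WaitGroup lets callers wait for the
// optional monitoring start to finish.
func boot(cfg config, log *bootLog) *sync.WaitGroup {
	var wg sync.WaitGroup
	log.add("core started")
	log.add("accepting requests")
	if cfg.monitoring {
		wg.Add(1)
		go func() {
			defer wg.Done()
			log.add("monitoring started")
		}()
	}
	return &wg
}

func main() {
	for _, cfg := range []config{{}, {monitoring: true}} {
		log := &bootLog{}
		boot(cfg, log).Wait()
		fmt.Println(cfg.monitoring, log.events)
	}
}
```

An on-demand start command would then just run the monitoring step by itself against an already running deployment.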

ndegory commented 7 years ago

About the support for registering an external destination, why not do it the Prometheus way? The hierarchical federation architecture allows Prometheus to be scraped by an external Prometheus (or any system that knows how to use the Prometheus metrics). We would still have to protect the endpoint though, especially for cloud deployments.
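As one possible way to protect the endpoint, a small Go proxy layer in front of Prometheus could require a bearer token before passing a federation scrape through. The sketch below is an assumption about how that might look, not an existing amp component: `requireBearer`, `statusFor`, the token `s3cret`, and the job label `amp` are all made up for illustration, and the upstream handler merely stands in for Prometheus' `/federate` endpoint, which takes `match[]` selectors as query parameters.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// requireBearer rejects requests whose Authorization header does not carry
// the expected bearer token. The comparison is constant-time.
func requireBearer(token string, next http.Handler) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		if subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor simulates an external scraper hitting the guarded endpoint with
// the given Authorization header and returns the HTTP status it would see.
func statusFor(authHeader string) int {
	// Stand-in for the real Prometheus federation endpoint.
	upstream := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "up 1")
	})
	guarded := requireBearer("s3cret", upstream)

	// Federation scrapes select series with match[] parameters.
	q := url.Values{"match[]": {`{job="amp"}`}}
	req := httptest.NewRequest(http.MethodGet, "/federate?"+q.Encode(), nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	guarded.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""), statusFor("Bearer s3cret"))
}
```

A token check is only one option; TLS client certificates or network-level restrictions would serve the same purpose for cloud deployments.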

subfuzion commented 7 years ago

@ndegory This proposal is about getting consensus on focusing exclusively on Prometheus and dropping our other telemetry/stats code and dependencies. As far as meeting potential external collection needs, I don't see it as a priority right now anyway other than to consider how in fact this might be accomplished; and I definitely agree that the default option in this area would be to allow external scraping (although the way I currently understand it, that's more about supporting scale than supporting external consumers). But not all consumers will be pull-based, so I'm also mentioning the possibility that we may need to consider providing some kind of push adapter (or not -- that can be the consumer's responsibility). I guess the thing is, are we all in consensus that we are going to focus solely on Prometheus -- and if we do, will we be making things difficult for some of our potential adopters? My vote is Prometheus only, provide a mechanism to expose data for external scraping, but also speak with other potential adopters (eg, Visibility at Scale) to understand their needs.

ndegory commented 7 years ago

TL;DR My vote is also Prometheus only.

Prometheus is becoming a standard for metrics in open source projects, so focusing on it seems like a good idea. We won't be able to support all metrics formats. If the external system requires metrics from Metricbeat/Elasticsearch, the transformation would be a significant effort, and there are many other formats. Providing metrics in one of the standard (and most used) formats is the best we can do. Not having to push to an external system also removes a dependency on our part: we don't have to care whether the external system is up and accepting our data, and whether or not the external system is pulling our data doesn't impact our system.

ndegory commented 7 years ago