aspnet / Logging

[Archived] Common logging abstractions and a few implementations. Project moved to https://github.com/aspnet/Extensions
Apache License 2.0
506 stars 247 forks source link

Add support for reporting application metrics via logging #708

Closed analogrelay closed 7 years ago

analogrelay commented 7 years ago

Overview

We want to enable applications to emit arbitrary named metrics and have them captured by telemetry/logging systems (App Insights, Elasticsearch/ELK, etc.) as well as other metrics systems (Statsd, InfluxDB, etc.). An abstraction is desired in order to allow applications to emit metrics without being coupled to the destination of the metrics, particularly with features like App Insights auto-lightup where the app doesn't directly depend upon App Insights.

Proposal

We will add support for metrics reporting to ILogger. It will be done via a new companion interface IMetricsLogger, which loggers may optionally implement. Applications can emit metrics through this interface (see API below), and logger providers which support metrics can receive and process those metrics. Metrics are (name, value) tuples, where value is always floating point number. Metrics are expected to be viewed in aggregate, hence it doesn't make sense to use a generic data type like object for the metric, so metric values are always doubles. Supporting arbitrary types of metric values is a non-goal. Reporting metrics should be as efficient as possible, especially in the case where no logger provider is registered to listen for them, or they are filtered out.

Why add to ILogger?

Initially, I considered developing a new abstraction for Metrics. However, I encountered a few concerns:

  1. We would need to develop a structure for naming/organizing metrics (likely using type/namespace names as a base)
  2. We would need to develop a structure for filtering metrics (by categories, names, granularities, etc.)
  3. Many telemetry systems provide support for both log and metric collection, and would have to provide a new system
  4. Consumers have to pull in two different diagnostics reporters from DI (ILogger<T>, IMetricReporter<T>, etc.)
  5. This would create Yet Another Diagnostics Abstraction (on top of EventSource, EventCounters, DiagnosticSource, Perf Counters, yada yada yada (see what I did there? ;P))

The fact that 1 and 2 above already exist in the Logging pipeline, and many metric sinks are also logging sinks, led to the idea of integrating metrics into Logging rather than creating a completely new abstraction.

The main reason why we would not want to add this to ILogger would be a risk

The API

IMetricLogger

Optional companion interface for ILogger implementations to indicate their support for metrics. The ILogger instances returned by ILoggerProvider are expected to implement this interface if and only if they wish to receive metrics.

namespace Microsoft.Extensions.Logging
{
    public interface IMetricLogger
    {
        /// <summary>
        /// Define a new metric with the provided name and return an <see cref="IMetric"/> that can be used to report values for that metric.
        /// </summary>
        /// <param name="name">The name of the metric to define</param>
        /// <returns>An <see cref="IMetric"/> that can be used to report values for the metric</returns>
        IMetric DefineMetric(string name);
    }
}

IMetric

Interface that allows metric data to be reported. Metric values

namespace Microsoft.Extensions.Logging
{
    public interface IMetric
    {
        /// <summary>
        /// Record a new value for this metric.
        /// </summary>
        /// <param name="value">The value to record for this metric</param>
        void RecordValue(double value);
    }
}

LoggerMetricsExtensions

Provides an extension method on ILogger to enable consumers to emit metrics, even if the underlying logger provider doesn't support it (of course, the metrics will be ignored in that case).

namespace Microsoft.Extensions.Logging
{
    public static class LoggerMetricsExtensions
    {
        /// <summary>
        /// Define a new metric with the provided name and return an <see cref="IMetric"/> that can be used to report values for that metric.
        /// </summary>
        /// <remarks>
        /// If none of the registered logger providers support metrics, values recorded by this metric will be lost.
        /// </remarks>
        /// <param name="name">The name of the metric to define</param>
        /// <returns>An <see cref="IMetric"/> that can be used to report values for the metric</returns>
        public static IMetric DefineMetric(this ILogger logger, string name)
        {
            if(logger is IMetricLogger metricLogger)
            {
                return metricLogger.DefineMetric(name);
            }
            return NullMetric.Instance;
        }
    }
}

MetricValueExtensions

Extension methods for IMetric to support other common metric value types (as mentioned above, the underlying type is still double, so the value must be representable as a double).

Suggestion: We could have a .Time() API that returns an IDisposable which uses a Stopwatch to do the timing for you as well. That would be easy to add later though.

using System;

namespace Microsoft.Extensions.Logging
{
    public static class MetricValueExtensions
    {
        /// <summary>
        /// Record a new value for this metric.
        /// </summary>
        /// <remarks>
        /// This is a convenience method that will convert the <see cref="TimeSpan"/> to a <see cref="double"/> via
        /// the <see cref="TimeSpan.TotalMilliseconds"/> property.
        /// </remarks>
        /// <param name="metric">The metric to record the value on.</param>
        /// <param name="value">The value to record for this metric.</param>
        public static void RecordValue(this IMetric metric, TimeSpan value) => metric.RecordValue(value.TotalMilliseconds);
    }
}

Logger updates

The aggregate Logger class we return from LoggerFactory.CreateLogger would be updated to support IMetricsLogger, even if none of the ILoggers it aggregates support it. It would only send metrics to the loggers that are known to support the interface.

Define/Record pattern

In order to keep the performance cost as low as possible for recording the actual metric value, the API uses a Define/Record pattern. Consumers should expect to call DefineMetric("foo") once for a particular value of "foo" and store the result as globally as possible (in singleton types, etc.). It is DefineMetric that does the bulk of the work to set up the metric recording. After calling DefineMetric, the RecordValue call on the IMetric interface can choose how to store the value based on the providers needs. For example, in a provider that plans to send pre-aggregated metrics, the RecordValue could simply update rolling aggregate values and then drop the actual recording (thus having constant storage requirements for each metric). If a provider needs to report individual metrics, it can use dynamically-allocated buffers or ring buffers, depending upon the configuration and the needs of the provider. As an example of an IMetric implementation based on pre-aggregating data, see https://gist.github.com/anurse/03c6746204317bd3616c372b4df9dbba

Filtering Metrics

Metrics should be filterable, and they should participate in the existing Logger filtering pipeline. My current proposal is that Metrics are treated like LogLevel.Critical messages, in that if the category is enabled at all, the metrics are written. We could consider capturing a LogLevel in .DefineMetric and using that to provide different "levels" of metrics. Since the existing filters only let you filter by Provider, Category, and Level, it is very difficult to filter out specific metrics by name (as that would require a new Filter structure). This may make LogLevel support for metrics more important.

@pakrym and I talked and came up with a specific suggestion

Metrics can be filtered at the provider and category name, but there is no concept of a LogLevel for metrics. However, filters can disable all metrics for a particular provider/category combination. To do this, we'll add an AllowMetrics boolean to LoggerFilterRule which defaults to true. When writing a metric, we'll check the provider and category name against the filter options as well as checking this boolean.

Scopes and Property Bags

FEEDBACK REQUESTED

My initial thought is that Metrics exist outside of scopes, as they are generally aggregated and Scopes imply per-event data. Providers could, in theory, give their IMetric instances access to read active scope data, and attach them to metrics if they choose (for example, systems that report unaggregated metrics may want to). For similar reasons, property bags (i.e. Log Values/Structured Logging, etc.) don't really make sense for metrics in general as they are usually aggregated. Having said that, many providers do support arbitrary properties (InfluxDB, App Insights), so some level of support may be useful. We could easily add an optional parameter to .RecordValue to allow adding arbitrary properties to each metric, and providers could choose to disregard it if it doesn't make sense for that provider.

Impact on existing Providers

There is minimal impact on existing Providers. Because the interface provides an opt-in model for a provider to receive metrics, existing providers are unaffected. I'd imagine providers like Serilog would not participate in this system as they don't have an infrastructure for metrics (/cc @nblumhardt , I may be wrong!). Providers like App Insights may choose to either add metrics support to their existing providers or add new providers that only handle metrics (and ignore log messages) that are co-registered.

analogrelay commented 7 years ago

/cc @pakrym @davidfowl @glennc @muratg @Eilon

analogrelay commented 7 years ago

PR of initial implementation coming very soon!

nblumhardt commented 7 years ago

Thanks for the ping @anurse! Looks good from a Serilog perspective, AFAICT.

I've done some work with InfluxDB in this area; the single numeric value sounds fine, but the lack of support for key/value tags seems like it will limit the API's usefulness with Influx (not sure about other platforms).

In scenario terms -

analogrelay commented 7 years ago

I want to slice the data by the particular transactions, by customer, by machine name, by database host IP... but still roll up aggregates over any of those dimensions

Yeah, I came to the same conclusion in my explorations with Influx. I think the pattern I'm leaning towards is allowing key-value pairs in RecordValue and allowing Providers to ignore them if they aren't relevant (ditto Scopes).

analogrelay commented 7 years ago

Properties/Scopes are a tricky one to solve well.

  1. We want to be able to record a metric value without requiring an allocation. Thus, we want to use the same generic pattern we using for logging where we accept an arbitrary T (which could be a struct)
  2. However, metrics are usually stashed away somewhere and then flushed to the back-end in a background process. This means structs would have to be boxed (unless you do something like a ring buffer, but even that is a stretch).
  3. Scopes are even trickier because there could be a lot of data stored in the scopes, and much of it may be irrelevant to metrics. For example, the Request ID is not very relevant to metrics (use a log message if you want info on a single request), but Request Path is relevant (as multiple requests can share that attribute). It's impossible for a metric provider to know which fields matter for the metric. The more data associated with the metric, the more impact it has on the performance of the code being measured, which makes it a lot more difficult to use.

I'd lean towards supporting arbitrary properties, but discouraging providers from reading Scope data (we can't stop them from reading it ;)). It does mean that if I want to attach some arbitrary property to an EF metric (for example), I can't do it unless EF exposes a way to do it. Perhaps that's OK for now.

alhardy commented 7 years ago

Are metrics really a logging concern? If I were interested in tracking application metrics would it make sense to have to take a dependency on Microsoft.Extensions.Logging?

Have you guys looked at App Metrics for example?

This reminds me of when I noticed Health Checking functionality being added.

Once I noticed Health Checks being implemented by Microsoft I went to efforts to abstract that out of App Metrics, I don't see any progress on https://github.com/dotnet-architecture/HealthChecks anymore?

muratg commented 7 years ago

@glennc

galvesribeiro commented 7 years ago

@anurse a while ago, I added Telemetry APIs to Orleans (https://github.com/dotnet/orleans/issues/992) which essentially add support to an abstraction that makes both trace/logging and metrics tracking mostly inspired on AI and NewRelic APIs.

Recently we removed the log/tracing from those APIs in favor of use Microsoft.Extensions.Logging abstractions so we would leverage most of the packages (like Serilog) that were already with its plugins for those abstractions.

Now, I'm going to revamp the telemetry APIs v2 and someone from the Orleans core team pointed me this issue out and would be good if we could use those metrics abstractions but I have some concerns about the way it is being implemented:

  1. Why don't make IMetric<T> where T can be int, float, decimal, long? I think using double only is way restrictive.
  2. Why not have it on Microsoft.Extensions.Metrics? I agree with @alhardy that metrics is not a Logging concern. If you think from the APM perspective, both metrics and trace/logging are telemetry-related concerns and they have correlation. So if you want to keep both together, I guess you would need a Microsoft.Extensions.Telemetry package otherwise, keep them separated would be better. I understood you want to leverage the categories infrastructure from MEL, but I guess it can be abstracted in another packages and be referenced by both MEL and MEM.
  3. I'm not sure if Define/Record pattern is a good thing... Metrics carry much more metadata then that. They can or can not be aggregated, batched, auto-flushed or not, etc. There are much more configuration and it must be defined at the method creation time.
  4. Property bags is definitively something that you will want to have support on all methods tracking metrics. Those bags carry context that makes sense for a given metric. i.e. if you are tracking an HttpRequest, you would include all the headers, browser and server details, etc. While tracking an exception, you would need context as well. Essentially an IDictionary<string, object> where object must be JSON serializable.

I would be interested in use it in Orleans but the way it is, it would limit us a lot.

alhardy commented 7 years ago

And here I am asked to remove my dependency on Microsoft.Extensions.Logging and Microsoft.Extensions.DependencyInjection https://github.com/alhardy/AppMetrics/issues/186

I'm not sure why Microsoft is interested in including their own implementation to report metrics to OSS TSDBs like InfluxDB when they already have AppInsights? There doesn't seem to be much support from Microsoft in terms of OSS, wasn't this the idea of the dotnet foundation, rather than including everything possible in the dotnet framework?

App Metrics is my first attempt at an OSS project (other then contributing to other projects), rather than support from Microsoft, equivalent implementations are underway, Metrics and Health Checks. There are also other examples of this type of thing .NET developers have raised recently and in the past.

Microsoft I think has done amazing things all-round recently, very impressed, but the above makes me loose motivation in contributing to the dotnet community :(

Eilon commented 7 years ago

@alhardy we haven't decided to anything just yet! That's why we've logged this issue: to invite community discussion!

Drawaes commented 7 years ago

Here is my thoughts on this as someone who works in a very large corporate environment that has large logging as well as large telegraf/influxdb setups.

  1. Absolutely the metrics need to be "baked" into the framework. So that all framework components will support/use them.
  2. There is plenty of scope for "OSS", in the space of writing the adapters and aggregators
  3. We have lost a bit with windows perf counters going xplat this would help fill that gap
  4. Having to register your metric upfront leads to the problem you defined above around needing arbitrary extra key/value metadata added, instead.....

Make it so you can fire any metric and make the metric name a "namespace" effectively. So your metric maybe something like

http:response:200:/api/blabla|10

Or something like that. Then it's really up to the adapters to make it efficient and smart. In fact if you make the "key" just opaque bytes then you can deal with that however and cache those values.

Anyway just my 2p but this is a very important topic and one that I see as a weak point of c# aspnetcore apps in a big mixed Java/Go/C# ecosystem so I am happy this is being discussed!

alhardy commented 7 years ago

@Drawaes Yeah OK good points, in particular 1 and 3 👍

cwe1ss commented 7 years ago

As always, I'm adding my rant that you should sit together with Vance and the corefx team to not come up with yet another solution for very similar things.

E.g. metrics for operation-based things (like incoming / outgoing http requests) are already covered by Activity. People/metrics providers can just subscribe to activities and therefore re-use all the stuff that it offers (and has been requested above: timing, count, object context, tags).

Don't make us implement yet another abstraction in our frameworks/libraries - 283 lines of instrumentation code (as in the incoming ASP.NET Core request Diagnostics class) is enough. :)

See https://github.com/dotnet/corefx/issues/18114 for more context.

Give Activity some love and make sure it's easily usable and well documented.

People who need other kinds of metrics can always use e.g. AppMetrics which offers much more functionality anyway.

By introducing such a simple metrics abstraction in Microsoft.Extensions.*, you're just creating more Microsoft-internal competition and you're probably killing a great OSS project.

davidfowl commented 7 years ago

@cwe1ss how would you suggest we implement exposing the current connection count in SignalR and the number of websocket messages sent per second?

davidfowl commented 7 years ago

As always, I'm adding my rant that you should sit together with Vance and the corefx team to not come up with yet another solution for very similar things.

We're best buddies with @vancem 😄

cwe1ss commented 7 years ago

I don't know how that's implemented but if you had one activity per connection, subscribers could maybe track the Start and Stop events and do the calculation themselves?!

Am 20.09.2017 um 16:27 schrieb David Fowler notifications@github.com:

@cwe1ss how would you suggest we implement exposing the current connection count in SignalR and the number of websocket messages sent per second?

cwe1ss commented 7 years ago

(This would only cover the Connection count though). Aren't websocket messages per sec something for PerfCounter/EventCounter?

davidfowl commented 7 years ago

@cwe1ss an activity to count the number of concurrent connections? I understand you want a single abstraction but that's just a hack right? The reason that metric system take numbers is because they are counting things. Trying to shoehorn everything into the same API makes a mediocre experience for producers and consumers as we'll end up inventing hidden semantics that can't be expressed in the API.

cwe1ss commented 7 years ago

no, the Activity is used for the instrumentation side to indicate an operation - nothing more, nothing less. Why should I as a framework/library developer care if you want to use something like Zipkin vs something like Prometheus? It's up to the consumers to decide what to do with it (in the DiagnosticSource events) - e.g. Zipkin might create a trace/span, Prometheus might create multiple counters (total, per country, whatever, ...), ...

if creating an activity and running through the subscribers is too much overhead (maybe for msgs in a connection, but surely not for the connection itself, right?) then why not use existing stuff like EventCounter/PerfCounter.

I'm not saying that I like Activity/EventCounter/... 100% but it's what's there in corefx - and it's brand new - so why invent something else??

We've had this discussion before so we probably just have to agree to disagree - just wanted to place my rant again so that I can tell my grandkids one day that it's not my fault that they have to use 20 different Diagnostics things in .NET 😀

cwe1ss commented 7 years ago

PS: I'm not familiar with web sockets. Is the connection duration also an interesting metric? By instrumenting it with Activity you would get that automatically as well.

markrendle commented 7 years ago

It's not at all clear why IMetricsLogger would be an optional, additional interface that ILogger implementations can add. Logs go to ELK or Splunk or Seq wherever. Metrics go to time-series databases (Influx/Prometheus/Graphite) or stats aggregators (statsd/collectd/Telegraf). They're not remotely the same thing.

I can see that it would make sense for the Microsoft Application Insights® ILogger implementation to also accept metrics, because that platform provides both those things, and I know there are other telemetry systems that do likewise, but many of the solutions out there do not. If @alhardy wants to add an implementation of this IMetricsLogger abstraction to App Metrics, where is the sense in him having to provide any kind of implementation of ILogger at all? It's just going to NOOP on Log(...) and return false from IsEnabled and I-don't-know-what on BeginScope().

Yes, there is a degree of commonality regarding points 1 and 2 in @anurse's list (type/namespace bits and so on) but there's a world of difference outside of there.

markrendle commented 7 years ago

@anurse DefineMetric should have a key/value pairs parameter as well.

markrendle commented 7 years ago

Would a Counter metric work by doing RecordValue(1)?

Related: is there a case to be made for having an on-interface RecordValue(long value) overload as well as double? In many of the line formats integers and floats are denoted separately (e.g. 1i vs 1.0).

Drawaes commented 7 years ago

I don't like the define... what if I define it 2x, what if I don't define it before I send a metric?

Also I tend to agree that logging and metrics a separate.. .. like @markrendle called it, we are Kafka/ splunk - telegraf/ influxdb

markrendle commented 7 years ago

I'm guessing that once the metric is defined the first time, it hangs around and gets re-used like Loggers, although that appears to be up to the implementer of IMetricLogger. And I guess that the baked-in meta-implementation of Logger would do that? Something like

private readonly ConcurrentDictionary<string, IMetric> _metrics;

public IMetric DefineMetric(string name, params (string key, string value)[] tags)
{
    return _metrics.GetOrAdd(name, n => CreateNewMetric(n, tags));
}

private IMetric CreateNewMetric(string name, params (string key, string value)[] tags)
{
    var metric = new MetaMetric();
    foreach (var loggerInfo in Loggers)
    {
        if (loggerInfo.Logger is IMetricLogger metricLogger)
        {
            metric.Add(metricLogger.DefineMetric(name, tags));
        }
    }
    return metric;
}

So if the meta-implementation does that, do the actual implementations even need to cache them internally?

Of course now there's a question whether the tags would need to be part of the key that uniquely identifies the metric, and the performance overhead of turning a name plus an arbitrary array of (string,string) tuples into a key...

Drawaes commented 7 years ago

Sure i get what it will do. My question is why leak the "create it and cache it" abstraction. Why can't send any arbitrary key, if it's the first time the it's seen it then create it.

markrendle commented 7 years ago

Well, if you've got a type that's already designed to be a singleton (like an API client wrapper), it's better to define the metrics in the constructor into readonly IMetric instances and then just use them on every call.

Drawaes commented 7 years ago

What if I have a metric for every time a certain http code is returned. Do I have to register say all 1000 numbers or whatever in case I send one? That is my problem with forcing preregistration.

Now you might say I could tracking I made it or not and I certainly could but why should the client app do that or care. One big difference between metrics and logging is that metrics should be a lot more fire and forget... ( hence udp and arbitrary keys in statsd)

galvesribeiro commented 7 years ago

This is what I plan todo for Orleans telemetry APIs v2 in case we don't find the proper abstractions here:

Just as a context, the facet system is somehow a way to configure some of the services that are injected to other types.

public abstract class MetricFacet : Attribute
{
    public string Name { get; set; }
    public bool IsShared { get; set; }
    public bool AutoFlush { get; set; }
}

// Each type of metric would have a facet attribute that would describe how it can be configured
// Since Facets system from was designed with the purpose of offer a common standardized way to 
// inject framework services, I thought it would be a good fit for telemetry/metrics
public class CounterMetricFacet : MetricFacet
{
    // Should aggregate "BatchCount" events before flush (even if AutoFlush == true)
    public bool Aggregate { get; set; }
    public int BatchCount { get; set; } = 10;
}

public class TimeSpanMetricFacet : MetricFacet
{
    // Similar to batch, but only report data when a window is expired.
    public TimeSpan Window { get; set; }
}

// Basic metric interface which all metrics should implement
// Regardless of the type of the metric, they all need to flush it to the registered telemetry consumers
public interface IMetric<T>
{
    Task Flush();
}

// The CounterMetric metric behaves similar to a performance counter, counting an arbitrary number.
// Based on the Facet configuration, it can aggregate a number of events and only flush once.
public interface CounterMetric<T> : IMetric<T>
{
    Task Increment(Dictionary<string, object> meta);
    Task Decrement(Dictionary<string, object> meta);
    Task Increment(T amount, Dictionary<string, object> meta);
    Task Decrement(T amount, Dictionary<string, object> meta);
}

// Another example of metric which would be used to track down some execution times
public interface ITimeSpanMetric : IMetric<TimeSpan>
{
    Task BeginRecording(Dictionary<string, object> meta);
    Task EndRecording(Dictionary<string, object> meta);
}

// A regular grain which receives the metrics injected as Facets
public class MyGrainWithCounter : Grain, IMyGrainWithCounter
{
    private ICounterMetric<int> _myIntCounter;
    private ITimeSpanMetric _myTSMetric;

    public MyGrainWithCounter(
        [CounterMetricFacet(Name = "MyIntCounter", IsShared = true, Aggregate = true, BatchCount = 100)] myIntCounter,
        [TimeSpanMetricFacet(Name = "MyTSMetric", AutoFlush = true, Window = TimeSpan.FromMinutes(1))] myTSMetric)
    {
        _myIntCounter = myIntCounter;
        _myTSMetric = myTSMetric;
    }

    public async Task DoSomethingLong()
    {
        await _myTSMetric.BeginRecording();
        await Task.Delay(100000); //Do something really useful and that takes time...
        await _myTSMetric.EndRecording();
    }

    public async Task AddSomethingForMe(int amount)
    {
        //To some business logic for the grain method and after we are done increment the counter with the value passed by
        await _myIntCounter.Increment(amount);

        // We need the APM tool to receive the metric data right away, so flush it!
        await _myIntCounter.Flush();
    }
}
//The flushing, aggregating, etc, are all configurations that would be specified at the facet attribute,
//so it can be created and injected before the grain is created. That way we would achieve granularity 
//of the configuration per metric, per grain type. 
//The Telemetry consumers would be almost the same as they are today.
//Depending on the metric type, if they are shared, they could leverage some system target that would be
//counting that particular metric and behaving as a backend for it, also making use of the turn-based concurrency model (single-thread guarantee per grain), 
//so no metric writes would be lost.

Not all the aspects on this implementation would fit a general purpose metric abstraction, but they can give a sense of freedom in terms of public developer APIs. For example, the Facets system used in Orleans could be easily replaceable by a MetricsFactory.Create<T>(params) injected from DI.

In other hand, what we call Telemetry Consumer, is nothing more than another abstraction where logging/metric framework developers will plug in their backend implementations (i.e. AI, NewRelic, App Metric, etc.). That way we decouple metric publishing from consuming in the framework.

The IMetric<T> variations would have particular implementation with different semantics. For example, the ICounterMetric would be something similar to how PerfCounters behave, but it wouldn't care how that metric is consumed. It just care of collect/count, pre-aggregate, and flush to the consumer.

Drawaes commented 7 years ago

Thing is I would want aggregation etc to be done outside the code/framework for this. It should be basically a pĺace to fire something into with pretty much no logic, just a standardised API. Perfcounters are both the collector and the write point which always seemed wrong

For instance telegraf would take care of my histogram and sampling intervals and stats for me not inside my code. For this reason ... when you say aggregate... how?

Mean, median, min, max or +- 1 stddev? Or do you mean chunk the sends?

galvesribeiro commented 7 years ago

@Drawaes when I say aggregate, I meant batching. The system may not want to send data every request, they may want to batch several metric write requests before pushing then to the consumers or, they can even do mathematic aggregation locally before push to the consumers.

So yes, chunks of sends, or "local aggregation".

Drawaes commented 7 years ago

But why would that be in the api/icollector [insert thing here] I presume that that thing would be in proc and that should make the decision. Say I used the serilog metric collector (If there was one) I want that to do all that goodness for me not the .net framework component.

cwe1ss commented 7 years ago

Thing is I would want aggregation etc to be done outside the code/framework for this. It should be basically a pĺace to fire something into with pretty much no logic, just a standardised API. Perfcounters are both the collector and the write point which always seemed wrong

That's my point with Activity. It's a "reasonably" low-overhead in-process eventing system where you just say "action x (with property y=z) started at timestamp 1" and "action x (with y=z) stopped at timestamp 2". If a metrics system is implemented as a consumer of this, it can do whatever kind of aggregation, filtering, grouping, sampling it wants. Different consumers can also apply different rules - e.g. the metrics consumer will write metrics for all activities and the tracing consumer will apply sampling. Activity has an IsEnabled check so there's almost no overhead if no consumers are registered. But of course, if there is one, the overhead is higher than using a raw metrics API as it will create the Activity instance etc

vancem commented 7 years ago

There are big structural issues to be resolved here. As @anurse and @cwe1ss indicate, we ALREADY have several logging systems. It is GOOD that this proposal does not suggestion making ANOTHER independent one for metrics, but this proposal is wiring metrics into one particular one (ILogger), so we should at least have a rationale on why that was a good idea.

To get a good handle on creating a rationale, I want to back up a minute and discuss requirements. Why have metrics? It is actually a non-trivial question. In a very real sense all telemetry starts out as events (something happens at a particular time). Eventing systems simply send the message about that with a payload) were metric systems typically do some aggregation and send out far less frequently (e.g. once a second)

This is the fundamental value of metrics: they aggregate. As others have pointed out, you COULD implement a metric system ON TOP OF an eventing system. That is when telemetry points are hit, counts are updated, and then from time to time (sampling), you send out the aggregated number.

The fact that metric can live ON TOP OF an eventing system, is an important insight, as @anurse points out, any new logging/eventing 'pipe' has to deal with how to name it, how to subscribe to it, what the data formats are, how to specify filtering and parsing etc. In short, it is moderately complex, so you would like to reuse it. This is the crux of @cwe1ss 's issue (he would like to subscribe to only one). Thus we would really like to build the metric system on top of an existing eventing system (which is what @anurse is proposing).

But the characteristics of the eventing 'pipe' matter. In particular the ILogger pipe is designed specifically for use in ASP.NET's dependency injection system. This has advantages, but it has a big disadvantage: all telemetry points need to be given 'something' that specifies which ILogger to use (typically through a IServiceCollection object). This is fundamentally problematic for .NET Framework code because we don't want fundamental services (e.g. like HTTP) to have to have dependencies on things like IServiceCollection or have to pass around ILogger instances. This means any metric system that is built on top of ILogger will not have instrumentation points deep in the framework. This a significant disadvantage (@Drawaes points out the value of these 'baked-in' telemetry points).

It is notable that the other two logging systems (EventSource and DiagnosticSource) can be used effectively anywhere in .NET Core, and in fact important points like HTTP request coming into and out of the system are already instrumented.

Indeed there already exists an EventCounter class that is designed to live 'on top' of an EventSource (at the time it was created, DiagnosticSource did not even exist). It is designed to given an EventSource to send its aggregations to, but you could easily imagine it being extended to be bound to a DiagnosticSource (or even an ILogger). It is pretty similar to the IMetric interface suggested above.

This suggests a way forward. Ideally the goal here presumably is to have counters (metrics). What metrics do is to aggregate, and then send aggregated information. Thus you can build this 'on top' of existing logging. We should be doing that. This allows this functionality to be used by any of our logging system (and in particular by the FRAMEWORK code which is likely to be critical).

There are undoubtedly requirements that @anurse implementation was considering that I have not, but I think we do owe it to ourselves to determine how we can reuse/leverage existing infrastructure, and have a rationale for our overall design (that is how does this fit in with the other logging systems we already have, and does it makes sense)

I think if you do that you end up with a metric system that is relatively agnostic to the logging infrastructure beneath it, and lives strictly on top of it.

There is plenty more to say, but this is enough for round one...

@brianrob @stephentoub @geoffkizer @SergeyKanzhelev

SergeyKanzhelev commented 7 years ago

In Application Insights we have metrics which are based on events and it works great when you need a bit of both. In fact as @cwe1ss pointed it works the best for activities as you may decide to expose metrics with numerous dimensions depending on your needs. And those dimensions can be configured without framework code modification by logging system.

@vancem what is missing and what we attempted to solve couple times already is InterlockedIncrement performance-wise metric aggregation. @macrogreg is working on an API right now. Basically the requirement that needs to be addressed is how framework can report a value that can be aggregated with a single atomic operation. Without memory allocation or lookup by name.

Examples may be different queue sizes, cache state and memory utilization metrics.

Another use case I see for a metrics library defined in framework is to be a communication protocol between libraries. Imagine metrics calculated by Application Insights reported to Prometheus or StatsD.

W.r.t. API - dimensions are definitely must have. Also many libraries allow to Record dimension value. It allows to move the logic of dimensions capping and smart allocation of metrics to the logging library. Customer code will not need to worry of defining a separate metric for every dimension.

benaadams commented 7 years ago

what is missing and what we attempted to solve couple times already is InterlockedIncrement performance-wise metric aggregation.

Good place to start is

ThreadLocal<T>(trackAllValues: true)

Then you aren't getting contention when outputting the values, and can periodically aggregate

SergeyKanzhelev commented 7 years ago

@benaadams can you pls elaborate?

analogrelay commented 7 years ago

The feedback on this thread has been really great so far. I want to echo what @Eilon said above though. This is like a first prototype concept. We have a PR with some code, but we all know that code changes and code gets deleted :). Things aren't set in stone and we want to iterate on this design a lot (and sometimes iterating means deleting and starting again ;)).

The primary requirement is to provide a metrics abstraction that the ASP.NET framework can consume and report to. We also want to provide a metrics system that co-operates with the DI system in order to serve the app hosting models used by the Microsoft.Extensions.* systems. EventCounters being tied to EventSources, which are a process-global concept, don't fit well into that as a framework-level (specifically ASP.NET framework) abstraction. So @vancem is right here in that our requirements don't match exactly with the core EventCounter pattern. Having said all that, changing to a design that doesn't mandate a connection with ILogger is certainly valid feedback and I think we'll iterate a bit on that.

That is when telemetry points are hit, counts are updated, and then from time to time (sampling), you send out the aggregated number. - @vancem

This is unfortunately not always the case and the source of a mismatch between EventCounters and the APM tools we're seeing in the wild. Systems like statsd and InfluxDB tend to encourage transmitting unaggregated metric data and allow the consumer to perform the aggregations later, filtering and grouping on specific properties as necessary (which gets to @Drawaes' point on metrics with metadata, which is something we're going to look at integrating into the design). Other systems do expect in-process aggregation, but since it's not consistent how users do this, we need an abstraction that allows for both to happen so that frameworks and libraries can use that abstraction and the decision about where to aggregate can remain with the application developer. This is even more important with the idea that different metrics require different aggregations (various percentiles are important, etc. it's not just a constant set of count, mean, max, min, etc.). We want to build a metrics abstraction that does not impose our way of monitoring, but rather enables others.

In particular the ILogger pipe is designed specifically for use in ASP.NET's dependency injection system. - @vancem

This is precisely the goal of this abstraction. As a producer abstraction, the goal here is to build something that enables ASP.NET (and other Microsoft.Extensions-based) frameworks and applicatons to emit metrics in a way that matches with the eventing system they already use. The use of ILogger reflects this. I've definitely heard the feedback that it feels weird to use ILogger as the place to hang metrics production, and we'll do a little more thinking about this and keep that discussion open. The benefit of having it there is that most components that would want to emit metrics already have an ILogger. And in fact, most metrics are tracking things that are already written as log events. Having said that, I see the argument that while it's useful, it could also be restrictive to tie this to Loggers. Perhaps a system where ILogger can be used to emit metrics, but other abstractions exist as well would be a better design, we'll iterate a bit on that.

This means any metric system that is built on top of ILogger will not have instrumentation points deep in the framework. - @vancem

As we've learned with other diagnostics integrations, this is really not a goal we can reasonably achieve. The Microsoft.Extensions frameworks are very anti-globals as an explicit design goal. The underlying .NET runtimes/framework stuff can't rely on things like DI and thus has more global abstractions (and rightly so). This is why our goal is not to produce an abstraction that the underlying .NET runtimes/framework can use. EventCounter already exists for that. However, EventCounter is not something that fits into our layering because of how global it is, so we need something that abstracts things into a model that does fit.

My question is why leak the "create it and cache it" abstraction - @Drawaes

This is a performance-guided pattern and you may be correct that it's a bit overkill. The idea here is that you want to create and allocate the necessary buffers and structures for the metric in advance so that recording values (which will likely occur in hot paths) can be done very cheaply. Perhaps this is leaking too much abstraction, we can look at iterating a bit on that. I'm a little nervous about requiring metrics providers to have a dictionary lookup on every call to RecordValue though. Perhaps requiring providers that aren't just fire-and-forget (like statsd) to shuffle that work off to background threads is sufficient though.

As to double-calling Define, @markrendle has got it right, we make sure two calls to Define with the same name get the same object. So it would be technically safe and correct to always use .DefineMetric("foo").RecordValue(42) (and you could make an extension method to do that for you).

This is what I plan todo for Orleans telemetry APIs v2 in case we don't find the proper abstractions here - @galvesribeiro

It looks like in your suggestion aggregation and consumption parameters being specified at the place you write the metrics in your pattern. From what I've seen, there's enough variation in how users want to aggregate that it feels cleaner to push the aggregation logic out to the consumption end. Is that not something you've seen in your scenarios? That would be valuable feedback for us :)

Property bags is definitively something that you will want to have support on all methods tracking metrics. - @galvesribeiro

Feedback received ;). I had noticed the same in my investigations into various TSDBs and metrics forwarders. Some systems don't support recording those arbitrary parameters (EventCounters, App Insights discourages it by expecting pre-aggregated data) but we also realized that a pattern where the producer can emit key-value pairs and consumers can decide if they can support them or not.

Why don't make IMetric where T can be int, float, decimal, long? I think using double only is way restrictive. - @galvesribeiro

I'd hesitate to support IMetric<T> as I think many providers would have a lot of trouble with aggregating say a string ;). Providing the ability to use any numeric type and propagate that type down to the provider (so the provider can apply any necessary optimizations) does seem reasonable though. Just have to think of the way to do it (since you can't do where T: int | double | decimal | ... ... yet?)

We also need to dive a bit in to the idea of Metric Types. A number of existing systems provide them, mostly those descended from the Java Metrics library. Things like "Gauges", "Counters", "Meters", etc. What I'm less clear on is the value of specifying these types up-front at the point of production vs. having the consumer control the aggregations they want to perform. For example, with a metric like "Requests per Second", I may want to be able to view both the latest value and the aggregations, as well as histograms. Requiring the producer to define the type up-front seems to limit that ability.

benaadams commented 7 years ago

The output of events is going to be high speed (millions of times a second); the aggregation is going to be slow (once per second) - so the important factor is not having contention when recording - rather than when obtaining aggregate value.

So something like

class Counter
{
    private readonly ThreadLocal<ulong> _counter;

    public Counter()
    {
        _counter = new ThreadLocal<ulong>(true);
    }

    public ulong Value
    {
        get
        {
            ulong value = 0;
            var list = _counter.Values;
            var count = list.Count;
            for (var i = 0; i < count; i++)
            {
                value += list[i];
            }
            return value;
        }
    }

    public void Increment() => _counter.Value++;

    public void Decrement() => _counter.Value--;

    public void Add(ulong value) => _counter.Value += value;

    public void Subtract(ulong value) => _counter.Value -= value;
}
SergeyKanzhelev commented 7 years ago

@benaadams my point was that there should be an API that allows framework to expose counters which are cheap to aggregate. If API will be limited to exposing events with the name and value - lookup or memory allocation will be required.

For all non-that-frequent events @vancem and @cwe1ss suggestion to use eventing system works perfectly fine

analogrelay commented 7 years ago

@benaadams TIL :)

benaadams commented 7 years ago

@anurse you can change it to a double if you like ;-)

analogrelay commented 7 years ago

@benaadams I'd prefer a double double

Drawaes commented 7 years ago

So here is my point of view on what has been said....

@anurse 100% I agree. I think people here are conflating that aggregation and the generation parts of a metrics system. I do not want logic in my application to do this, and if I do I will at an ILogger or whatever that does it. Every logging system I have ever used in massive environments (both commercial and OSS) have had a local "agent" where "all" events are sent. These are then aggregated, summed, stddev, gauged or whatever on the host and sent to a server but NEVER in proc.

There are multiple reasons for this, including you now need all that logic in C# classes that might just not be available. The next of course is you either have to hard code this into your application or build an entire configuration system for you application. If you are using something like Telegraf/statd[insert local agent] then you can config that as an Ops person for your taste.

You might want to aggregate http 500's today because you are getting a lot, but then you fix all your code an hardly ever get them so you don't want any aggregation, this shouldn't be part of the API this is higher up the stack.

What is important is a single consistent API that is easy to hook into, and yes I want it to provide stats from "non" DI stuff in the framework but also from DI'd stuff in ASP.NET. I am sure there is a way to do that, even if it is just static instances loaded of ILoggers or similar, there are enough smart people around this table to figure it out.

As for the double only thing... don't do that, one int type and one float type of whatever bitness is fine. Often you can use delta compression on a stream of int's but if you hit a double rounding problem 0.99999999999999 it makes that delta horrible to encode. In fact often metrics are stored as int's only and have fixed point rather than floating point. So an API that I could define precision would be aces ;)

(For any "aggregation" which is kinda off topic as this should just be pushing values to an ILogger or whatever) Lastly I can't post code due to it not being "open" but there is no reason not to use ThreadLocalStorage on this one. We do it already, and it avoids interlocks, which while "cheap" on large multi cpu systems with high contention (eg loads of the same metric being produced) causes cache thrashing and noticeable impact. It's not an AsyncLocal case because you just wanna stop threads writing the same time.

I really think that "the purpose of metric is aggregation" is not 100% accurate. The purpose is to have light touch information coming out of a system in real time. We have petabytes of logs, and if I want "statistical" reporting or insights I can have those already. What metrics provide is what is happening right now. Anything more than that (Zipkin etc) we drill down into logs.

benaadams commented 7 years ago

Lastly I can't post code due to it not being "open" but there is no reason not to use ThreadLocalStorage on this one. We do it already, and it avoids interlocks, which while "cheap" on large multi cpu systems with high contention (eg loads of the same metric being produced) causes cache thrashing and noticeable impact. It's not an AsyncLocal case because you just wanna stop threads writing the same time.

Is what ThreadLocal<T> gives you; though can cut the allocations .Values etc if using internally and using by a single thread

Drawaes commented 7 years ago

Sorry my mistake I read it on my phone on the crowded tube I though you used async local ;) yeah either thread storage is fine.

(I should have known you wouldn't make that rookie mistake :) )

brianrob commented 7 years ago

There are certainly a number of architectural decisions that need to be made here, but I do like the fact that we start with not adding new logging mechanisms.

One thing that I should point out is that whatever happens at this level of the stack should simply layer on-top of what exists at the layers below - the idea being that anyone who hooks into an existing lower-level tracing mechanism gets this data for free. If something is going to happen in the ILogger API surface, then it is ideally just to make usage at this level of the stack natural and easy. The data just flows towards the lower-level logging mechanisms and then there is one choke point where the data becomes consumable (ideally something like EventListener or DiagnosticListener).

I should also add a plug here for https://github.com/dotnet/designs/pull/15, which a proposed design for performance monitoring. The idea here is the same - it hooks into existing mechanisms and simply provides more data that can be consumed should the consumer have a desire to.

analogrelay commented 7 years ago

As for the double only thing... don't do that, one int type and one float type of whatever bitness is fine.

Feedback definitely noted :). We'll add some kind of different numeric types. Probably we'll add support for all the numeric types. It might mean listeners need to implement multiple RecordValue methods instead of 1, but I think it's worth it (and easily helped by a base-class)

galvesribeiro commented 7 years ago

@anurse

It looks like in your suggestion aggregation and consumption parameters being specified at the place you write the metrics in your pattern. From what I've seen, there's enough variation in how users want to aggregate that it feels cleaner to push the aggregation logic out to the consumption end. Is that not something you've seen in your scenarios? That would be valuable feedback for us :)

The aggregation doesn't happen on the publisher side, it is just a configuration that tell the IMetric<T> whether the publisher should or not flush that metric right away to the consumer. The consumer can also have other logic to windowing/batching before push it to the underlying/external telemetry system/db. The capture (even with aggregation) can happen pretty fast while the consumer is slow.

Looking at Orleans actor implementation (aka Grain) you have the following:

public class MyGrainWithCounter : Grain, IMyGrainWithCounter
{
    private ICounterMetric<int> _myIntCounter;
    private ITimeSpanMetric _myTSMetric;

    public MyGrainWithCounter(
        [CounterMetricFacet(Name = "MyIntCounter", IsShared = true, Aggregate = true, BatchCount = 100)] myIntCounter,
        [TimeSpanMetricFacet(Name = "MyTSMetric", AutoFlush = true, Window = TimeSpan.FromMinutes(1))] myTSMetric)
    {
        _myIntCounter = myIntCounter;
        _myTSMetric = myTSMetric;
    }

    public async Task DoSomethingLong()
    {
        await _myTSMetric.BeginRecording();
        await Task.Delay(100000); //Do something really useful and that takes time...
        await _myTSMetric.EndRecording();
    }

    public async Task AddSomethingForMe(int amount)
    {
        //To some business logic for the grain method and after we are done increment the counter with the value passed by
        await _myIntCounter.Increment(amount);

        // We need the APM tool to receive the metric data right away, so flush it!
        await _myIntCounter.Flush();
    }
}

In this particular case, since ICounterMetric<int> was configured as IsShared that means that it will share the counter resources across multiple actors using that same counter name (ofc by respecting the concurrency by using cheap locks for example). If you set it as IsShared=false that means that particular metric is only for that given actor instance.

That flexibility allow multiple kinds of metrics to be collected regardless if shared or not, and regardless of whether it is batched or not.

So, answer that first question, the aggregation logic can happen in two places. At the object level and at the consumer level. At the object (non-shared metrics) you can aggregate data from a particular instance of the object. At the consumer, it can have many different logic implementation to aggregate and batch it.

but we also realized that a pattern where the producer can emit key-value pairs and consumers can decide if they can support them or not

Yeah, that was my point. If that key-value pairs are going to be used or not, it is up to the consumer implementation.

I'd hesitate to support IMetric as I think many providers would have a lot of trouble with aggregating say a string ;). Providing the ability to use any numeric type and propagate that type down to the provider (so the provider can apply any necessary optimizations) does seem reasonable though. Just have to think of the way to do it (since you can't do where T: int | double | decimal | ... ... yet?)

People use to think that metrics can be achieved with just numbers. I would say that if we compare with logs, they are just strings. Today there is a trend to mess log with contextual data and you have Structured Logs. As much as I don't like to rely on those structured string parsers, I think have the flexibility to declare whatever data structure you want to represent your metrics can have its use. Telemetry (both logs and metrics) can have structured data on it. If you want to create your own consumer that handlers a strongly-typed data structure that holds all the payload content you want for your telemetry, why should we block it?

I know that there is a key-value on each pair, but sometimes, it would be better if we have strong typing to describe the events. If you look at AI SDK you have RequestTelemetry, ExceptionTelemetry, EventTelemetry etc. All them follow a strongly-typed object hierarchy that gives so much meaning to telemetry than just a simple PerfCounter increment.

IMHO, I think people give Telemetry not much attention and resume it to just "PerformanceCounters" and "Plain Logs".

analogrelay commented 7 years ago

People use to think that metrics can be achieved with just numbers. I would say that if we compare with logs, they are just strings.

Do you have an example of a string that is a metric and wouldn't otherwise be represented as an Event? All the possible consumers I've explored (Influx, Statsd, App Insights) only support numeric types for metrics. Metrics are values which are viewed almost entirely as aggregates (yes, as @Drawaes points out you may want to view the current value of the metric, but in general you aggregate). We want to support offloading that aggregation to later processing and thus want to avoid aggregating in-process, but it's still a value that is designed to be aggregated.

I know that there is a key-value on each pair, but sometimes, it would be better if we have strong typing to describe the events

I'm pretty sure @benaadams will hunt me down if I make him allocate objects to write a metric ;). Metrics are often written in the hot paths of an app. Being able to emit a single double on the stack and have it stuffed into a ring buffer or aggregated with other values all without allocating is pretty valuable to making sure metrics can be recorded down all the hot paths.