dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Feature Request]: Better ILogger sampling and filtering #82465

Open noahfalk opened 1 year ago

noahfalk commented 1 year ago

Problem:

Large scale .NET customers are constantly complaining that logging is too expensive to turn on in production, perf, or stress scenarios. Their external log storage supports sampling of logs to limit storage costs, but that doesn't address the perf impact of collecting and exporting the data from the app.

Each log level adds an order of magnitude to the number of records produced, which makes collection prohibitively expensive; yet not collecting them reduces the chances of detecting problems early or of finding further performance optimizations.

Existing partial solutions:

- When log statements are created in code controlled by the application developer, they could add logic to make the log conditional. For example they might write `if (Activity.Current?.Recorded == true) { _logger.LogInformation(...); }`. This solution doesn't work for log statements in other libraries and it feels awkward to tie logging code from all over to a specific log sampling strategy.
- When logging with the console sink, developers can use QueueFullMode (https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.logging.console.consoleloggeroptions.queuefullmode?view=dotnet-plat-ext-7.0) to drop log messages that can't be flushed fast enough. This does avoid some performance issues with blocking on console output, but with little control over which messages get preserved, the log might be filled with low-value messages while important content gets dropped.

Better potential solutions:

UPDATE 1/12: I have added a more specific proposed solution in the discussion below

There are several things we might want to look into and they aren't mutually exclusive:

- Filtering by event ID - this would allow users to turn on/off specific messages from a library without necessarily seeing all other messages at the same logging level.
- Random sampling - the app developer can select some percentage of the total messages they want to keep, and each message uses an RNG to decide if it will be filtered out.
- Request-based sampling - rather than deciding on a message-by-message basis what should be preserved, this approach tries to decide whether an entire request will be logged or not. This can be an improvement over random sampling because it ensures that when you do see a message in the log, such as an error or some interesting state, you can also find all the other log messages for the same request which led up to it.
- Rate limiting - rather than trying to define a specific fraction or type of message to filter in advance, we could conditionally filter messages only when load spikes. This can be a useful safety backstop to ensure that a sudden failure or load increase on logging doesn't spill over into other problems because the logging is doing too much IO or using too much CPU.

Ideally any/all of the above options would be configurable in code or via reloadable configuration, and they are generic enough to work with any log source and sink. SREs may want to use config settings to mitigate an active issue where production is logging too much, or they may want to proactively configure these options to curate their log output. Being able to change configuration on the fly enables rapid exploration of problem hypotheses without needing to take the service offline to update it.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-extensions-logging See info in area-owners.md if you want to be subscribed.

Issue Details
Author: noahfalk
Assignees: -
Labels: `area-Extensions-Logging`
Milestone: Future
samsp-msft commented 1 year ago

Added some additional comments

Tratcher commented 1 year ago

Ideally all of these would be configurable by category, like log level is today. E.g. for a random request I want to see the routing, auth, and MVC logs at debug level and higher.
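
For reference, today's per-category configuration in appsettings.json already works this way for levels; the idea would be to hang sampling/filtering rules off the same category hierarchy (the categories below are just an example):

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore.Routing": "Debug",
      "Microsoft.AspNetCore.Authentication": "Debug",
      "Microsoft.AspNetCore.Mvc": "Debug"
    }
  }
}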

reyang commented 1 year ago

SREs may want to use config settings to mitigate an active issue where production is logging too much, or they may want to proactively configure these options to curate their log output. Being able to change configuration on the fly enables rapid exploration of problem hypotheses without needing to take the service offline to update it.

Does this imply that the SRE should be able to get a full list of logs that are available? (e.g. discover all the logger categories/instances)

Tratcher commented 1 year ago

SREs may want to use config settings to mitigate an active issue where production is logging too much, or they may want to proactively configure these options to curate their log output. Being able to change configuration on the fly enables rapid exploration of problem hypotheses without needing to take the service offline to update it.

Does this imply that the SRE should be able to get a full list of logs that are available? (e.g. discover all the logger categories/instances)

More likely they're told which logs to enable for a given feature area or situation.

samsp-msft commented 1 year ago

Discovery of telemetry is a general problem, without good solutions. The same problem exists for Activities, Counters, EventSource, & Logging.

It's further complicated by conditional code in the app, which means that log messages will only be sent if that log provider and level are enabled.

There is no good way to statically analyze the application and produce a list of log entries, events, etc. The main way to discover them seems to be through reading the code, reading documentation, or using a tool that captures all of them and enumerates the results. I wrote one such tool for EventSource that listens to all the events while the app is running and then provides details of them, but using it requires modifying the app code.

In theory an SDK like OpenTelemetry could include a similar feature - a form of collection mode that can be run for a period and then produce a report with all the telemetry that could be collected. It's semi-expensive to run, in that all the code paths for generating the telemetry get exercised, but it isn't egressing much data out of the application.

For that reason it would probably be a good feature for dotnet-counters, dotnet-trace, and dotnet-monitor.

Does the App Insights SDK have any command-and-control capability from the backend to the app? This is something I'd expect an SRE to run occasionally when examining a problem, rather than something the app is written to send.

Maybe this is a case where you'd use one of the dotnet-* tools from a console to interrogate the running app to determine what it's emitting.

reyang commented 1 year ago

Discovery of telemetry is a general problem, without good solutions. The same problem exists for Activities, Counters, EventSource, & Logging.

I have a different understanding. I think ActivitySource, Meter and EventSource allow folks to inject a "listener", and that listener will get notified of all the existing sources and of any new sources created in the future - which means there is a central way of discovering all the sources.

ILogger has a different philosophy: there is no such central place.

reyang commented 1 year ago

E.g. https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.tracing.eventlistener?view=net-7.0
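
For illustration, the central discovery that the listener model provides looks roughly like this minimal sketch (names here are illustrative):

using System;
using System.Diagnostics.Tracing;

// OnEventSourceCreated is called for every EventSource that already exists and
// for any created later, which is what makes central discovery of sources possible.
sealed class SourceDiscoveryListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        Console.WriteLine($"Discovered EventSource: {eventSource.Name} ({eventSource.Guid})");
    }
}

// Keeping an instance alive is enough to receive the notifications:
// using var listener = new SourceDiscoveryListener();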


samsp-msft commented 1 year ago

With the above, it will tell you the sources, but not the events that occur - which is half the value, because knowing which events are fired is what lets you pick the appropriate settings to see them.

With ILogger, you could create a new provider, and configure it to enumerate the events that are fired, rather than log each of them individually.

reyang commented 1 year ago

With the above, it will tell you the sources, but not the events that occur - which is half the value, because knowing which events are fired is what lets you pick the appropriate settings to see them.

With ILogger, you could create a new provider, and configure it to enumerate the events that are fired, rather than log each of them individually.

Would you elaborate a bit?

With the listener/source model, it is possible to build a listener that catches all the events; it might not be the most efficient way, but it provides a mechanism for folks to get what they want if they are willing to pay the cost.

I don't know of a way to get a holistic view of "given an application which uses several 3rd-party libs, find all the loggers and hook all the logs/records" unless I use a debugger.

samsp-msft commented 1 year ago

Following the code in https://learn.microsoft.com/en-us/dotnet/core/extensions/custom-logging-provider, I would expect to have a new logger, say LogEntryEnumeratorLogger, that works like this:

It would collect information on each log entry passed through it, building up a dictionary of the unique log messages - capturing details such as the category, level, event ID, and message format, along with a count of how many times each was seen.

Then when Dispose is called, it would write out the entries to either the console or a file.

It would need to be enabled in Program.cs, in addition to whatever other providers the app wants to use:

var builder = WebApplication.CreateBuilder(args);

builder.Logging.AddLogEntryEnumeratorLogger();

And then configured in appsettings.json, controlling whether to collect the info and how to output it:

{
  "Logging" : {
    "LogEntryEnumeratorLogger" : {
      "Enabled" : true,
      "OutputTo" : "File",
      "FilePath" : "./log_entries.json"
    }
  }
}

There would be a cost to operate the logger, as it would be building up a structure of all the unique log messages. It wouldn't want to format the messages, just collect the format string itself. If the log entry has an ID, then for efficiency it can assume that all messages with that ID would be the same, so it can shortcut having to compare them all. For messages without an ID, it would compare the format string.

When 3rd-party libraries log, they should be retrieving their ILogger instances from DI, so as long as this provider was registered before they are called, it will be able to capture their entries too.
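
A minimal sketch of such a provider, just to make the shape concrete (this is illustrative only - it is not the gist referenced in the next comment, and it skips the options and output-file handling described above):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Logging;

public sealed class LogEntryEnumeratorLoggerProvider : ILoggerProvider
{
    // One entry per unique (category, level, event id, format string), with a count.
    private readonly ConcurrentDictionary<(string Category, LogLevel Level, int Id, string Format), long> _entries = new();

    public ILogger CreateLogger(string categoryName) => new EnumeratorLogger(categoryName, _entries);

    public void Dispose()
    {
        // Dump one row per unique message rather than one row per log call.
        foreach (var ((category, level, id, format), count) in _entries)
            Console.WriteLine($"{category}\t{level}\t{id}\t{format}\t{count}");
    }

    private sealed class EnumeratorLogger : ILogger
    {
        private readonly string _category;
        private readonly ConcurrentDictionary<(string Category, LogLevel Level, int Id, string Format), long> _entries;

        public EnumeratorLogger(string category, ConcurrentDictionary<(string Category, LogLevel Level, int Id, string Format), long> entries)
        {
            _category = category;
            _entries = entries;
        }

        public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;

        public bool IsEnabled(LogLevel logLevel) => true;

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception? exception, Func<TState, Exception?, string> formatter)
        {
            // Keep the message template ({OriginalFormat}) rather than the formatted
            // text so that identical templates collapse into a single entry.
            string format = state is IReadOnlyList<KeyValuePair<string, object?>> list
                ? list.FirstOrDefault(kvp => kvp.Key == "{OriginalFormat}").Value?.ToString() ?? ""
                : "";
            _entries.AddOrUpdate((_category, logLevel, eventId.Id, format), 1, (_, count) => count + 1);
        }
    }
}

public static class LogEntryEnumeratorLoggerExtensions
{
    public static ILoggingBuilder AddLogEntryEnumeratorLogger(this ILoggingBuilder builder)
    {
        builder.AddProvider(new LogEntryEnumeratorLoggerProvider());
        return builder;
    }
}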

samsp-msft commented 1 year ago

So it is Friday, and it was a good opportunity to write some code to delve more into the logging implementation, so I hacked up a quick demo.

Using https://gist.github.com/samsp-msft/37a5aba3ef7abb5bbadf01ed48477886 and

builder.Logging.AddLogEnumeratorLogger();
...
app.MapGet("/Logs", () => LogEnumeratorLoggerProvider.dumpLogInfo());

and a configuration of

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning",
      "System.Net.Http": "Warning"
    },
    "Console": {
      "IncludeScopes": true,
      "LogLevel": {
        "Microsoft": "Warning",
        "Default": "Information"
      }
    },
    "LogEnumerator": {
      "LogLevel": {
        "Default" : "Trace"
      }
    }
  }
}

| Category | Level | ID | Message Format | Count |
| --- | --- | --- | --- | --- |
| System.Net.Http.HttpClient.Default.ClientHandler | Information | 100 | Sending HTTP request {HttpMethod} {Uri} | 120 |
| System.Net.Http.HttpClient.Default.ClientHandler | Information | 101 | Received HTTP response headers after {ElapsedMilliseconds}ms - {StatusCode} | 119 |
| System.Net.Http.HttpClient.Default.ClientHandler | Trace | 102 | | 120 |
| System.Net.Http.HttpClient.Default.ClientHandler | Trace | 103 | | 119 |
| Microsoft.AspNetCore.HostFiltering.HostFilteringMiddleware | Trace | 2 | All hosts are allowed. | 4 |
| Microsoft.AspNetCore.HostFiltering.HostFilteringMiddleware | Debug | -- | Wildcard detected, all requests with hosts will be allowed. | 1 |
| Microsoft.AspNetCore.Routing.EndpointMiddleware | Information | 1 | Executed endpoint '{EndpointName}' | 3 |
| Microsoft.AspNetCore.Routing.EndpointMiddleware | Information | -- | Executing endpoint '{EndpointName}' | 4 |
| Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware | Error | 1 | An unhandled exception has occurred while executing the request. | 1 |
| Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware | Warning | 2 | The response has already started, the error page middleware will not be executed. | 1 |
| Microsoft.Hosting.Lifetime | Information | 14 | Now listening on: {address} | 2 |
| Microsoft.Hosting.Lifetime | Information | -- | Hosting environment: {EnvName} | 1 |
| Microsoft.Hosting.Lifetime | Information | -- | Content root path: {ContentRoot} | 1 |
| Microsoft.Hosting.Lifetime | Information | -- | Application started. Press Ctrl+C to shut down. | 1 |
| Microsoft.Extensions.Hosting.Internal.Host | Debug | 1 | Hosting starting | 1 |
| Microsoft.Extensions.Hosting.Internal.Host | Debug | 2 | Hosting started | 1 |
| Microsoft.AspNetCore.Routing.Matching.DfaMatcher | Debug | 1001 | {CandidateCount} candidate(s) found for the request path '{Path}' | 4 |
| Microsoft.AspNetCore.Routing.Matching.DfaMatcher | Debug | 1005 | Endpoint '{Endpoint}' with route pattern '{RoutePattern}' is valid for the request path '{Path}' | 2 |
| Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets | Debug | 6 | Connection id "{ConnectionId}" received FIN. | 2 |
| Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets | Debug | 7 | Connection id "{ConnectionId}" sending FIN because: "{Reason}" | 2 |
| Microsoft.AspNetCore.Hosting.Diagnostics | Information | 1 | | 4 |
| Microsoft.AspNetCore.Hosting.Diagnostics | Information | 2 | | 3 |
| Microsoft.AspNetCore.Hosting.Diagnostics | Debug | 13 | Loaded hosting startup assembly {assemblyName} | 3 |
| System.Net.Http.HttpClient.Default.LogicalHandler | Information | 100 | Start processing HTTP request {HttpMethod} {Uri} | 120 |
| System.Net.Http.HttpClient.Default.LogicalHandler | Information | 101 | End processing HTTP request after {ElapsedMilliseconds}ms - {StatusCode} | 119 |
| System.Net.Http.HttpClient.Default.LogicalHandler | Trace | 102 | | 120 |
| System.Net.Http.HttpClient.Default.LogicalHandler | Trace | 103 | | 119 |
| Microsoft.AspNetCore.Server.Kestrel.Http2 | Debug | 35 | Trace id "{TraceIdentifier}": HTTP/2 stream error "{error}". A Reset is being sent to the stream. | 1 |
| Microsoft.AspNetCore.Server.Kestrel.Http2 | Trace | 37 | Connection id "{ConnectionId}" received {type} frame for stream ID {id} with length {length} and flags {flags}. | 9 |
| Microsoft.AspNetCore.Server.Kestrel.Http2 | Trace | 49 | Connection id "{ConnectionId}" sending {type} frame for stream ID {id} with length {length} and flags {flags}. | 91 |
| Microsoft.AspNetCore.Server.Kestrel.Https.Internal.HttpsConnectionMiddleware | Debug | 1 | Failed to authenticate HTTPS connection. | 2 |
| Microsoft.AspNetCore.Server.Kestrel.Https.Internal.HttpsConnectionMiddleware | Debug | 3 | Connection {connectionId} established using the following protocol: {protocol} | 1 |
| Microsoft.AspNetCore.Routing.EndpointRoutingMiddleware | Debug | 1 | Request matched endpoint '{EndpointName}' | 4 |
| Microsoft.AspNetCore.Server.Kestrel.Connections | Debug | 1 | Connection id "{ConnectionId}" started. | 3 |
| Microsoft.AspNetCore.Server.Kestrel.Connections | Debug | 2 | Connection id "{ConnectionId}" stopped. | 2 |
| Microsoft.AspNetCore.Server.Kestrel.Connections | Debug | 39 | Connection id "{ConnectionId}" accepted. | 3 |
| LoadTest | Information | -- | using service url {url} | 2 |
| LoadTest | Trace | -- | Thread {threadID} making request {count} | 120 |
| LoadTest | Information | -- | Loading service #{index} for {duration} with {threads} | 2 |
| Microsoft.AspNetCore.Server.Kestrel | Error | 13 | Connection id "{ConnectionId}", Request id "{TraceIdentifier}": An unhandled exception was thrown by the application. | 1 |
samsp-msft commented 1 year ago

This also makes me question whether we need to have filtering built in - or whether it could be another Logging provider that you would nest other loggers under.

builder.Logging.AddLogFilter(logs =>
{
  logs.AddConsoleLogger();
  logs.AddXXXLogger();
});

All of which could then be managed through config, as to what filtering is desired.
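
As a rough sketch of what such a filtering wrapper could look like with today's abstractions (AddLogFilter above is hypothetical; the predicate here stands in for whatever config-driven policy is desired):

using System;
using Microsoft.Extensions.Logging;

public sealed class FilteringLoggerProvider : ILoggerProvider
{
    private readonly ILoggerProvider _inner;
    private readonly Func<LogLevel, EventId, bool> _shouldLog;

    public FilteringLoggerProvider(ILoggerProvider inner, Func<LogLevel, EventId, bool> shouldLog)
        => (_inner, _shouldLog) = (inner, shouldLog);

    public ILogger CreateLogger(string categoryName)
        => new FilteringLogger(_inner.CreateLogger(categoryName), _shouldLog);

    public void Dispose() => _inner.Dispose();

    private sealed class FilteringLogger : ILogger
    {
        private readonly ILogger _inner;
        private readonly Func<LogLevel, EventId, bool> _shouldLog;

        public FilteringLogger(ILogger inner, Func<LogLevel, EventId, bool> shouldLog)
            => (_inner, _shouldLog) = (inner, shouldLog);

        public IDisposable? BeginScope<TState>(TState state) where TState : notnull => _inner.BeginScope(state);

        public bool IsEnabled(LogLevel logLevel) => _inner.IsEnabled(logLevel);

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception? exception, Func<TState, Exception?, string> formatter)
        {
            // Drop the message here (sampling, rate limiting, event-id filtering, ...)
            // instead of baking the policy into every sink.
            if (_shouldLog(logLevel, eventId))
                _inner.Log(logLevel, eventId, state, exception, formatter);
        }
    }
}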

cijothomas commented 1 year ago

It would collect information on each log entry passed through it, building up a dictionary of the unique log messages.

@samsp-msft You'll only be able to collect information for those logs which were made using ILogger instances obtained from the same LoggerFactory to which the LoggerProvider was added. I could set up one or many other LoggerFactory instances with different providers, and there is no way to get that information inside the provider you create. (Which is different from Activity/Meter/EventSource - they all allow setting up listeners, so there is no way to "hide" anything.)
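
A small illustration of that isolation, reusing the enumerator provider sketched earlier (purely illustrative):

using Microsoft.Extensions.Logging;

// Two independent factories: a provider registered with factoryA never sees
// log calls made through loggers created by factoryB.
using var factoryA = LoggerFactory.Create(b => b.AddProvider(new LogEntryEnumeratorLoggerProvider()));
using var factoryB = LoggerFactory.Create(b => b.AddConsole());

ILogger fromA = factoryA.CreateLogger("Demo");
ILogger fromB = factoryB.CreateLogger("Demo");

fromA.LogInformation("Seen by the enumerator provider");
fromB.LogInformation("Never seen by it - there is no cross-factory listener to hook");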

Tratcher commented 11 months ago

More examples of wanting to emit a specific log from a category: https://github.com/microsoft/reverse-proxy/discussions/2202

halter73 commented 9 months ago

#49924 appears to be a special case of this issue.

noahfalk commented 6 months ago

I'd like to propose a narrow and more concrete solution to see if we can move this forward. We would add a new API similar to this (I'm not too concerned with exact naming or signature, the proposal is primarily about the general concept and scope):

public static class LoggingFilter
{
    public static void AddDynamicFilter(
        this ILoggingBuilder builder,
        string category, 
        Func<LogLevel, EventId, TState, Exception?, Func<TState,Exception?,string>, bool> filter);
}

Every time ILogger.Log() gets called on an ILogger produced from this ILoggerFactory it would do:

  1. [pre-existing behavior] if IsEnabled() is false, return without logging
  2. [new behavior] call each dynamic filter delegate with a matching category prefix; if any of the filters returns false, return without logging
  3. [pre-existing behavior] forward the call to registered ILoggerProviders

Some possible usage examples:

// sample a randomly selected 5% of logs
builder.Logging.AddDynamicFilter(null,
    (level, eventId, state, exception, formatter) => Random.Shared.NextDouble() < 0.05);

// only log messages if there is an active Activity which is marked as Recorded
// (distributed tracing sampling rules determine this)
builder.Logging.AddDynamicFilter(null,
    (level, eventId, state, exception, formatter) => Activity.Current?.Recorded == true);

// eliminate a known noisy log message
builder.Logging.AddDynamicFilter("FooCorp.SomeCategoryName",
    (level, eventId, state, exception, formatter) => eventId != UndesiredEventId);

// rate limiting
// assuming some helper rateLimiter object has already been written that keeps track
// of an incrementing count and resets it on a timer
builder.Logging.AddDynamicFilter(null,
    (level, eventId, state, exception, formatter) => rateLimiter.IsNewMessageWithinLimit());

My immediate focus is less on the details and more on getting some agreement about the scope of solution. In particular I am suggesting these things are out of scope for this feature:

  • Programmatic discovery of different events - there is a lot of discussion above about discovery of categories and events which I think can be useful in its own right, but I don't see it as a hard dependency to do filtering. All of those examples above seem useful, I don't know how someone would implement them today, and they don't rely on any form of dynamic event discovery.
  • Higher level abstractions that define filtering policies or compose them in unique ways - There is certainly room for external libraries (perhaps OpenTelemetry?) to implement a variety of filtering strategies out-of-the-box and provide strongly typed APIs for them, but I don't think there is an immediate need to have such APIs baked into M.E.Logging.Abstractions.
  • Tail-sampling/filtering - Recording logged events into a buffer and then filtering them later can be a powerful technique to only log data about interesting requests (such as slow requests or failed requests). However ILoggerProviders are generally not designed to expect data that has been buffered, so the ILogger API level is not the ideal place to implement this kind of filtering. This type of filtering is probably better suited within the implementation of specific ILoggerProviders that are managing buffers of serialized log data, or out-of-process in collection agents and backend processing systems.

JamesNK commented 6 months ago
public static class LoggingFilter
{
    public static void AddDynamicFilter(
        this ILoggingBuilder builder,
        string category, 
        Func<LogLevel, EventId, TState, Exception?, Func<TState,Exception?,string>, bool> filter);
}

I'm not sure TState would work here. Where does the type information come from?

Being able to access the state is useful. It would require casting it to IReadOnlyList<KeyValuePair<string, object>> but finding a pay-for-play way of doing that is tough.

We don't want to preemptively cast state to a list because that could cause boxing, and state doesn't have to implement the list interface.

Update: Would changing the signature to AddDynamicFilter<TState> be enough?

reyang commented 6 months ago
  1. Do we want category to be the exact category name, or will it support wildcard/prefix matching? (I think the latter would be more useful and better aligned with the existing pattern; the downside is potential overlap and perf cost.)
  2. With AddDynamicFilter, I wonder if folks would need RemoveDynamicFilter or ClearDynamicFilters / RemoveAllDynamicFilters. SetDynamicFilter could be another option. Considering the business need "this would allow users to turn on/off specific messages from a library without necessarily seeing all other messages at the same logging level", removing existing filters could be useful although the user could choose to put the switch/on/off logic in their callbacks.
samsp-msft commented 6 months ago
  1. Do we want category to be the exact category name, or will it support wildcard/prefix matching? (I think the latter would be more useful and better aligned with the existing pattern; the downside is potential overlap and perf cost.)

Through the configuration system, the settings for categories are a prefix match; that would probably be best here. It's not general wildcards or regex, so it should be about as fast as a normal string match.

How will this work with a hierarchy? If one filter is filtering out "foo.bar" and another is trying to filter in "foo.bar.baz", won't the former deny the latter, and so break what the developer is trying to achieve?

Maybe they need to be called from least to most specific, and have the current on/off state included in the parameters. That way more specific filters will still be called, and can make the call based on previous filters in the chain.
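
To make that suggestion concrete, a small sketch of the evaluation order (CategoryFilter and ShouldLog here are illustrative, not a proposed API):

using System;
using System.Collections.Generic;
using System.Linq;

// Example: "foo.bar" is filtered out, but the more specific "foo.bar.baz" turns
// logging back on because it is evaluated later and sees the prior decision.
var filters = new[]
{
    new CategoryFilter("foo.bar", _ => false),
    new CategoryFilter("foo.bar.baz", _ => true),
};
Console.WriteLine(ShouldLog("foo.bar.baz.Component", filters)); // True
Console.WriteLine(ShouldLog("foo.bar.Other", filters));         // False

// Evaluate matching filters from least to most specific prefix, passing along
// the decision made so far.
static bool ShouldLog(string category, IEnumerable<CategoryFilter> filters)
{
    bool decision = true; // default: keep the message
    foreach (var filter in filters
                 .Where(f => category.StartsWith(f.Prefix, StringComparison.Ordinal))
                 .OrderBy(f => f.Prefix.Length))
    {
        decision = filter.Decide(decision);
    }
    return decision;
}

record CategoryFilter(string Prefix, Func<bool, bool> Decide);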

noahfalk commented 6 months ago

So far all the feedback sounds like it is refining the design, so I'm taking it that people are on board with the general premise/scope. I'm not trying to dive too far into the details yet, but here are some off-the-cuff thoughts on the various points raised:

I'm not sure TState would work here. Where does the type information come from?

You are right, a delegate is insufficient. We probably need an interface definition where one of the methods has a generic type parameter on it, or we could exclude TState from the filtering parameter list.

public interface ILoggerDynamicFilter
{
    bool Filter<TState>(...);
}

Do we want category to be the exact category name, or will it support wildcard/prefix matching? (I think the latter would be more useful and better aligned with the existing pattern; the downside is potential overlap and perf cost.)

I was imagining it would be the latter but omitted this for brevity. I expect all the perf cost is paid at startup when determining which filters apply to which ILogger instances. Once that is determined the category string no longer plays a role and the Logger is just enumerating and invoking a list of filters that were determined to match.

With AddDynamicFilter, I wonder if folks would need RemoveDynamicFilter or ClearDynamicFilters / RemoveAllDynamicFilters.

Yeah we might want those. I'd consider any form of Add/Remove/Clear or any other API that manages the list to be in the scope of the proposal. We'd just have to narrow in on exactly which ones we want.

How will this work with a hierarchy? If one filter is filtering out "foo.bar" and another is trying to filter in "foo.bar.baz", won't the former deny the latter, and so break what the developer is trying to achieve?

My initial thought (which I am not at all tied to) was that we'd enumerate every filter that matched and as soon as one returned false then we short circuit. I think this is an area of the design that would need more investigation and/or feedback.

CodeBlanch commented 6 months ago

Being able to access the state is useful. It would require casting it to IReadOnlyList<KeyValuePair<string, object>> but finding a pay-for-play way of doing that is tough.

Getting into the weeds here a bit but I agree with @JamesNK. Working with TState in the ILogger world is tough.

Imagine a bunch of filters doing...

if (state is IReadOnlyList<KeyValuePair<string, object>> list)
{
    ...
}

That is going to cause a lot of boxing.

One thing I would like to see which might help here is for FormattedLogValues to be made public (could be renamed). That would allow code to be written like:

if (typeof(TState) == typeof(FormattedLogValues))
{
    var typedState = (FormattedLogValues)(object)state;
    ...
}
else if (state is IReadOnlyList<KeyValuePair<string, object>> list)
{
   ...
}

The idea being we could avoid boxing (for these filtering cases but also normal log handling) for the common FormattedLogValues-based log messages. And the JIT should be able to specialize the code.

noahfalk commented 6 months ago

The idea being we could avoid boxing

If you are careful about the exact code pattern you use, the JIT can already optimize away the boxing operation without needing to special case the type in code (Example). In the short term you can change the code to this and avoid the boxing:

if (state is IReadOnlyList<KeyValuePair<string, object>>)
{
   ((IReadOnlyList<KeyValuePair<string, object>>)state).WhateverFunction();
}

In the longer term the JIT is still improving their analysis to recognize the box removal optimization in more code patterns. I wouldn't be surprised if your else if code above compiles without boxing in a near future .NET release. I don't think we need to expose more types to allow people to micro-optimize, we can just let the JIT get better at doing the optimization automatically.
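
For completeness, a compilable version of that short-term pattern (GetPropertyCount is just an illustrative name); for a value-type TState the generic code is specialized per type, which is what lets the JIT remove the box:

using System;
using System.Collections.Generic;

// Quick check of the pattern with a simple state object.
var state = new List<KeyValuePair<string, object>> { new("{OriginalFormat}", "Hello {Name}") };
Console.WriteLine(GetPropertyCount(state)); // 1

static int GetPropertyCount<TState>(TState state)
{
    if (state is IReadOnlyList<KeyValuePair<string, object>>)
    {
        // Test then cast in place - the pattern described above that the JIT
        // can compile without boxing for value-type TState.
        return ((IReadOnlyList<KeyValuePair<string, object>>)state).Count;
    }
    return -1;
}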

CodeBlanch commented 5 months ago

Interesting thanks for sharing @noahfalk! That sharplab example is 🤯 Improving the JIT to handle this would be 💯 🚀

cijothomas commented 5 months ago

I'd like to propose a narrow and more concrete solution to see if we can move this forward. We would add a new API similar to this (I'm not too concerned with exact naming or signature, the proposal is primarily about the general concept and scope):

public static class LoggingFilter
{
    public static void AddDynamicFilter(
        this ILoggingBuilder builder,
        string category, 
        Func<LogLevel, EventId, TState, Exception?, Func<TState,Exception?,string>, bool> filter);
}

Every time ILogger.Log() gets called on an ILogger produced from this ILoggerFactory it would do:

  1. [pre-existing behavior] if IsEnabled() is false return without logging
  2. [new behavior] call each dynamic filter delegate with matching category prefix, if any of the filters returns false return without logging.
  3. [pre-existing behavior] forward the call to registered ILoggerProviders

Some possible usage examples:

// sample a randomly selected 5% of logs
builder.Logging.AddDynamicFilter(null,
    (level, eventId, state, exception, formatter) => Random.Shared.NextDouble() < 0.05);

// only log messages if there is an active Activity which is marked as Recorded
// (distributed tracing sampling rules determine this)
builder.Logging.AddDynamicFilter(null,
    (level, eventId, state, exception, formatter) => Activity.Current?.Recorded == true);

// eliminate a known noisy log message
builder.Logging.AddDynamicFilter("FooCorp.SomeCategoryName",
    (level, eventId, state, exception, formatter) => eventId != UndesiredEventId);

// rate limiting
// assuming some helper rateLimiter object has already been written that keeps track
// of an incrementing count and resets it on a timer
builder.Logging.AddDynamicFilter(null,
    (level, eventId, state, exception, formatter) => rateLimiter.IsNewMessageWithinLimit());

My immediate focus is less on the details and more on getting some agreement about the scope of solution. In particular I am suggesting these things are out of scope for this feature:

  • Programmatic discovery of different events - there is a lot of discussion above about discovery of categories and events which I think can be useful in its own right, but I don't see it as a hard dependency to do filtering. All of those examples above seem useful, I don't know how someone would implement them today, and they don't rely on any form of dynamic event discovery.
  • Higher level abstractions that define filtering policies or compose them in unique ways - There is certainly room for external libraries (perhaps OpenTelemetry?) to implement a variety of filtering strategies out-of-the-box and provide strongly typed APIs for them, but I don't think there is an immediate need to have such APIs baked into M.E.Logging.Abstractions.
  • Tail-sampling/filtering - Recording logged events into a buffer and then filtering them later can be a powerful technique to only log data about interesting requests (such as slow requests or failed requests). However ILoggerProviders are generally not designed to expect data that has been buffered so implementing this kind of filtering at the ILogger API level is not the ideal place for it. This type of filtering is probably better suited within the implementation of specific ILoggerProviders that are managing buffers of serialized log data, or out-of-process in collection agents and backend processing systems.

This proposal looks good to me. Thank you.