dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License
35.51k stars 10.04k forks source link

ASP.NET Core metrics #47536

Closed JamesNK closed 1 year ago

JamesNK commented 1 year ago

Background and Motivation

ASP.NET Core has event counters. In .NET 8 we want to add metrics counters. These will sit side-by-side with event counters for backward compatibility.

Metrics counters add new features (histograms, tags) that allow data to be represented by fewer counters. For example, there are event counters in hosting for total-requests and failed-requests counters. One metrics counter can represent these with a tag to represent the status.

Proposed API

Microsoft.AspNetCore.Hosting

Notes: HTTP counters and tags here follow OTel's lead: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md#http-server

http-server-current-requests

Name Instrument Type Unit Description
http-server-current-requests UpDownCounter {request} Number of HTTP requests that are currently active on the server.
Attribute Type Description Examples Presence
method string HTTP request method. GET; POST; HEAD Always
scheme string The URI scheme identifying the used protocol. http; https Always
host string Name of the local HTTP server that received the request. localhost Always
port int Port of the local HTTP server that received the request. 8080 Added if not default (80 for http or 443 for https)

http-server-request-duration

Name Instrument Type Unit Description
http-server-request-duration Histogram s The duration of HTTP requests on the server.
Attribute Type Description Examples Presence
scheme string The URI scheme identifying the used protocol. http; https Always
method string HTTP request method. GET; POST; HEAD Always
status-code int HTTP response status code. 200 Always
protocol string HTTP request protocol. HTTP/1.1; HTTP/2; HTTP/3 Always
host string Name of the local HTTP server that received the request. localhost Always
port int Port of the local HTTP server that received the request. 8080 Added if not default (80 for http or 443 for https)
route string The matched route {controller}/{action}/{id?} Added if route endpoint set
exception-name string Name of the .NET exception thrown during the request. Report exception is either unhandled from middleware or handled by ExceptionHandlerMiddleware or DeveloperExceptionPageMiddleware. System.OperationCanceledException If unhandled exception
Custom tags n/a Custom tags added from IHttpMetricsTagsFeature. organization=contoso n/a

Microsoft.AspNetCore.Server.Kestrel

Notes: All Kestrel counters include the endpoint as a tag.

kestrel-current-connections

Name Instrument Type Unit Description
kestrel-current-connections UpDownCounter {connection} Number of connections that are currently active on the server.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always

kestrel-connection-duration

Name Instrument Type Unit Description
kestrel-connection-duration Histogram s The duration of connections on the server.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always
exception-name string Name of the .NET exception thrown during the connect. Report exception is unhandled If unhandled exception
Custom tags n/a Custom tags added from IConnectionMetricsTagsFeature. organization=contoso n/a

kestrel-rejected-connections

Name Instrument Type Unit Description
kestrel-rejected-connections Counter {connection} Number of connections rejected by the server. Connections are rejected when the currently active count exceeds the value configured with MaxConcurrentConnections.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always

kestrel-queued-connections

Name Instrument Type Unit Description
kestrel-queued-connections UpDownCounter {connection} Number of connections that are currently queued and are waiting to start.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always

kestrel-queued-requests

Name Instrument Type Unit Description
kestrel-queued-requests UpDownCounter {request} Number of HTTP requests on multiplexed connections (HTTP/2 and HTTP/3) that are currently queued and are waiting to start.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always

kestrel-current-upgraded-connections

Name Instrument Type Unit Description
kestrel-current-upgraded-connections UpDownCounter {request} Number of HTTP connections that are currently upgraded (WebSockets). The number only tracks HTTP/1.1 connections.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always

kestrel-tls-handshake-duration

Name Instrument Type Unit Description
kestrel-tls-handshake-duration Histogram {s} The duration of TLS handshakes on the server.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always
protocol string Security protocol used to authenticate the connection. Tls10; Tls11; Tls12; Tls13 Always
exception-name string Name of the .NET exception thrown on TLS handshake failure. System.OperationCanceledException If TLS handshake fails

kestrel-current-tls-handshakes

Name Instrument Type Unit Description
kestrel-current-tls-handshakes UpDownCounter {handshake} Number of TLS handshakes that are currently in progress on the server.
Attribute Type Description Examples Presence
endpoint string Name of the local endpoint that received the connection. localhost:8080 Always

Microsoft.AspNetCore.Http.Connections

Notes: Timed out connection counter is merged into connection-duration counter. I'm unaware of an official connection closed status, so I invented one with some values. @BrennanConroy It would be good if you could help with better end statuses.

signalr-http-transport-current-connections

Name Instrument Type Unit Description
signalr-http-transport-current-connections UpDownCounter {connection} Number of connections that are currently active on the server.

signalr-http-transport-current-transports

Name Instrument Type Unit Description
signalr-http-transport-current-transports UpDownCounter {transport} Number of connection transports that are currently active on the server.
Attribute Type Description Examples Presence
transport string The connection transport None; WebSockets; ServerSentEvents; LongPolling Always

Update: REMOVED

signalr-http-transport-connection-duration

Name Instrument Type Unit Description
signalr-http-transport-connection-duration Histogram s The duration of connections on the server.
Attribute Type Description Examples Presence
status string The connection end status NormalClosure; Timeout; AppShutdown Always
transport string The connection transport None; WebSockets; ServerSentEvents; LongPolling Always

Usage Examples

Alternative Designs

Risks

ghost commented 1 year ago

Thank you for submitting this for API review. This will be reviewed by @dotnet/aspnet-api-review at the next meeting of the ASP.NET Core API Review group. Please ensure you take a look at the API review process documentation and ensure that:

JamesNK commented 1 year ago

@Tratcher @davidfowl @noahfalk @tarekgh @BrennanConroy @samsp-msft

Metrics counters in ASP.NET Core. Covers hosting, Kestrel and SignalR. These are almost all the places counters are used today (there are also some counters in ConcurrencyLimiter, but it's probably being replaced by rate limiting).

davidfowl commented 1 year ago

It is nice to see that we'd have ALOT less counters! How does one look at tls handshake failures? Are we going to do anything to put exception information into metric dimensions?

JamesNK commented 1 year ago

Good question. I forgot exception-name tag on tls-handshake-duration. Updated the list. I chose the exception name because it has a low cardinality.

Exception middleware can add an exception-name tag to request-duration using the feature. That matches R9 HTTP metering middleware functionality. Since we're starting to review these names and tags I'll go ahead and add it now.

reyang commented 1 year ago

@JamesNK regarding the description "Measures the number of concurrent HTTP requests that are currently in-flight.", the words "measure/count/record" have different meaning in the metrics domain, this document gives some context https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#instrument-selection.

noahfalk commented 1 year ago

You may see a performance penalty if the conditionally present attributes cause you to frequently alternate what tags are being provided to the same instrument. The attribute list is defined in the OpenTelemetry spec as an unordered list which requires them to be sorted into a canonical ordering before they can be aggregated. I don't recall if the OTel aggregator did something similar but the aggregator for MetricsEventSource (powering dotnet-counters/dotnet-monitor) has a fast path if the number and order of attributes exactly matches what was given last time. Adding or removing an attribute dynamically makes you miss that fast path and triggers a re-sort. I don't know if the cost rises to the level you will care about it but just something to be aware of if you are benchmarking.

Is there a reason that some places have endpoint as a single tag and others have host+port as two separate tags?

(I'm still poking around, but time for bed, I'll keep looking at it more tomorrow)

JamesNK commented 1 year ago

Is there a reason that some places have endpoint as a single tag and others have host+port as two separate tags?

An HTTP request always has a host and port. The HTTP request counters have them as separate tags.

Meanwhile, a connection has a local endpoint. That could be a DnsEndPoint, which is a host+port. Or it could be a named pipe endpoint with a named pipe path. For a connection counter, we call ToString on the endpoint and use that as the tag. In the case of DnsEndPoint it will be host:port. In the case of NamedPipeEndPoint it will be \\.\pipe\pipename.

Now that you've brought it up, if we go ahead with ConnectionContext.LocalEndPoint.ToString() as a tag value, we'll need to ensure common endpoints cache the ToString value. Don't want to allocate a new string for each measurement.

noahfalk commented 1 year ago

An HTTP request always has a host and port. The HTTP request counters have them as separate tags

Makes sense.

I don't think any of this stuff below changes your plan - its just me thinking aloud because this is our first foray into using Meters and I'm trying to give it due diligence.

I started pondering a little more what this may do to the back-compat story. For tools that explicitly use EventCounters or Meters there is no issue because the APIs for both work fine SxS. For dotnet-counters I tried to avoid having users specify it explicitly by listening to both EventCounters and Meters using the same name. If both respond back (as they would for Microsoft.AspNetCore.Hosting) then the tool is written to automatically prefer using the Meter data and ignoring the EventCounter data. This means users will see a change in behavior by default and I think that is probably OK, but we can give users a way to explicitly opt for the back-compatible EventCounter data if they want it. I expect dotnet-monitor may need something similar. https://github.com/dotnet/diagnostics/issues/3805

We'll also need to figure out how we document our counters now that we are using both APIs. Ideally I'm hoping not to unnecessarily force users to be aware of which API was used to generate metrics data but I think in some cases it is going to be unavoidable. For example I'm hoping we can avoid having different top-level pages for well-known EventCounters vs. well-known Meters but we will probably need to annotate each provider so that users can get that info if they need it.

For the histogram counters the MetricsEventSource currently has pretty low default limits on the number of histograms it will track because each one is quite memory hungry. In the short term people may hit those limits quickly in dotnet-counters/dotnet-monitor. For example the intersection of route+status_code attributes I could easily imagine producing 1000s of combinations in non-trivial apps. We can add options to either reduce memory usage and raise the default limits or to let users more precisely target which dimension values are interesting to them. I don't know whether that work would happen in .NET 8 or not but I think it makes more sense to assume this problem will be alleviated sooner or later rather than to restrict your design based on a point in time constraint.

The metrics themselves seemed pretty reasonable to me. I noticed there might be some small gaps relative to what existed with EventCounters, for example no 'Total Connections Timed Out', but I don't think exact parity is a requirement or goal here. The duration histogram does let people get at how many connections timed-out during each measurement interval and I assume that is the value they would care about far more than the running total since the process started. Worst case adding new metrics in the future based on customer feedback shouldn't be hard and it is better than cluttering the list with things most users won't care about.

JamesNK commented 1 year ago

I noticed there might be some small gaps relative to what existed with EventCounters, for example no 'Total Connections Timed Out', but I don't think exact parity is a requirement or goal here.

I didn't add explicit counters in situations where the number can be figured out from another counter. For example "Total Connections Timed Out" can be calculated by using the number of items recorded by the connection duration counter that have a status=Timeout tag.

For dotnet-counters I tried to avoid having users specify it explicitly by listening to both EventCounters and Meters using the same name.

By the way, Microsoft.AspNetCore.Server.Kestrel meter name is different from the event source name. The event source name used dashes for some reason: Microsoft-AspNetCore-Server-Kestrel. I changed it to be consistent for metrics.

davidfowl commented 1 year ago

The dashes thing is such a pain. I’m Hoping we get to have a clean consistent slate for metrics and event counters can eventually be considered legacy

reyang commented 1 year ago

have a clean consistent slate for metrics and event counters can eventually be considered legacy

+1

noahfalk commented 1 year ago

For example "Total Connections Timed Out" can be calculated by using the number of items recorded by the connection duration counter that have a status=Timeout tag.

I don't think it is an issue and I'm not suggesting you change it, but the number you would get from the histogram would be the number of timed out items during the last measurement interval whereas the original "Total Connections Timed Out" looks like it would have been the running total since the process started. If you have access to all measurement intervals since the app started you could always sum them up, but its possible you won't have access to all the historical measurement data or the added complexity of doing that summation discourages you from doing it. Still, I'm not worried because I doubt the running total was what users would have cared about. I assume it was the rate of change in the total that was important and that they can get readily from the histogram.

By the way, Microsoft.AspNetCore.Server.Kestrel meter name is different from the event source name. The event source name used dashes for some reason: Microsoft-AspNetCore-Server-Kestrel.

Yeah I saw that and I agree that changing it to dots is a good move. Historically it was always convention to name EventSources with dashes and then somewhere around .NET Core 3.0 Vance wanted to change the naming convention and start using dots instead. I assume Microsoft-AspNetCore-Server-Kestrel got named using the older naming convention rather than the newer one but I don't know any of the specifics about exactly why. Perhaps just confusion over which convention to follow when there is more than one.

Hoping we get to have a clean consistent slate for metrics and event counters can eventually be considered legacy

Yep, I think that is the path we are on! We just need to continue adding Meter support to other portions of the stack. In .NET 8 dotnet-monitor should support Meters well so the only significant Microsoft tool I am aware of that still supports EventCounters but not Meters is the AppInsights SDK. As folks move to the newer OTel SDK that gap should diminish in importance.

samsp-msft commented 1 year ago

current-upgraded-requests - is the only scenario that can upgrade a request websockets? Should we be explicit and name this for web sockets, so it doesn't become a problem later if there are other types of upgrades? current-connections - how does http/3 count here? For H1/2 the TCP connection is the key, for H3 they are more virtual correct?

JamesNK commented 1 year ago

current-upgraded-requests - is the only scenario that can upgrade a request websockets?

I'm not familiar with web sockets. @Tratcher @BrennanConroy?

current-connections - how does http/3 count here? For H1/2 the TCP connection is the key, for H3 they are more virtual correct?

QUIC still has the concept of a connection - https://http3-explained.haxx.se/en/quic/quic-connections. The connection counter tracks that. Multiplexed streams on a connection aren't counted. They're tracked as HTTP requests.

davidfowl commented 1 year ago

@JamesNK upgraded connections should be a dimension on current-requests.

BrennanConroy commented 1 year ago

current-upgraded-requests - is the only scenario that can upgrade a request websockets?

Technically no, upgrades can be used for other things. In reality, websockets are the only real usage though.

JamesNK commented 1 year ago

@JamesNK upgraded connections should be a dimension on current-requests.

There is overhead in tags. Too many create a cardinality explosion and also adds to the traffic cost of sending/receiving them. I've tried to limit tags to the core metadata.

Is that information covered in the current-upgraded-requests counter?

JamesNK commented 1 year ago

Updated:

amcasey commented 1 year ago

Typo: "hisogram"

JamesNK commented 1 year ago

API Review Note:

API Approved!

Final meters, counters, tags and their associated information is in the issue body.

JamesNK commented 1 year ago

Added in https://github.com/dotnet/aspnetcore/pull/46834

JamesNK commented 1 year ago

Updated based on changes in https://github.com/dotnet/aspnetcore/issues/48536.