ASP.NET Core metrics

JamesNK commented 1 year ago

Background and Motivation

ASP.NET Core has event counters. In .NET 8 we want to add metrics counters. These will sit side-by-side with event counters for backward compatibility.

Metrics counters add new features (histograms, tags) that allow data to be represented by fewer counters. For example, hosting has separate total-requests and failed-requests event counters. One metrics counter can represent both, using a tag for the status.
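As a rough, language-neutral illustration (Python here, not the .NET `System.Diagnostics.Metrics` API), one tagged counter can answer both questions that previously needed two event counters. The `TaggedCounter` class and the `status` tag values are hypothetical names made up for this sketch:

```python
from collections import Counter


class TaggedCounter:
    """Hypothetical stand-in for a metrics counter that accepts tags."""

    def __init__(self, name):
        self.name = name
        self._series = Counter()  # one running total per distinct tag set

    def add(self, value, **tags):
        # Sort the tags so the same tag set always maps to the same key.
        self._series[tuple(sorted(tags.items()))] += value

    def total(self, **match):
        # Sum every series whose tags include all of the requested pairs.
        return sum(
            count
            for key, count in self._series.items()
            if match.items() <= dict(key).items()
        )


requests = TaggedCounter("requests")
requests.add(1, status="ok")
requests.add(1, status="ok")
requests.add(1, status="failed")

total_requests = requests.total()                  # what total-requests reported
failed_requests = requests.total(status="failed")  # what failed-requests reported
```

Filtering by tag at query time is what lets a single instrument stand in for several untagged ones.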

Proposed API

Microsoft.AspNetCore.Hosting

Notes: HTTP counters and tags here follow OTel's lead: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md#http-server

`http-server-current-requests`

Name	Instrument Type	Unit	Description
`http-server-current-requests`	UpDownCounter	`{request}`	Number of HTTP requests that are currently active on the server.

Attribute	Type	Description	Examples	Presence
`method`	string	HTTP request method.	`GET`; `POST`; `HEAD`	Always
`scheme`	string	The URI scheme identifying the used protocol.	`http`; `https`	Always
`host`	string	Name of the local HTTP server that received the request.	`localhost`	Always
`port`	int	Port of the local HTTP server that received the request.	`8080`	Added if not default (80 for http or 443 for https)

`http-server-request-duration`

Name	Instrument Type	Unit	Description
`http-server-request-duration`	Histogram	`s`	The duration of HTTP requests on the server.

Attribute	Type	Description	Examples	Presence
`scheme`	string	The URI scheme identifying the used protocol.	`http`; `https`	Always
`method`	string	HTTP request method.	`GET`; `POST`; `HEAD`	Always
`status-code`	int	HTTP response status code.	`200`	Always
`protocol`	string	HTTP request protocol.	`HTTP/1.1`; `HTTP/2`; `HTTP/3`	Always
`host`	string	Name of the local HTTP server that received the request.	`localhost`	Always
`port`	int	Port of the local HTTP server that received the request.	`8080`	Added if not default (80 for http or 443 for https)
`route`	string	The matched route	`{controller}/{action}/{id?}`	Added if route endpoint set
`exception-name`	string	Name of the .NET exception thrown during the request. The exception is reported if it is either unhandled by middleware or handled by `ExceptionHandlerMiddleware` or `DeveloperExceptionPageMiddleware`.	`System.OperationCanceledException`	If unhandled exception
Custom tags	n/a	Custom tags added from `IHttpMetricsTagsFeature`.	`organization`=`contoso`	n/a

Microsoft.AspNetCore.Server.Kestrel

Notes: All Kestrel counters include the endpoint as a tag.

`kestrel-current-connections`

Name	Instrument Type	Unit	Description
`kestrel-current-connections`	UpDownCounter	`{connection}`	Number of connections that are currently active on the server.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always

`kestrel-connection-duration`

Name	Instrument Type	Unit	Description
`kestrel-connection-duration`	Histogram	`s`	The duration of connections on the server.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always
`exception-name`	string	Name of the .NET exception thrown during the connection. The exception is reported if it is unhandled.		If unhandled exception
Custom tags	n/a	Custom tags added from `IConnectionMetricsTagsFeature`.	`organization`=`contoso`	n/a

`kestrel-rejected-connections`

Name	Instrument Type	Unit	Description
`kestrel-rejected-connections`	Counter	`{connection}`	Number of connections rejected by the server. Connections are rejected when the currently active count exceeds the value configured with MaxConcurrentConnections.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always

`kestrel-queued-connections`

Name	Instrument Type	Unit	Description
`kestrel-queued-connections`	UpDownCounter	`{connection}`	Number of connections that are currently queued and are waiting to start.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always

`kestrel-queued-requests`

Name	Instrument Type	Unit	Description
`kestrel-queued-requests`	UpDownCounter	`{request}`	Number of HTTP requests on multiplexed connections (HTTP/2 and HTTP/3) that are currently queued and are waiting to start.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always

`kestrel-current-upgraded-connections`

Name	Instrument Type	Unit	Description
`kestrel-current-upgraded-connections`	UpDownCounter	`{connection}`	Number of HTTP connections that are currently upgraded (WebSockets). The number only tracks HTTP/1.1 connections.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always

`kestrel-tls-handshake-duration`

Name	Instrument Type	Unit	Description
`kestrel-tls-handshake-duration`	Histogram	`s`	The duration of TLS handshakes on the server.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always
`protocol`	string	Security protocol used to authenticate the connection.	`Tls10`; `Tls11`; `Tls12`; `Tls13`	Always
`exception-name`	string	Name of the .NET exception thrown on TLS handshake failure.	`System.OperationCanceledException`	If TLS handshake fails

`kestrel-current-tls-handshakes`

Name	Instrument Type	Unit	Description
`kestrel-current-tls-handshakes`	UpDownCounter	`{handshake}`	Number of TLS handshakes that are currently in progress on the server.

Attribute	Type	Description	Examples	Presence
`endpoint`	string	Name of the local endpoint that received the connection.	`localhost:8080`	Always

Microsoft.AspNetCore.Http.Connections

Notes: Timed out connection counter is merged into connection-duration counter. I'm unaware of an official connection closed status, so I invented one with some values. @BrennanConroy It would be good if you could help with better end statuses.

`signalr-http-transport-current-connections`

Name	Instrument Type	Unit	Description
`signalr-http-transport-current-connections`	UpDownCounter	`{connection}`	Number of connections that are currently active on the server.

`signalr-http-transport-current-transports`

Name	Instrument Type	Unit	Description
`signalr-http-transport-current-transports`	UpDownCounter	`{transport}`	Number of connection transports that are currently active on the server.

Attribute	Type	Description	Examples	Presence
`transport`	string	The connection transport	`None`; `WebSockets`; `ServerSentEvents`; `LongPolling`	Always

Update: REMOVED

`signalr-http-transport-connection-duration`

Name	Instrument Type	Unit	Description
`signalr-http-transport-connection-duration`	Histogram	`s`	The duration of connections on the server.

Attribute	Type	Description	Examples	Presence
`status`	string	The connection end status	`NormalClosure`; `Timeout`; `AppShutdown`	Always
`transport`	string	The connection transport	`None`; `WebSockets`; `ServerSentEvents`; `LongPolling`	Always

Usage Examples

Alternative Designs

Risks

ghost commented 1 year ago

Thank you for submitting this for API review. This will be reviewed by @dotnet/aspnet-api-review at the next meeting of the ASP.NET Core API Review group. Please ensure you take a look at the API review process documentation and ensure that:

- The PR contains changes to the reference-assembly that describe the API change. Or, you have included a snippet of reference-assembly-style code that illustrates the API change.
- The PR describes the impact to users, both positive (useful new APIs) and negative (breaking changes).
- Someone is assigned to "champion" this change in the meeting, and they understand the impact and design of the change.

JamesNK commented 1 year ago

@Tratcher @davidfowl @noahfalk @tarekgh @BrennanConroy @samsp-msft

Metrics counters in ASP.NET Core. Covers hosting, Kestrel and SignalR. These are almost all the places counters are used today (there are also some counters in ConcurrencyLimiter, but it's probably being replaced by rate limiting).

davidfowl commented 1 year ago

It is nice to see that we'd have a LOT fewer counters! How does one look at TLS handshake failures? Are we going to do anything to put exception information into metric dimensions?

JamesNK commented 1 year ago

Good question. I forgot the exception-name tag on tls-handshake-duration. Updated the list. I chose the exception name because it has a low cardinality.

Exception middleware can add an exception-name tag to request-duration using the feature. That matches R9 HTTP metering middleware functionality. Since we're starting to review these names and tags I'll go ahead and add it now.

reyang commented 1 year ago

@JamesNK regarding the description "Measures the number of concurrent HTTP requests that are currently in-flight.": the words "measure/count/record" have different meanings in the metrics domain. This document gives some context: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#instrument-selection.

noahfalk commented 1 year ago

You may see a performance penalty if the conditionally present attributes cause you to frequently alternate what tags are being provided to the same instrument. The attribute list is defined in the OpenTelemetry spec as an unordered list which requires them to be sorted into a canonical ordering before they can be aggregated. I don't recall if the OTel aggregator did something similar but the aggregator for MetricsEventSource (powering dotnet-counters/dotnet-monitor) has a fast path if the number and order of attributes exactly matches what was given last time. Adding or removing an attribute dynamically makes you miss that fast path and triggers a re-sort. I don't know if the cost rises to the level you will care about it but just something to be aware of if you are benchmarking.
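To make the cost concrete, here is a minimal Python sketch of an aggregator with that kind of fast path. It is not the MetricsEventSource implementation; `TagAggregator` and its fields are hypothetical, and the only assumption carried over from the description above is that a changed tag list forces a re-sort into canonical order:

```python
class TagAggregator:
    """Hypothetical aggregator that sums measurements per canonical tag set."""

    def __init__(self):
        self.totals = {}         # canonical key -> running sum
        self.resorts = 0         # how many times the slow path ran
        self._last_names = None  # tag names, in caller order, from the last call
        self._order = ()         # cached permutation that sorts those names

    def record(self, value, tags):
        # tags is a list of (name, value) pairs in the order the caller gave them.
        names = tuple(name for name, _ in tags)
        if names != self._last_names:
            # Slow path: the tag list changed shape, so recompute the sort order.
            self._order = sorted(range(len(names)), key=names.__getitem__)
            self._last_names = names
            self.resorts += 1
        # Both paths build the key by applying the cached permutation.
        key = tuple(tags[i] for i in self._order)
        self.totals[key] = self.totals.get(key, 0) + value


agg = TagAggregator()
base = [("scheme", "https"), ("method", "GET")]
agg.record(1, base)                     # first call: sorts once
agg.record(1, base)                     # same names and order: fast path
agg.record(1, base + [("port", 8080)])  # conditional tag appears: re-sort
agg.record(1, base)                     # conditional tag disappears: re-sort again
```

In this sketch, alternating between requests with and without the conditional `port` tag re-sorts on every switch, which is the pattern worth watching for in benchmarks.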

Is there a reason that some places have endpoint as a single tag and others have host+port as two separate tags?

(I'm still poking around, but time for bed, I'll keep looking at it more tomorrow)

JamesNK commented 1 year ago

> Is there a reason that some places have endpoint as a single tag and others have host+port as two separate tags?

An HTTP request always has a host and port. The HTTP request counters have them as separate tags.

Meanwhile, a connection has a local endpoint. That could be a DnsEndPoint, which is a host+port. Or it could be a named pipe endpoint with a named pipe path. For a connection counter, we call ToString on the endpoint and use that as the tag. In the case of DnsEndPoint it will be host:port. In the case of NamedPipeEndPoint it will be \\.\pipe\pipename.

Now that you've brought it up, if we go ahead with ConnectionContext.LocalEndPoint.ToString() as a tag value, we'll need to ensure common endpoints cache the ToString value. Don't want to allocate a new string for each measurement.

noahfalk commented 1 year ago

> An HTTP request always has a host and port. The HTTP request counters have them as separate tags.

Makes sense.

I don't think any of this stuff below changes your plan - it's just me thinking aloud because this is our first foray into using Meters and I'm trying to give it due diligence.

I started pondering a little more what this may do to the back-compat story. For tools that explicitly use EventCounters or Meters there is no issue because the APIs for both work fine SxS. For dotnet-counters I tried to avoid having users specify it explicitly by listening to both EventCounters and Meters using the same name. If both respond back (as they would for Microsoft.AspNetCore.Hosting) then the tool is written to automatically prefer using the Meter data and ignoring the EventCounter data. This means users will see a change in behavior by default and I think that is probably OK, but we can give users a way to explicitly opt for the back-compatible EventCounter data if they want it. I expect dotnet-monitor may need something similar. https://github.com/dotnet/diagnostics/issues/3805

We'll also need to figure out how we document our counters now that we are using both APIs. Ideally I'm hoping not to unnecessarily force users to be aware of which API was used to generate metrics data but I think in some cases it is going to be unavoidable. For example I'm hoping we can avoid having different top-level pages for well-known EventCounters vs. well-known Meters but we will probably need to annotate each provider so that users can get that info if they need it.

For the histogram counters the MetricsEventSource currently has pretty low default limits on the number of histograms it will track because each one is quite memory hungry. In the short term people may hit those limits quickly in dotnet-counters/dotnet-monitor. For example the intersection of route+status_code attributes I could easily imagine producing 1000s of combinations in non-trivial apps. We can add options to either reduce memory usage and raise the default limits or to let users more precisely target which dimension values are interesting to them. I don't know whether that work would happen in .NET 8 or not but I think it makes more sense to assume this problem will be alleviated sooner or later rather than to restrict your design based on a point in time constraint.
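A back-of-the-envelope sketch of how quickly the combinations grow. The per-tag value counts below are made-up numbers for a hypothetical app, and the product is only an upper bound, since not every combination occurs in practice:

```python
import math

# Hypothetical number of distinct values each tag takes in one app.
distinct_values = {"route": 200, "status-code": 8, "method": 4}

# Each distinct tag combination needs its own histogram, so the worst case
# is the product of the per-tag counts.
max_histograms = math.prod(distinct_values.values())  # 200 * 8 * 4
```

Dropping or filtering a single high-cardinality tag such as `route` reduces the bound by that tag's whole factor, which is why targeting specific dimension values helps.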

The metrics themselves seemed pretty reasonable to me. I noticed there might be some small gaps relative to what existed with EventCounters, for example no 'Total Connections Timed Out', but I don't think exact parity is a requirement or goal here. The duration histogram does let people get at how many connections timed-out during each measurement interval and I assume that is the value they would care about far more than the running total since the process started. Worst case adding new metrics in the future based on customer feedback shouldn't be hard and it is better than cluttering the list with things most users won't care about.

JamesNK commented 1 year ago

> I noticed there might be some small gaps relative to what existed with EventCounters, for example no 'Total Connections Timed Out', but I don't think exact parity is a requirement or goal here.

I didn't add explicit counters in situations where the number can be figured out from another counter. For example "Total Connections Timed Out" can be calculated by using the number of items recorded by the connection duration counter that have a status=Timeout tag.

> For dotnet-counters I tried to avoid having users specify it explicitly by listening to both EventCounters and Meters using the same name.

By the way, Microsoft.AspNetCore.Server.Kestrel meter name is different from the event source name. The event source name used dashes for some reason: Microsoft-AspNetCore-Server-Kestrel. I changed it to be consistent for metrics.

davidfowl commented 1 year ago

The dashes thing is such a pain. I'm hoping we get to have a clean, consistent slate for metrics, and event counters can eventually be considered legacy.

reyang commented 1 year ago

> have a clean consistent slate for metrics and event counters can eventually be considered legacy

+1

noahfalk commented 1 year ago

> For example "Total Connections Timed Out" can be calculated by using the number of items recorded by the connection duration counter that have a status=Timeout tag.

I don't think it is an issue and I'm not suggesting you change it, but the number you would get from the histogram would be the number of timed out items during the last measurement interval whereas the original "Total Connections Timed Out" looks like it would have been the running total since the process started. If you have access to all measurement intervals since the app started you could always sum them up, but it's possible you won't have access to all the historical measurement data or the added complexity of doing that summation discourages you from doing it. Still, I'm not worried because I doubt the running total was what users would have cared about. I assume it was the rate of change in the total that was important and that they can get readily from the histogram.
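The difference can be sketched with a few lines of Python. The per-interval counts are invented sample data standing in for the number of `status=Timeout` measurements the duration histogram would report in each interval:

```python
from itertools import accumulate

# Hypothetical count of status=Timeout measurements per measurement interval.
timeouts_per_interval = [0, 3, 1, 0, 5]

# The old event counter reported this running total directly.
running_total = list(accumulate(timeouts_per_interval))

# A collector that attached after the first two intervals can only sum what
# it saw, so its reconstructed total undercounts.
partial_total = sum(timeouts_per_interval[2:])
```

The per-interval list is already the rate of change, so no extra work is needed for that view; only the since-startup total requires the full history.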

> By the way, Microsoft.AspNetCore.Server.Kestrel meter name is different from the event source name. The event source name used dashes for some reason: Microsoft-AspNetCore-Server-Kestrel.

Yeah I saw that and I agree that changing it to dots is a good move. Historically it was always convention to name EventSources with dashes and then somewhere around .NET Core 3.0 Vance wanted to change the naming convention and start using dots instead. I assume Microsoft-AspNetCore-Server-Kestrel got named using the older naming convention rather than the newer one but I don't know any of the specifics about exactly why. Perhaps just confusion over which convention to follow when there is more than one.

> Hoping we get to have a clean consistent slate for metrics and event counters can eventually be considered legacy

Yep, I think that is the path we are on! We just need to continue adding Meter support to other portions of the stack. In .NET 8 dotnet-monitor should support Meters well so the only significant Microsoft tool I am aware of that still supports EventCounters but not Meters is the AppInsights SDK. As folks move to the newer OTel SDK that gap should diminish in importance.

samsp-msft commented 1 year ago

current-upgraded-requests - are WebSockets the only scenario that can upgrade a request? Should we be explicit and name this for WebSockets, so it doesn't become a problem later if there are other types of upgrades?

current-connections - how does HTTP/3 count here? For H1/2 the TCP connection is the key; for H3 they are more virtual, correct?

JamesNK commented 1 year ago

> current-upgraded-requests - are WebSockets the only scenario that can upgrade a request?

I'm not familiar with web sockets. @Tratcher @BrennanConroy?

> current-connections - how does HTTP/3 count here? For H1/2 the TCP connection is the key; for H3 they are more virtual, correct?

QUIC still has the concept of a connection - https://http3-explained.haxx.se/en/quic/quic-connections. The connection counter tracks that. Multiplexed streams on a connection aren't counted. They're tracked as HTTP requests.

davidfowl commented 1 year ago

@JamesNK upgraded connections should be a dimension on current-requests.

BrennanConroy commented 1 year ago

> current-upgraded-requests - are WebSockets the only scenario that can upgrade a request?

Technically no, upgrades can be used for other things. In reality, websockets are the only real usage though.

JamesNK commented 1 year ago

> @JamesNK upgraded connections should be a dimension on current-requests.

There is overhead in tags. Too many create a cardinality explosion and also add to the traffic cost of sending/receiving them. I've tried to limit tags to the core metadata.

Is that information covered in the current-upgraded-requests counter?

JamesNK commented 1 year ago

Updated:

- Improved counter descriptions.
- Durations are now seconds.
- Exception name is now the full type name.
- Added transport to connection-duration for Microsoft.AspNetCore.Http.Connections.

amcasey commented 1 year ago

Typo: "hisogram"

JamesNK commented 1 year ago

API Review Note:

Should the HTTP protocol be on the request-duration counter? It's optional in the OTel spec. There is overhead in adding tags.
- Added protocol to request-duration.
The protocol reported by the TLS handshake is useful metadata.
- Added protocol to tls-handshake-duration.
It's useful to know the transports of SignalR current connections. Can't add transport to current-connections because transport negotiation happens after the connection is started.
- Added new current-transports counter. Has transport tag.

API Approved!

Final meters, counters, tags and their associated information are in the issue body.

JamesNK commented 1 year ago

Added in https://github.com/dotnet/aspnetcore/pull/46834

JamesNK commented 1 year ago

Updated based on changes in https://github.com/dotnet/aspnetcore/issues/48536.

dotnet / aspnetcore

ASP.NET Core metrics #47536
