Closed JamesNK closed 1 year ago
Thank you for submitting this for API review. This will be reviewed by @dotnet/aspnet-api-review at the next meeting of the ASP.NET Core API Review group. Please ensure you take a look at the API review process documentation and ensure that:
@Tratcher @davidfowl @noahfalk @tarekgh @BrennanConroy @samsp-msft
Metrics counters in ASP.NET Core. Covers hosting, Kestrel and SignalR. These are almost all the places counters are used today (there are also some counters in ConcurrencyLimiter, but it's probably being replaced by rate limiting).
It is nice to see that we'd have ALOT less counters! How does one look at tls handshake failures? Are we going to do anything to put exception information into metric dimensions?
Good question. I forgot exception-name
tag on tls-handshake-duration
. Updated the list. I chose the exception name because it has a low cardinality.
Exception middleware can add an exception-name
tag to request-duration
using the feature. That matches R9 HTTP metering middleware functionality. Since we're starting to review these names and tags I'll go ahead and add it now.
@JamesNK regarding the description "Measures the number of concurrent HTTP requests that are currently in-flight.", the words "measure/count/record" have different meaning in the metrics domain, this document gives some context https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#instrument-selection.
You may see a performance penalty if the conditionally present attributes cause you to frequently alternate what tags are being provided to the same instrument. The attribute list is defined in the OpenTelemetry spec as an unordered list which requires them to be sorted into a canonical ordering before they can be aggregated. I don't recall if the OTel aggregator did something similar but the aggregator for MetricsEventSource (powering dotnet-counters/dotnet-monitor) has a fast path if the number and order of attributes exactly matches what was given last time. Adding or removing an attribute dynamically makes you miss that fast path and triggers a re-sort. I don't know if the cost rises to the level you will care about it but just something to be aware of if you are benchmarking.
Is there a reason that some places have endpoint as a single tag and others have host+port as two separate tags?
(I'm still poking around, but time for bed, I'll keep looking at it more tomorrow)
Is there a reason that some places have endpoint as a single tag and others have host+port as two separate tags?
An HTTP request always has a host and port. The HTTP request counters have them as separate tags.
Meanwhile, a connection has a local endpoint. That could be a DnsEndPoint
, which is a host+port. Or it could be a named pipe endpoint with a named pipe path. For a connection counter, we call ToString
on the endpoint and use that as the tag. In the case of DnsEndPoint
it will be host:port
. In the case of NamedPipeEndPoint
it will be \\.\pipe\pipename
.
Now that you've brought it up, if we go ahead with ConnectionContext.LocalEndPoint.ToString()
as a tag value, we'll need to ensure common endpoints cache the ToString value. Don't want to allocate a new string for each measurement.
An HTTP request always has a host and port. The HTTP request counters have them as separate tags
Makes sense.
I don't think any of this stuff below changes your plan - its just me thinking aloud because this is our first foray into using Meters and I'm trying to give it due diligence.
I started pondering a little more what this may do to the back-compat story. For tools that explicitly use EventCounters or Meters there is no issue because the APIs for both work fine SxS. For dotnet-counters I tried to avoid having users specify it explicitly by listening to both EventCounters and Meters using the same name. If both respond back (as they would for Microsoft.AspNetCore.Hosting) then the tool is written to automatically prefer using the Meter data and ignoring the EventCounter data. This means users will see a change in behavior by default and I think that is probably OK, but we can give users a way to explicitly opt for the back-compatible EventCounter data if they want it. I expect dotnet-monitor may need something similar. https://github.com/dotnet/diagnostics/issues/3805
We'll also need to figure out how we document our counters now that we are using both APIs. Ideally I'm hoping not to unnecessarily force users to be aware of which API was used to generate metrics data but I think in some cases it is going to be unavoidable. For example I'm hoping we can avoid having different top-level pages for well-known EventCounters vs. well-known Meters but we will probably need to annotate each provider so that users can get that info if they need it.
For the histogram counters the MetricsEventSource currently has pretty low default limits on the number of histograms it will track because each one is quite memory hungry. In the short term people may hit those limits quickly in dotnet-counters/dotnet-monitor. For example the intersection of route+status_code attributes I could easily imagine producing 1000s of combinations in non-trivial apps. We can add options to either reduce memory usage and raise the default limits or to let users more precisely target which dimension values are interesting to them. I don't know whether that work would happen in .NET 8 or not but I think it makes more sense to assume this problem will be alleviated sooner or later rather than to restrict your design based on a point in time constraint.
The metrics themselves seemed pretty reasonable to me. I noticed there might be some small gaps relative to what existed with EventCounters, for example no 'Total Connections Timed Out', but I don't think exact parity is a requirement or goal here. The duration histogram does let people get at how many connections timed-out during each measurement interval and I assume that is the value they would care about far more than the running total since the process started. Worst case adding new metrics in the future based on customer feedback shouldn't be hard and it is better than cluttering the list with things most users won't care about.
I noticed there might be some small gaps relative to what existed with EventCounters, for example no 'Total Connections Timed Out', but I don't think exact parity is a requirement or goal here.
I didn't add explicit counters in situations where the number can be figured out from another counter. For example "Total Connections Timed Out" can be calculated by using the number of items recorded by the connection duration counter that have a status=Timeout
tag.
For dotnet-counters I tried to avoid having users specify it explicitly by listening to both EventCounters and Meters using the same name.
By the way, Microsoft.AspNetCore.Server.Kestrel
meter name is different from the event source name. The event source name used dashes for some reason: Microsoft-AspNetCore-Server-Kestrel
. I changed it to be consistent for metrics.
The dashes thing is such a pain. I’m Hoping we get to have a clean consistent slate for metrics and event counters can eventually be considered legacy
have a clean consistent slate for metrics and event counters can eventually be considered legacy
+1
For example "Total Connections Timed Out" can be calculated by using the number of items recorded by the connection duration counter that have a status=Timeout tag.
I don't think it is an issue and I'm not suggesting you change it, but the number you would get from the histogram would be the number of timed out items during the last measurement interval whereas the original "Total Connections Timed Out" looks like it would have been the running total since the process started. If you have access to all measurement intervals since the app started you could always sum them up, but its possible you won't have access to all the historical measurement data or the added complexity of doing that summation discourages you from doing it. Still, I'm not worried because I doubt the running total was what users would have cared about. I assume it was the rate of change in the total that was important and that they can get readily from the histogram.
By the way, Microsoft.AspNetCore.Server.Kestrel meter name is different from the event source name. The event source name used dashes for some reason: Microsoft-AspNetCore-Server-Kestrel.
Yeah I saw that and I agree that changing it to dots is a good move. Historically it was always convention to name EventSources with dashes and then somewhere around .NET Core 3.0 Vance wanted to change the naming convention and start using dots instead. I assume Microsoft-AspNetCore-Server-Kestrel got named using the older naming convention rather than the newer one but I don't know any of the specifics about exactly why. Perhaps just confusion over which convention to follow when there is more than one.
Hoping we get to have a clean consistent slate for metrics and event counters can eventually be considered legacy
Yep, I think that is the path we are on! We just need to continue adding Meter support to other portions of the stack. In .NET 8 dotnet-monitor should support Meters well so the only significant Microsoft tool I am aware of that still supports EventCounters but not Meters is the AppInsights SDK. As folks move to the newer OTel SDK that gap should diminish in importance.
current-upgraded-requests
- is the only scenario that can upgrade a request websockets? Should we be explicit and name this for web sockets, so it doesn't become a problem later if there are other types of upgrades?
current-connections
- how does http/3 count here? For H1/2 the TCP connection is the key, for H3 they are more virtual correct?
current-upgraded-requests
- is the only scenario that can upgrade a request websockets?
I'm not familiar with web sockets. @Tratcher @BrennanConroy?
current-connections
- how does http/3 count here? For H1/2 the TCP connection is the key, for H3 they are more virtual correct?
QUIC still has the concept of a connection - https://http3-explained.haxx.se/en/quic/quic-connections. The connection counter tracks that. Multiplexed streams on a connection aren't counted. They're tracked as HTTP requests.
@JamesNK upgraded connections should be a dimension on current-requests.
current-upgraded-requests - is the only scenario that can upgrade a request websockets?
Technically no, upgrades can be used for other things. In reality, websockets are the only real usage though.
@JamesNK upgraded connections should be a dimension on current-requests.
There is overhead in tags. Too many create a cardinality explosion and also adds to the traffic cost of sending/receiving them. I've tried to limit tags to the core metadata.
Is that information covered in the current-upgraded-requests
counter?
Updated:
transport
to connection-duration
for Microsoft.AspNetCore.Http.Connections
.Typo: "hisogram"
API Review Note:
request-duration
counter? It's option in OTel spec. There is overhead in adding tags.
protocol
to request-duration
.protocol
to tls-handshake-duration
.current-connections
because transport negotiation happens after the connection is started.
current-transports
counter. Has transport
tag.API Approved!
Final meters, counters, tags and their associated information is in the issue body.
Updated based on changes in https://github.com/dotnet/aspnetcore/issues/48536.
Background and Motivation
ASP.NET Core has event counters. In .NET 8 we want to add metrics counters. These will sit side-by-side with event counters for backward compatibility.
Metrics counters add new features (histograms, tags) that allow data to be represented by fewer counters. For example, there are event counters in hosting for
total-requests
andfailed-requests
counters. One metrics counter can represent these with a tag to represent the status.Proposed API
Microsoft.AspNetCore.Hosting
Notes: HTTP counters and tags here follow OTel's lead: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md#http-server
http-server-current-requests
http-server-current-requests
{request}
method
GET
;POST
;HEAD
scheme
http
;https
host
localhost
port
8080
http-server-request-duration
http-server-request-duration
s
scheme
http
;https
method
GET
;POST
;HEAD
status-code
200
protocol
HTTP/1.1
;HTTP/2
;HTTP/3
host
localhost
port
8080
route
{controller}/{action}/{id?}
exception-name
ExceptionHandlerMiddleware
orDeveloperExceptionPageMiddleware
.System.OperationCanceledException
IHttpMetricsTagsFeature
.organization
=contoso
Microsoft.AspNetCore.Server.Kestrel
Notes: All Kestrel counters include the endpoint as a tag.
kestrel-current-connections
kestrel-current-connections
{connection}
endpoint
localhost:8080
kestrel-connection-duration
kestrel-connection-duration
s
endpoint
localhost:8080
exception-name
IConnectionMetricsTagsFeature
.organization
=contoso
kestrel-rejected-connections
kestrel-rejected-connections
{connection}
endpoint
localhost:8080
kestrel-queued-connections
kestrel-queued-connections
{connection}
endpoint
localhost:8080
kestrel-queued-requests
kestrel-queued-requests
{request}
endpoint
localhost:8080
kestrel-current-upgraded-connections
kestrel-current-upgraded-connections
{request}
endpoint
localhost:8080
kestrel-tls-handshake-duration
kestrel-tls-handshake-duration
{s}
endpoint
localhost:8080
protocol
Tls10
;Tls11
;Tls12
;Tls13
exception-name
System.OperationCanceledException
kestrel-current-tls-handshakes
kestrel-current-tls-handshakes
{handshake}
endpoint
localhost:8080
Microsoft.AspNetCore.Http.Connections
Notes: Timed out connection counter is merged into connection-duration counter. I'm unaware of an official connection closed status, so I invented one with some values. @BrennanConroy It would be good if you could help with better end statuses.
signalr-http-transport-current-connections
signalr-http-transport-current-connections
{connection}
signalr-http-transport-current-transports
signalr-http-transport-current-transports
{transport}
transport
None
;WebSockets
;ServerSentEvents
;LongPolling
Update: REMOVED
signalr-http-transport-connection-duration
signalr-http-transport-connection-duration
s
status
NormalClosure
;Timeout
;AppShutdown
transport
None
;WebSockets
;ServerSentEvents
;LongPolling
Usage Examples
Alternative Designs
Risks