OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

LG-334: Add HTTP metrics to monitor-1.0, mpMetrics and OpenTelemetry #20985

Closed. donbourne closed this issue 2 hours ago.

donbourne commented 2 years ago

Description

Open Liberty needs HTTP metrics to allow for:

For mpMetrics, metric naming should be based on the OpenTelemetry semantic conventions. Note that work on MP Metrics 5.0 is in progress, and that specification could influence the naming of HTTP metrics. See https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/http-metrics/#http-server
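
The naming rule can be made concrete with a minimal Java sketch. It assumes the current stable OpenTelemetry HTTP semantic conventions: a histogram named http.server.request.duration, measured in seconds, with attributes such as http.request.method, url.scheme, http.route and http.response.status_code. The class HttpServerAttributes and its methods are hypothetical. This is not Open Liberty's implementation, and the naming that mpMetrics ultimately adopts may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not Open Liberty code): builds the attributes recorded with
// each observation of the http.server.request.duration histogram, using attribute
// keys from the stable OpenTelemetry HTTP semantic conventions.
final class HttpServerAttributes {
    // Metric name used by the stable conventions; its unit is seconds.
    static final String METRIC_NAME = "http.server.request.duration";

    // Methods from RFC 9110 plus PATCH. Matching is case-sensitive, and anything else
    // is reported as "_OTHER" so arbitrary client-supplied methods cannot create new series.
    private static final Set<String> KNOWN_METHODS = Set.of(
            "CONNECT", "DELETE", "GET", "HEAD", "OPTIONS",
            "PATCH", "POST", "PUT", "TRACE");

    private HttpServerAttributes() {}

    static String normalizeMethod(String method) {
        return method != null && KNOWN_METHODS.contains(method) ? method : "_OTHER";
    }

    // route must already be a URL template such as "/users/{id}", never the raw path.
    static Map<String, Object> of(String method, String scheme, String route, int status) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("http.request.method", normalizeMethod(method));
        attrs.put("url.scheme", scheme);
        attrs.put("http.route", route);
        attrs.put("http.response.status_code", status);
        return attrs;
    }
}
```

For example, `HttpServerAttributes.of("GET", "https", "/users/{id}", 200)` yields four attributes, and a request using an unrecognized method is folded into the single `_OTHER` value.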

Metrics for HTTP MUST avoid cardinality explosion. To that end, URL templates (for example, /users/{id} rather than /users/42) are required, so that metrics remain meaningful and the number of distinct metric names stays bounded.
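
One way to picture the URL-template requirement is the Java sketch below. It maps a concrete path onto the template it matches. It assumes the application's templates are known up front as a list of strings. In practice, the servlet or RESTful framework that routed the request already knows the matched template, so RouteTemplates is hypothetical and purely illustrative.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch: map a concrete request path (for example /users/42) onto the
// URL template it matches (for example /users/{id}), so that metrics are keyed by a
// bounded set of templates instead of an unbounded set of raw paths.
final class RouteTemplates {
    private record Template(String template, Pattern pattern) {}

    private final List<Template> templates;

    // Templates are tried in order, so list literal routes ("/users/me")
    // before parameterized ones ("/users/{id}").
    RouteTemplates(List<String> templateStrings) {
        this.templates = templateStrings.stream()
                .map(t -> new Template(t, compile(t)))
                .toList();
    }

    // Turns "/users/{id}/orders" into a regex where each {param} segment matches
    // exactly one non-empty path segment and every other segment is literal.
    private static Pattern compile(String template) {
        String[] segments = template.split("/", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                regex.append('/');
            }
            String segment = segments[i];
            if (segment.startsWith("{") && segment.endsWith("}")) {
                regex.append("[^/]+");
            } else {
                regex.append(Pattern.quote(segment));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the matching template, or empty if none matches. In that case the
    // caller should omit the route attribute rather than fall back to the raw path.
    Optional<String> routeFor(String path) {
        for (Template t : templates) {
            if (t.pattern().matcher(path).matches()) {
                return Optional.of(t.template());
            }
        }
        return Optional.empty();
    }
}
```

With the templates `/users/me`, `/users/{id}` and `/users/{id}/orders`, any number of distinct user IDs collapses into at most three route values.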

See also which metrics Spring includes for HTTP: https://www.baeldung.com/micrometer


Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

General Instructions

The process steps occur roughly in the order presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.


Prioritization (Complete Before Development Starts)

The OpenLiberty/chief-architect and area leads are responsible for prioritizing features and determining which features are actively being worked on.

Prioritization

Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID.

Design Preliminaries

Design

No Design

FAT Documentation

A feature must be prioritized and socialized (or have the "No Design Approved" label) before any implementation work begins; this is also the minimum requirement before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete, either by marking it kind=noship or by beta fencing it.
Code may not GA until this feature has obtained the "Design Approved" or "No Design Approved" label, along with all other tasks outlined in the GA section.

Feature Development Begins

Legal and Translation

To avoid last-minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either during design or early in development. Similarly, translation must be done concurrently with development. Both MUST be completed before Beta or GA is requested.

Legal (Complete before Feature Complete Date)

Translation (Complete 1 week before Feature Complete Date)

To facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

Beta Blog (Complete 1.5 weeks before beta eGA)

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

Design Approved Features

Remove Beta Fencing (Complete by Feature Complete Date)

GA Blog (Complete by Feature Complete Date)

Post GA

scottkurz commented 5 months ago

2024-03-22 - Comments from UFO review meeting:

  1. slide 3 - add required doc to Communication slide (Fixed)
  2. slide 9 - Problem Statement - add something about wanting to support the OpenTelemetry HTTP convention/spec (since there seemed to be a lot of questions about the difference between this new work and the existing servlet/REST monitoring and metrics) (Fixed)
  3. slide 15 - diagram mentions "JAX-RS" ...should be "Restful" (Fixed)
  4. slide 28 - Maybe better differentiate, on the slide, the two new attribute substring value possibilities, "HTTP" and "HTTPServerRequest". (It was a bit confusing, especially the part explaining that the "HTTP" value doesn't do anything additional today. Maybe group/format the possibilities separately instead of in a single bullet list on the page.) (Fixed - the original slide is now slide 29, and a new slide 30 addresses this. Switched to using just "HTTP"; if more MBeans are introduced in the future, more specific names can be introduced then to enable specific MBeans.)
  5. slide 26 - Use io.openliberty.* as the package prefix for the new API class HttpServerRequestMXBean (Fixed - original slide is now slide 27)
  6. slide 29 - Talk to Gilbert about perhaps updating the guide(s). (Discussed; guides will not be updated)
  7. Performance concerns: see the chat history and discussion at the end. Maybe this needs a design issue? (Performance testing already executed)

Related to item 7: this issue investigates the performance impact of overlapping filters and looks at an alternative design. (Design issue discussed here: https://github.com/OpenLiberty/open-liberty/issues/28098)
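
Item 5 of the review comments places the new HttpServerRequestMXBean API under io.openliberty.*. Monitor-style MXBeans are usually consumed over JMX. The Java sketch below shows a generic lookup using the standard javax.management API. The ObjectName pattern `WebSphere:type=HttpServerStats,*` is an assumption for illustration and is not confirmed by this thread.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper: lists the HTTP stats MBeans registered in the platform MBean server.
final class HttpStatsMBeans {
    // ASSUMPTION: illustrative ObjectName pattern; verify the real domain and keys
    // against the Open Liberty JMX metrics reference before relying on it.
    static final String PATTERN = "WebSphere:type=HttpServerStats,*";

    private HttpStatsMBeans() {}

    static Set<ObjectName> find() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            // queryNames accepts wildcard patterns; a null QueryExp applies no extra filter.
            return server.queryNames(new ObjectName(PATTERN), null);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Bad ObjectName pattern: " + PATTERN, e);
        }
    }
}
```

Outside a Liberty server running monitor-1.0, `find()` simply returns an empty set.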

donbourne commented 5 months ago

@Channyboy , UFO link is broken...perhaps forgot to set the link expiry?

NottyCode commented 1 month ago

@Channyboy I have a few questions before approving:

Channyboy commented 1 month ago

@NottyCode

Will this only capture http stats for Jakarta EE 9+ or will it also work for Java EE 7+?

Currently supports JEE10+; backport for EE9 and earlier was intended for later. (Updated page 16)

The UFO indicates that an existing problem is that we only get metrics for some stats, not for all HTTP, but later it talks about doing metrics for servlets, which could read as if this will only work for servlets. Talking to @donbourne, he suggests this will capture all HTTP requests, but I don't think that is clear in the UFO.

The feature relies on the servlet engine and the use of servlet filters to capture the necessary information. For Jakarta EE 10+, this covers any feature that uses io.openliberty.servlet.internal-6.0 or io.openliberty.servlet.internal-6.1, such as pages-3.1, restfulWS-3.1, servlet-6.0, and xmlWS-4.0: if a request is captured by the servlet filter, HTTP metrics are reported for it. (Updated page 16)
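
The capture mechanism described above is a filter that sees every request passing through the servlet engine. It can be sketched without the servlet API. This Java example is a hypothetical stand-in: Request, Response and Chain only mimic the shape of jakarta.servlet.Filter.doFilter. The per-key counters are an illustration, not how Open Liberty stores its stats.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical, servlet-free sketch of the filter pattern: the filter wraps the rest
// of the chain, times it, and records the outcome even when the chain throws.
// Request, Response and Chain are stand-ins, not jakarta.servlet types.
final class TimingFilterSketch {
    record Request(String method, String route) {}

    static final class Response {
        int status = 200;
    }

    interface Chain {
        void proceed(Request request, Response response);
    }

    // Keyed by "method route status": request count and total elapsed nanoseconds.
    final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    void doFilter(Request request, Response response, Chain chain) {
        long start = System.nanoTime();
        try {
            chain.proceed(request, response);
        } catch (RuntimeException e) {
            // Assumption: an unhandled exception ends up as a 500 response.
            response.status = 500;
            throw e;
        } finally {
            long elapsed = System.nanoTime() - start;
            String key = request.method() + " " + request.route() + " " + response.status;
            counts.computeIfAbsent(key, k -> new LongAdder()).increment();
            totalNanos.computeIfAbsent(key, k -> new LongAdder()).add(elapsed);
        }
    }
}
```

Recording in `finally` is the important design choice. A request that fails inside the application is still counted, which means requests that fail are never missing from the metrics.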

slide 21

Typo fixed. These will be the same values. network.protocol.name is conditionally required only if network.protocol.version is set and the name is not http. Here it will always be http, so we can remove this attribute.
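
The conditional rule for network.protocol.name fits in a few lines of Java. This is a minimal sketch assuming lowercase protocol names, as used in the OpenTelemetry conventions. ProtocolAttributes is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the conditional rule for the network.protocol.*
// attributes of OpenTelemetry HTTP server metrics.
final class ProtocolAttributes {
    private ProtocolAttributes() {}

    static Map<String, String> of(String protocolName, String protocolVersion) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (protocolVersion != null) {
            attrs.put("network.protocol.version", protocolVersion);
            // The name is only required when a version is set and the protocol is not
            // HTTP; for an HTTP server it is always "http", so it can be left out.
            if (protocolName != null && !"http".equals(protocolName)) {
                attrs.put("network.protocol.name", protocolName);
            }
        }
        return attrs;
    }
}
```

For an HTTP/1.1 request, only `network.protocol.version=1.1` is emitted, which matches the decision above to drop the name attribute.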

donbourne commented 1 month ago

Currently supports JEE10+; backport for EE9 and earlier was intended for later. (Updated page 16)

Discussed with @Channyboy that we need this capability back to EE7.

Channyboy commented 1 month ago

@NottyCode UFO updated on page 16 to mention support back to EE7 for mpTelemetry-2.0

dave-waddling commented 4 weeks ago

Sorry, I was a little keen adding the FAT Focal approval! FTS passed review, but the testing still needs to go through the mini-SOE early next week.

donbourne commented 3 weeks ago

OL:

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

  1. UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

  2. Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team).
    a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

  3. SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

  4. Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

  5. Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

chirp1 commented 3 weeks ago

Approving. David Mueller indicated that he has/will have the info that he needs to make the doc updates.

donbourne commented 3 weeks ago

@clarkek123 will be handling the serviceability approval for this epic.

dave-waddling commented 3 weeks ago

Thanks for completing the FTS. The results from the mini-SOE are good, so adding FAT Focal approval.

jdmcclur commented 3 weeks ago

Approving performance while noting some work may need to be done in the future: https://github.com/OpenLiberty/open-liberty/issues/29396.

Channyboy commented 3 weeks ago

@clarkek123 Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

There are no common error scenarios expected, as listed in the UFO. However, a customer may encounter performance problems or concerns, because the underlying monitor-1.0 feature enables all runtime components to create stats/metrics. The customer can customize this by using the filter attribute of the monitor-1.0 configuration element (<monitor>) to explicitly enable only the runtime components they want (for example, only HTTP). This configuration belongs to monitor-1.0 and is separate from this feature. This feature relies on the stats that monitor-1.0 makes available and creates metrics from the provided information (that is, it synchronizes with and reads the available data).
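
As a sketch of that customization, the following server.xml fragment restricts monitor-1.0 to the HTTP component. It assumes the "HTTP" filter value chosen in the UFO review and a server using mpTelemetry-2.0. Listing monitor-1.0 explicitly is shown for clarity and may not be required. Check the monitor-1.0 documentation for the exact filter values.

```xml
<server>
    <featureManager>
        <feature>mpTelemetry-2.0</feature>
        <!-- Listed explicitly for clarity; assumption: it may already be enabled for you -->
        <feature>monitor-1.0</feature>
    </featureManager>

    <!-- Only the components named here create stats; "HTTP" limits this to HTTP stats -->
    <monitor filter="HTTP"/>
</server>
```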

Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team). a) What problem paths were tested and demonstrated?

As mentioned above, there are no common error scenarios. However, we will demo the use of the monitor-1.0 filter attribute and switching between runtime-configured and application-configured MP Telemetry behavior (behavior that is part of the parent MP Telemetry 2.0 feature).

b) Who did you demo to?

Install team.

c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

Yes.

SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature?

Dan Guinan

b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

Yes

Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

WAS L3: Metrics

Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

Yes

clarkek123 commented 3 weeks ago

@Channyboy I have reviewed the Serviceability information above and have additional questions. The UFO for HTTP Metrics Serviceability page 42 shows N/A, so perhaps there are no common error scenarios that require testing for this feature. Please confirm if that is the case.

The purpose of the Serviceability review is to ensure that common error scenarios have been tested and reviewed by a team other than the local team. For common error scenarios, the feature should provide good messages or responses so that customers can resolve the errors themselves without having to call IBM support. Common errors could be conflicts in configuration parameters, connection errors, feature configured with incompatible JakartaEE or Java level, and so on.

clarkek123 commented 3 weeks ago

I have added serviceability approval based on the information provided in the UFO and template along with the demo to the install team and serviceability review by the SVT team.