OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

MP70: Implement MicroProfile Fault Tolerance 4.1 #27107

Open Emily-Jiang opened 11 months ago

Emily-Jiang commented 11 months ago

Description

Update MicroProfile Fault Tolerance to work with MicroProfile Telemetry metrics as well as MicroProfile Metrics.


Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

General Instructions

The process steps occur roughly in the order as presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.


Prioritization (Complete Before Development Starts)

The chief architect (OpenLiberty/chief-architect) and area leads are responsible for prioritizing the features and determining which features are being actively worked on.

Prioritization

Design preliminaries determine whether a formal design, provided as an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID. Each identified item places a blocking requirement on another team, so it must be identified early in the process. The feature owner may check off an item if they know it doesn't apply; otherwise, they should work with the focal point to determine what work, if any, is necessary and make them aware of it.

Design Preliminaries

Design

No Design

FAT Documentation

A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design-focused approach should still be applied to features, and developers should think about the feature design before writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the Design Approved or No Design Approved label, along with all other tasks outlined in the GA section.

Feature Development Begins

Legal and Translation

To avoid last-minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either during design or as early in development as possible. Similarly, translation is to be done concurrently with development. Both MUST be completed before Beta or GA is requested.

Legal (Complete before Feature Complete Date)

Innovation (Complete 1 week before Feature Complete Date)

Translation (Complete by Feature Complete Date)

In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

Beta Blog (Complete by beta eGA)

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

Design Approved Features

Remove Beta Fencing (Complete by Feature Complete Date)

GA Blog (Complete by Friday after GM)

Post GM (Complete before GA)

Post GA

benjamin-confino commented 4 months ago

Link to UFO: https://ibm.ent.box.com/file/1514251223112

benjamin-confino commented 2 months ago

Link to Feature Test Summary: https://github.com/OpenLiberty/open-liberty/issues/29427

donbourne commented 2 months ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

  1. UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

  2. Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team).
    a) What problem paths were tested and demonstrated?
    b) Who did you demo to?
    c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

  3. SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered.
    a) Who conducted SVT tests for this feature?
    b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

  4. Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

  5. Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

dmuelle commented 2 months ago

ID review

nlsprops

The number of times the retry logic was run. This will always be once per method call. ---> The number of times the retry logic was run. This value is always equal to once per method call.

The number of times the timeout logic was run. This will usually be once per method call, but may be zero times if the circuit breaker prevents execution or more than once if the method is retried. ---> The number of times the timeout logic was run. This value is typically equal to once per method call. However, it might be zero if the circuit breaker prevents execution or more than once per method call if the method is retried.

The number of times the circuit breaker logic was run. This will usually be once per method call, but may be more than once if the method call is retried. ---> The number of times the circuit breaker logic was run. This value is typically equal to once per method call, but might be more than once if the method call is retried.

Note that in these last two messages, one says "if the method is retried" and the next says "if the method call is retried". If these mean the same thing, I recommend using the former for both.

Amount of time the circuit breaker has spent in each state. ---> Amount of time the circuit breaker spent in each state.

Number of times the circuit breaker has moved from closed state to open state. ---> Number of times the circuit breaker moved from closed state to open state.

The number of times the bulkhead logic was run. This will usually be once per method call, but may be zero times if the circuit breaker prevented execution or more than once if the method call is retried. ---> The number of times the bulkhead logic was run. This value is typically equal to once per method call. However, it might be zero if the circuit breaker prevents execution or more than once per method call if the method is retried.

^^ see previous note re method vs method call retried

Number of executions currently waiting in the queue. ---> Number of executions that are currently waiting in the queue.
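
The run counts described in the strings above can be illustrated with a toy model. This is only a sketch, not Open Liberty code: the class, field, and method names are hypothetical, and it assumes the strategies nest as retry around circuit breaker around timeout around bulkhead, which is what the "once per method call", "zero", and "more than once" wording implies.

```java
// Toy model (NOT Open Liberty code) of how often each fault tolerance
// strategy runs for a single method call. All names are hypothetical.
// Assumed nesting: Retry -> CircuitBreaker -> Timeout -> Bulkhead -> method.
final class FtCallCountSketch {
    int retryRuns;
    int circuitBreakerRuns;
    int timeoutRuns;
    int bulkheadRuns;

    // When true, the circuit breaker prevents execution of the attempt.
    boolean circuitOpen;

    /** Simulates one method call; every attempt fails when failing is true. */
    boolean call(int maxRetries, boolean failing) {
        retryRuns++; // retry logic wraps everything: once per method call
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            circuitBreakerRuns++; // once per attempt, so more than once if retried
            if (circuitOpen) {
                continue; // execution prevented: timeout and bulkhead do not run
            }
            timeoutRuns++;
            bulkheadRuns++;
            if (!failing) {
                return true;
            }
        }
        return false;
    }

    String summary() {
        return String.format("retry=%d circuitBreaker=%d timeout=%d bulkhead=%d",
                retryRuns, circuitBreakerRuns, timeoutRuns, bulkheadRuns);
    }

    public static void main(String[] args) {
        FtCallCountSketch success = new FtCallCountSketch();
        success.call(2, false);
        System.out.println(success.summary());
        // prints: retry=1 circuitBreaker=1 timeout=1 bulkhead=1

        FtCallCountSketch retried = new FtCallCountSketch();
        retried.call(2, true);
        System.out.println(retried.summary());
        // prints: retry=1 circuitBreaker=3 timeout=3 bulkhead=3

        FtCallCountSketch open = new FtCallCountSketch();
        open.circuitOpen = true;
        open.call(0, false);
        System.out.println(open.summary());
        // prints: retry=1 circuitBreaker=1 timeout=0 bulkhead=0
    }
}
```

In this model, "the method is retried" and "the method call is retried" describe the same event, a further pass through the loop, which is consistent with using one phrasing for both messages.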

yasmin-aumeeruddy commented 1 month ago

@OpenLiberty/ste-approvers The STE slides are here: https://ibm.ent.box.com/file/1656122968583

benjamin-confino commented 1 month ago

https://github.com/OpenLiberty/open-liberty/pull/29662 has the ID requested changes

tngiang73 commented 1 month ago

@benjamin-confino : STE looks good. Thanks.

chirp1 commented 1 month ago

Per a Slack conversation with David Mueller and Benjamin Confino, the docs for this epic are complete and on draft. Approving the epic.

benjamin-confino commented 3 weeks ago

https://github.com/OpenLiberty/docs/pull/7611 has not been delivered yet; it is waiting until closer to release. It updates the Liberty docs to include mpTelemetry FT metrics.

NottyCode commented 2 weeks ago

@benjamin-confino the UFO link is private, please update it to be public.

Emily-Jiang commented 2 weeks ago

@benjamin-confino the UFO link is private, please update it to be public.

It was fixed. Sorry about this.

benjamin-confino commented 1 week ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

This new code is entirely glue code between two existing features (mpFaultTolerance and mpTelemetry). Enable both, and this feature will automatically update and start moving data between them. Therefore, there are no problems that customers can diagnose and fix themselves beyond those already covered in mpFaultTolerance-4.0 and mpTelemetry-2.0.

Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team). a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

N/A

SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered.
a) Who conducted SVT tests for this feature?
b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

a) Brian Hanczaryk
b) Yes, SVT agrees that the serviceability of any problem encountered was sufficient to avoid PMRs or L2 should be able to quickly address those problems without engaging L3.

Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

WL3, CDI

Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

This feature does not emit anything; however, it provides metrics to OpenTelemetry, which OpenTelemetry then exports. The PR to update the Metrics reference list is here: https://github.com/OpenLiberty/docs/pull/7611

@donbourne