MP70: Implement MicroProfile Fault Tolerance 4.1

Emily-Jiang commented 11 months ago

Description

Update MicroProfile Fault Tolerance to work with MicroProfile Telemetry metrics as well as MicroProfile Metrics.

Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

Externally raised requests for enhancements:
- Aha: Link to the Aha idea (also add a link to this issue in the Aha idea)
- Feature owner adds label Aha idea
- Open Liberty Feature Request: Link to the OL GH issue (also add a link to this issue in the Feature Request GH issue)
- Feature owner adds label Requested feature
UFO: Link to Fault Tolerance 4.1 UFO
FTS: https://github.com/OpenLiberty/open-liberty/issues/29427
Beta Blog: https://github.com/OpenLiberty/open-liberty/issues/29122
GA Blog: Link to GA Blog Post GH Issue
Process Overview
Prioritization
Design
Implementation
Legal and Translation
Beta
GA
- Focal Point Approvals
Other Deliverables

General Instructions

The process steps occur roughly in the order as presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.

Prioritization (Complete Before Development Starts)

The (OpenLiberty/chief-architect) and area leads are responsible for prioritizing the features and determining which features are being actively worked on.

Prioritization

[x] Feature added to the "New" column of the Open Liberty project board
- Epics can be added to the board in one of two ways:
- From this issue, use the "Projects" section to select the appropriate project board.
- From the appropriate project board click "Add card" and select your Feature Epic issue
[x] Priority assigned
- Attend the Liberty Backlog Prioritization meeting
Design (Complete Before Development Starts)

Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID. Furthermore, each identified item places a blocking requirement on another team so it must be identified early in the process. The feature owner may check-off the item if they know it doesn't apply, but otherwise they should work with the focal point to determine what work, if any, will be necessary and make them aware of it.

Design Preliminaries

[x] UI requirements identified, or N/A. (Feature owner and UI focal point)
[x] Accessibility requirements identified, or N/A. (Feature owner and Accessibility focal point)
[x] ID requirements identified, or N/A. (Feature owner and ID focal point)
- Refer to Documenting Open Liberty.
- Feature Owner adds label ID Required, if non-trivial documentation needs to be created by the ID team.
- ID adds label ID Required - Trivial, if no design will be performed and only trivial ID updates are needed.
[x] Serviceability requirements identified, or N/A. (Feature owner and Serviceability focal point)
[x] SVT requirements identified, or N/A. (Feature owner and SVT focal point)
[x] Performance testing requirements identified, or N/A. (Feature owner and Performance focal point)

Design

[x] POC Design / UFO review requested.
- Feature owner adds label Design Review Request
[x] POC Design / UFO review scheduled.
- Follow the instructions in POC-Forum repo
[x] POC Design / UFO review completed.
[x] POC / UFO Review follow-ons completed.
[x] POC Design / UFO approval requested.
- Feature owner adds label Design Approval Request
[x] Design / UFO approved. (OpenLiberty/chief-architect) or N/A
- (OpenLiberty/chief-architect) adds label Design Approved
- Add the public link to the UFO in Box to the Documents section.
- The UFO must always accurately reflect the final implementation of the feature. Any changes must be first approved. Afterwards, update the UFO by creating a copy of the original approved slide(s) at the end of the deck and prepend "OLD" to the title(s). A single updated copy of the slide(s) should take the original's place, and have its title(s) prepended with "UPDATED".

No Design

[ ] No Design requested.
- Feature owner adds label No Design Approval Request
[ ] No Design / No UFO approved. (OpenLiberty/chief-architect) or N/A
- Approver adds label No Design Approved
[ ] Feature / Capability stabilization or discontinuation or N/A
- Feature owner adds label Product Management Approval Request and notifies OpenLiberty/product-management
- Approver adds label Product Management Approved (OpenLiberty/product-management)
- Note: For stabilized, superseded, and discontinued feature/capability, skip the Beta section of the template (you may delete it). Otherwise, proceed as normal.

FAT Documentation

[x] "Feature Test Summary" child task created
- Use the Feature Test Summary Template
- Add FTS issue link to the Documents section.
Implementation

A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design focused approach should still be applied to features, and developers should think about the feature design prior to writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the Design Approved or No Design Approved label, along with all other tasks outlined in the GA section.

Feature Development Begins

[x] Add the In Progress label

Legal and Translation

In order to avoid last minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either in design or as early into the development as possible. Similarly, translation is to be done concurrently with development. Both MUST be completed before Beta or GA is requested.

Legal (Complete before Feature Complete Date)

[ ] Changed or new open source libraries are cleared and approved, or N/A. (Legal Release Services/Cass Tucker/Release PM).

Innovation (Complete 1 week before Feature Complete Date)

[ ] Consider whether any aspects of the feature may be patentable. If any identified, disclosures have been submitted.

Translation (Complete by Feature Complete Date)

[ ] PII (Program Integrated Information) updates are merged (i.e. all English strings due for translation have been delivered), or N/A.
Beta

In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

[x] Beta fence the functionality
- E.g. kind=beta, ibm:beta, ProductInfo.getBetaEdition()
[x] Beta development complete and feature ready for inclusion in a beta release
- Add label target:beta and the appropriate target:YY00X-beta (where YY00X is the targeted beta version).
[x] Feature delivered into beta
- (OpenLiberty/release-manager) adds label release:YY00X-beta (where YY00X is the first beta version that included the functionality).

Beta Blog (Complete by beta eGA)

[ ] Beta blog issue created and populated using the Open Liberty BETA blog post template.
- Add a link to the beta blog issue in the Documents section.
- Note: This is for inclusion into the overall beta release blog post. If, in addition, you'd also like to create a dedicated blog post about your feature, then follow the "Standalone Feature Blog Post" instructions under the Other Deliverables section.
GA

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

[x] Feature implementation and tests completed.
- [x] All PRs are merged.
- [x] All related/child issues are closed.
- [x] All stop ship issues are completed.
[x] Legal: all necessary approvals granted.
[x] Translation: Feature may only proceed to GA if it has either Translation - Complete or Translation - Missing label
- If all translation has been delivered to release branch, feature owner adds label Translation - Complete.
- If missing translation does not cause a break in functionality, nor a security or production outage risk, feature owner adds label Translation - Missing.
- Once all missing translations are delivered, the Translation - Missing label is replaced with Translation - Complete.
- If missing translation could cause a break in functionality or a security or production outage risk, feature owner adds the Translation - Blocked label.
- Features with Translation - Blocked may NOT proceed to GA until the label has been replaced with either Translation - Missing or Translation - Complete.
- For further guidance, contact Globalization focal point or the Release Architect.
[x] GA development complete and feature ready for inclusion in a GA release
- Add label target:ga and the appropriate target:YY00X (where YY00X is the targeted GA version).
- Inclusion in a release requires the completion of all Focal Point Approvals.

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

[x] APIs/Externals - Externals have been reviewed or N/A. (OpenLiberty/externals-approvers)
- Approver adds label focalApproved:externals
[x] Demo - Demo is scheduled for an upcoming EOI or N/A. (OpenLiberty/demo-approvers)
- Add comment @OpenLiberty/demo-approvers Demo scheduled for EOI [Iteration Number] to this issue.
- Approver adds label focalApproved:demo.
[x] FAT - All Tests complete and running successfully in SOE or N/A. (OpenLiberty/fat-approvers)
- Approver adds label focalApproved:fat.

Design Approved Features

[x] ID - Documentation is complete or N/A. (OpenLiberty/id-approvers)
- Approver adds label focalApproved:id.
- NOTE: If only trivial documentation changes are required, you may reach out to the ID Feature Focal to request an ID Required - Trivial label. Unlike features with a regular ID requirement, those with the ID Required - Trivial label do not have a hard requirement for a Design/UFO.
[x] InstantOn - InstantOn capable or N/A. (OpenLiberty/instantOn-approvers)
- Approver adds label focalApproved:instantOn.
[x] Performance - Performance testing is complete or N/A. (OpenLiberty/performance-approvers)
- Approver adds label focalApproved:performance.
[x] Serviceability - Serviceability has been addressed or N/A. (OpenLiberty/serviceability-approvers)
- Approver adds label focalApproved:sve.
[x] STE - Skills Transfer Education chart deck is complete or N/A. (OpenLiberty/ste-approvers)
- Approver adds label focalApproved:ste.
[x] SVT - System Verification Test is complete or N/A. (OpenLiberty/svt-approvers)
- Approver adds label focalApproved:svt.

Remove Beta Fencing (Complete by Feature Complete Date)

[ ] Beta guards are removed, or N/A
- Only after all necessary Focal Point Approvals have been granted.

GA Blog (Complete by Friday after GM)

[ ] GA Blog issue created and populated using the Open Liberty GA release blog post template.
- Add a link to the GA Blog issue in the Documents section.
- Note: This is for inclusion into the overall release blog post. If, in addition, you'd also like to create a dedicated blog post about your feature, then follow the "Standalone Feature Blog Post" instructions under the Other Deliverables section.

Post GM (Complete before GA)

[ ] After confirming this feature has been included in the GM driver, feature owner closes this issue.

Post GA

[ ] Remove the target:ga and target:YY00X labels, and add the appropriate release:YY00X. (OpenLiberty/release-manager)
Other Deliverables
[ ] Standalone Feature Blog Post - A blog post specifically about your feature or N/A. (Feature owner and OpenLiberty/release-architect)
- This should be strongly considered for larger or more prominent features.
- Follow instructions in the blogs repo.
[ ] OL Guides - OL Guides assessment is complete or N/A. (OpenLiberty/guide-assessment)
[ ] Dev Experience - Developer Experience & Tools work is complete or N/A. (OpenLiberty/dev-experience-assessment)

benjamin-confino commented 4 months ago

Link to UFO: https://ibm.ent.box.com/file/1514251223112

benjamin-confino commented 2 months ago

Link to Feature Test Summary: https://github.com/OpenLiberty/open-liberty/issues/29427

donbourne commented 2 months ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?
Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team).
a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?
SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?
Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.
Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

dmuelle commented 2 months ago

ID review

nlsprops

The number of times the retry logic was run. This will always be once per method call. ---> The number of times the retry logic was run. This value is always equal to once per method call.

The number of times the timeout logic was run. This will usually be once per method call, but may be zero times if the circuit breaker prevents execution or more than once if the method is retried. ---> The number of times the timeout logic was run. This value is typically equal to once per method call. However, it might be zero if the circuit breaker prevents execution or more than once per method call if the method is retried.

The number of times the circuit breaker logic was run. This will usually be once per method call, but may be more than once if the method call is retried. ---> The number of times the circuit breaker logic was run. This value is typically equal to once per method call, but might be more than once if the method call is retried.

Note that in these last two messages, one says "if the method is retried" and the next says "if the method call is retried". If these mean the same thing, I recommend using the former for both.

Amount of time the circuit breaker has spent in each state. ---> Amount of time the circuit breaker spent in each state.

Number of times the circuit breaker has moved from closed state to open state. ---> Number of times the circuit breaker moved from closed state to open state.

The number of times the bulkhead logic was run. This will usually be once per method call, but may be zero times if the circuit breaker prevented execution or more than once if the method call is retried. ---> The number of times the bulkhead logic was run. This value is typically equal to once per method call. However, it might be zero if the circuit breaker prevents execution or more than once per method call if the method is retried.

^^ see previous note re method vs method call retried

Number of executions currently waiting in the queue. ---> Number of executions that are currently waiting in the queue.
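
As a rough illustration of the counting semantics these descriptions rely on, here is a toy Java model. It is not Open Liberty code: the class `FtCounterSketch`, its counter fields, and the nesting order used here (retry outermost, then circuit breaker, then timeout and bulkhead) are assumptions made for the sketch. It only shows why the retry count is once per method call while the other counts can be zero or greater than one.

```java
import java.util.function.Supplier;

/** Toy model of the "logic was run" counters; hypothetical names, not a real API. */
class FtCounterSketch {
    // Hypothetical counters, one per fault tolerance strategy.
    int retryRuns, circuitBreakerRuns, timeoutRuns, bulkheadRuns;
    boolean circuitOpen;   // when true, the circuit breaker prevents execution
    final int maxRetries;  // assumed to be >= 0

    FtCounterSketch(int maxRetries) { this.maxRetries = maxRetries; }

    /** One method call: the retry logic runs once and wraps every attempt. */
    <T> T call(Supplier<T> method) {
        retryRuns++;
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return attempt(method);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    /** One attempt: circuit breaker first, then timeout and bulkhead. */
    private <T> T attempt(Supplier<T> method) {
        circuitBreakerRuns++;
        if (circuitOpen) {
            // Timeout and bulkhead logic never run for this attempt.
            throw new IllegalStateException("circuit open");
        }
        timeoutRuns++;
        bulkheadRuns++;
        return method.get();
    }

    public static void main(String[] args) {
        FtCounterSketch ft = new FtCounterSketch(2);
        int[] calls = {0};
        // The first attempt fails and the second succeeds.
        String result = ft.call(() -> {
            if (calls[0]++ == 0) throw new RuntimeException("transient failure");
            return "ok";
        });
        System.out.println(result + " retry=" + ft.retryRuns
                + " cb=" + ft.circuitBreakerRuns + " timeout=" + ft.timeoutRuns);
        // prints: ok retry=1 cb=2 timeout=2
    }
}
```

With the circuit forced open, the same model leaves the timeout and bulkhead counters at zero while the circuit breaker counter still increments on every attempt, which matches the "might be zero" wording above.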

yasmin-aumeeruddy commented 1 month ago

@OpenLiberty/ste-approvers The STE slides are here: https://ibm.ent.box.com/file/1656122968583

benjamin-confino commented 1 month ago

https://github.com/OpenLiberty/open-liberty/pull/29662 has the ID requested changes

tngiang73 commented 1 month ago

@benjamin-confino : STE looks good. Thanks.

chirp1 commented 1 month ago

Per a Slack thread with David Mueller and Benjamin Confino, the docs for this epic are complete and on draft. Approving the epic.

benjamin-confino commented 3 weeks ago

https://github.com/OpenLiberty/docs/pull/7611 has not been delivered yet; it's waiting until closer to release. It updates the Liberty docs to include mpTelemetry FT metrics.

NottyCode commented 2 weeks ago

@benjamin-confino the UFO link is private, please update it to be public.

Emily-Jiang commented 2 weeks ago

@benjamin-confino the UFO link is private, please update it to be public.

It was fixed. Sorry about this.

benjamin-confino commented 1 week ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

This new code is entirely glue code between two existing features (mpFaultTolerance and mpTelemetry). Enable both and this feature will automatically update and start moving data between them. Therefore, there are no customer problems that customers can diagnose and fix themselves beyond those already covered in mpFaultTolerance-4.0 and mpTelemetry-2.0.

Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team). a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

N/A

SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered.
a) Who conducted SVT tests for this feature?
b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

a) Brian Hanczaryk b) Yes, SVT agrees that the serviceability of any problem encountered was sufficient to avoid PMRs or L2 should be able to quickly address those problems without engaging L3.

Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

WL3,CDI

Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

This feature does not emit anything itself; however, it provides metrics to OpenTelemetry, which OpenTelemetry then exports. The PR to update the Metrics reference list is here: https://github.com/OpenLiberty/docs/pull/7611

@donbourne
