Enable verbose garbage collection by default on IBM Java/Semeru

kgibm commented 1 year ago

Description

By default, verbosegc is not enabled in Liberty (specifically, not enabled by default in Java). This is a problem if a performance or OutOfMemoryError issue occurs as the issue will often need to be reproduced with verbosegc, or users may simply overlook garbage collection performance issues (e.g. thread dumps may point to various application stacks but the underlying issue could be garbage collection). Verbosegc was enabled by default for new profiles in WAS traditional 9.0.0.3 and 9.0.0.4 (z/OS). This epic proposes to enable verbose garbage collection by default on IBM Java/Semeru. Initially discussed in design issue #23001.

Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

Aha: N/A
UFO: https://ibm.box.com/s/o3isyhh62xixic925g8m7qto5ufgw4nk
FTS: Link to Feature Test Summary GH Issue
Beta Blog: Link to Beta Blog Post GH Issue
GA Blog: Link to GA Blog Post GH Issue
Process Overview
Prioritization
Design
Implementation
Legal and Translation
Beta
GA
- Focal Point Approvals
Other Deliverables

General Instructions

The process steps occur roughly in the order as presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.

Prioritization (Complete Before Development Starts)

The (OpenLiberty/chief-architect) and area leads are responsible for prioritizing the features and determining which features are being actively worked on.

Prioritization

[x] Feature added to the "New" column of the Open Liberty project board
- Epics can be added to the board in one of two ways:
- From this issue, use the "Projects" section to select the appropriate project board.
- From the appropriate project board click "Add card" and select your Feature Epic issue
[x] Priority assigned
- Attend the Liberty Backlog Prioritization meeting
Design (Complete Before Development Starts)

Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID.

Design Preliminaries

[x] UI requirements identified. (Owner and UI focal point)
[x] ID requirements identified. (Owner and ID focal point)
- Refer to Documenting Open Liberty.
- Feature Owner adds label ID Required, if non-trivial documentation needs to be created by the ID team.
- ID adds label ID Required - Trivial, if no design will be performed and only trivial ID updates are needed.
[x] Serviceability Requirements Identified. (Owner and Serviceability focal point)
[x] SVT Requirements Identified. (Owner and SVT focal point)
[x] Performance testing requirements identified. (Owner and Performance focal point)

Design

[x] POC Design / UFO review requested.
- Owner adds label Design Review Request
[x] POC Design / UFO review scheduled.
- Follow the instructions in POC-Forum repo
[x] POC Design / UFO review completed.
[x] POC / UFO Review follow-ons completed.
[x] POC Design / UFO approval requested.
- Owner adds label Design Approval Request
[x] Design / UFO approved. (OpenLiberty/chief-architect) or N/A
- (OpenLiberty/chief-architect) adds label Design Approved
- Add the public link to the UFO in Box to the Documents section.
- The UFO must always accurately reflect the final implementation of the feature. Any changes must be first approved. Afterwards, update the UFO by creating a copy of the original approved slide(s) at the end of the deck and prepend "OLD" to the title(s). A single updated copy of the slide(s) should take the original's place, and have its title(s) prepended with "UPDATED".

No Design

[ ] No Design requested.
- Owner adds label No Design Approval Request
[ ] No Design / No UFO approved. (OpenLiberty/chief-architect) or N/A
- Approver adds label No Design Approved
[ ] Feature / Capability stabilization or discontinuation or N/A
- Owner adds label Product Management Approval Request and notifies OpenLiberty/product-management
- Approver adds label Product Management Approved (OpenLiberty/product-management)
- Note: For stabilized, superseded, and discontinued feature/capability, skip the Beta section of the template (you may delete it). Otherwise, proceed as normal.

FAT Documentation

[x] "Feature Test Summary" child task created
- Use the Feature Test Summary Template
- Add FTS issue link to the Documents section.
Implementation

A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design focused approach should still be applied to features, and developers should think about the feature design prior to writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the "Design Approved" or "No Design Approved" label, along with all other tasks outlined in the GA section.

Feature Development Begins

[x] Add the In Progress label

Legal and Translation

In order to avoid last minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either in design or as early into the development as possible. Similarly, translation is to be done concurrently with development. Both MUST be completed before Beta or GA is requested.

Legal (Complete before Feature Complete Date)

[ ] Changed or new open source libraries are cleared and approved, or N/A. (Legal Release Services/Cass Tucker/Release PM).

Innovation (Complete 1 week before Feature Complete Date)

[ ] Consider whether any aspects of the feature may be patentable. If any identified, disclosures have been submitted.

Translation (Complete by Feature Complete Date)

[ ] PII (Program Integrated Information) updates are merged (i.e. all English strings due for translation have been delivered), or N/A.
Beta

In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

[ ] Beta fence the functionality
- E.g. kind=beta, ibm:beta, ProductInfo.getBetaEdition()
[ ] Beta development complete and feature ready for inclusion in a beta release
- Add label target:beta and the appropriate target:YY00X-beta (where YY00X is the targeted beta version).
[ ] Feature delivered into beta
- (OpenLiberty/release-manager) adds label release:YY00X-beta (where YY00X is the first beta version that included the functionality).

Beta Blog (Complete by beta eGA)

[ ] Beta blog issue created and populated using the Open Liberty BETA blog post template.
- Add a link to the beta blog issue in the Documents section.
- Note: This is for inclusion into the overall beta release blog post. If, in addition, you'd also like to create a dedicated blog post about your feature, then follow the "Standalone Feature Blog Post" instructions under the Other Deliverables section.
GA

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

[x] Feature implementation and tests completed.
- [x] All PRs are merged.
- [x] All related/child issues are closed.
- [x] All stop ship issues are completed.
[x] Legal: all necessary approvals granted.
[x] Translation: Feature may only proceed to GA if it has either Translation - Complete or Translation - Missing label
- If all translation has been delivered to release branch, feature owner adds label Translation - Complete.
- If missing translation does not cause a break in functionality, nor a security or production outage risk, feature owner adds label Translation - Missing.
- Once all missing translations are delivered, the Translation - Missing label is replaced with Translation - Complete.
- If missing translation could cause a break in functionality or a security or production outage risk, feature owner adds the Translation - Blocked label.
- Features with Translation - Blocked may NOT proceed to GA until the label has been replaced with either Translation - Missing or Translation - Complete.
- For further guidance, contact Globalization focal point or the Release Architect.
[x] GA development complete and feature ready for inclusion in a GA release
- Add label target:ga and the appropriate target:YY00X (where YY00X is the targeted GA version).
- Inclusion in a release requires the completion of all Focal Point Approvals.

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

[x] APIs/Externals - Externals have been reviewed or N/A. (OpenLiberty/externals-approvers)
- Approver adds label focalApproved:externals
[x] Demo - Demo is scheduled for an upcoming EOI or N/A. (OpenLiberty/demo-approvers)
- Add comment @OpenLiberty/demo-approvers Demo scheduled for EOI [Iteration Number] to this issue.
- Approver adds label focalApproved:demo.
[x] FAT - All Tests complete and running successfully in SOE or N/A. (OpenLiberty/fat-approvers)
- Approver adds label focalApproved:fat.

Design Approved Features

[x] ID - Documentation is complete or N/A. (OpenLiberty/id-approvers)
- Approver adds label focalApproved:id.
- NOTE: If only trivial documentation changes are required, you may reach out to the ID Feature Focal to request an ID Required - Trivial label. Unlike features with a regular ID requirement, those with the ID Required - Trivial label do not have a hard requirement for a Design/UFO.
[x] InstantOn - InstantOn capable or N/A. (OpenLiberty/instantOn-approvers)
- Approver adds label focalApproved:instantOn.
[x] Performance - Performance testing is complete or N/A. (OpenLiberty/performance-approvers)
- Approver adds label focalApproved:performance.
[x] Serviceability - Serviceability has been addressed or N/A. (OpenLiberty/serviceability-approvers)
- Approver adds label focalApproved:sve.
[x] STE - Skills Transfer Education chart deck is complete or N/A. (OpenLiberty/ste-approvers)
- Approver adds label focalApproved:ste.
[x] SVT - System Verification Test is complete or N/A. (OpenLiberty/svt-approvers)
- Approver adds label focalApproved:svt.

Remove Beta Fencing (Complete by Feature Complete Date)

[x] Beta guards are removed, or N/A
- Only after all necessary Focal Point Approvals have been granted.

GA Blog (Complete by Friday after GM)

[x] GA Blog issue created and populated using the Open Liberty GA release blog post template.
- Add a link to the GA Blog issue in the Documents section.
- Note: This is for inclusion into the overall release blog post. If, in addition, you'd also like to create a dedicated blog post about your feature, then follow the "Standalone Feature Blog Post" instructions under the Other Deliverables section.

Post GM (Complete before GA)

[x] After confirming this feature has been included in the GM driver, feature owner closes this issue.

Post GA

[ ] Remove the target:ga and target:YY00X labels, and add the appropriate release:YY00X. (OpenLiberty/release-manager)
Other Deliverables
[ ] Standalone Feature Blog Post - A blog post specifically about your feature or N/A. (Feature owner and OpenLiberty/release-architect)
- This should be strongly considered for larger or more prominent features.
- Follow instructions in the blogs repo.
[ ] OL Guides - OL Guides assessment is complete or N/A. (OpenLiberty/guide-assessment)
[ ] Dev Experience - Developer Experience & Tools work is complete or N/A. (OpenLiberty/dev-experience-assessment)

tjwatson commented 1 year ago

UFO review comments/questions:

Add details on environments tested: bare metal, container etc. - clarify the performance characteristics across environments. If they all show the same then state that
What about volume mounted logs? Will that show different performance behaviors?
Questions about how the VM writes to disk, open/sync/close or writes to an open file that flush over time (async).
Question on if we can warn users when we detect possible slow environments.
- Determined it is overkill to do this without significant performance issues for environments tested.
How do we inform users they need to look into verbose GC logs?
- verbose GC logs are meant for L2 not for customers
- Make sure must gather docs for liberty are updated to also gather the GC logs
Should this be disabled for performance benchmarks when comparing other JVMs? No, should benchmark what the customer runs with by default.
Run SOE tests with the option enabled to determine the log file size impact on the build logs gathered during the test runs.
- Do not expect the size to impact the SOE tests much, should compress very well when zipped up.
Consider updating tWAS defaults to align with the Liberty defaults being proposed in this UFO
Consider placing the GC logs under a subfolder in the logs/ directory. Something like logs/gclogs/? Need to check with some team on an opinion there (I forgot to make a note on the team to ask).
Do logs have sensitive data? Noted this will be covered later. Assume this will be discussed in the follow-up review that covers the rest of the UFO document.
Note that server.env is available to set the options for environment values.
Verbose GC settings must only apply to start and run. Further consideration is needed for the checkpoint action.
Investigate looking in $JAVA_HOME/release to determine JVM variant. This file is standard Java 9+. Has things like JVM_VARIANT="Openj9" that could be quickly scanned to determine what options to use.
Considerations for what the native launcher on Z needs for this feature.
InstantOn support will need to consider how verbose GC options can change from checkpoint to restore
Additional communication is needed for Z with WebSphere Liberty
- How to use log files instead of jobs for verbose GC on Z
How to beta this? The server scripts (.sh and .bat) need to be updated, and the changes will be the same for GA and beta releases.
- While in beta the option is disabled by default. In the beta zip add an etc/server.env that enables the option for verbose GC.
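
The $JAVA_HOME/release idea above can be sketched as follows. This is a minimal illustration in Python, not the server script change itself. The helper names `read_jvm_variant` and `is_openj9` are hypothetical, and the case-insensitive comparison is an assumption made here because the example value in the review notes is spelled "Openj9".

```python
from pathlib import Path
from typing import Optional


def read_jvm_variant(java_home: str) -> Optional[str]:
    """Return the JVM_VARIANT value from <java_home>/release, or None.

    The release file is a list of KEY="value" lines. None is returned when
    the file is missing, unreadable, or has no JVM_VARIANT entry.
    """
    release = Path(java_home) / "release"
    try:
        text = release.read_text(encoding="utf-8", errors="replace")
    except OSError:
        return None
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "JVM_VARIANT":
            # Values are normally double-quoted; strip the quotes.
            return value.strip().strip('"')
    return None


def is_openj9(java_home: str) -> bool:
    """Assumption: compare case-insensitively, since spelling may vary."""
    variant = read_jvm_variant(java_home)
    return variant is not None and variant.lower() == "openj9"
```

A caller would pass the resolved Java home, for example `is_openj9(os.environ["JAVA_HOME"])`, and fall back to "not OpenJ9" when the file cannot be read.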

tjwatson commented 1 year ago

Part 2 UFO Review

Should we do this for HotSpot? It is technically possible to detect HotSpot using the release file.
- Initial verbose gc is required for OpenJ9 (IBM JVM/Semeru) not hotspot.
- Raise question to L2 to figure out what % of memory cases is on hotspot vs IBM J9
- Document needed for the JVM options to do verbose GC with Hotspot
Need to coordinate with checkpoint action until we consume EA OpenJ9 build that has fix for InstantOn
Should there be a new section in the Logs documentation for gc logs? Answer - Yes.
Discussed the fact that verbose GC logs will get put into a checkpoint image (InstantOn). Should be small though, so not concerned.
Discussed security concerns around jvmargs in verbose GC logs. Reach out to Gary Pitcher for concerns. On call we thought it was "safe", but confirmation should be done with the security team.
Discussed log name, decided not to make it unique per server.
- Is the gc log name an external now? Yes, we likely will consider it an external, the name of the log will be documented in the Logs documentation. Do not call out the log name explicitly as an external (the same as console, messages and trace logs).
User settings for GC log rotations take precedence.
Discussed current customers using -verbose:gc to log to the console. They would now get the better behavior of rotating GC logs, but this is a behavior change.
- Should detect any user specified verbosegc option and not do anything if they already have it configured.
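
The "detect any user specified verbosegc option" check can be sketched roughly as below. This is an illustrative Python sketch under stated assumptions, not the logic that was delivered: the function name `user_configured_verbosegc` is hypothetical, and the set of option spellings treated as "user configured" (`-verbose:gc` and anything starting with `-Xverbosegclog`) is an assumption chosen for the example.

```python
from typing import Iterable

# Assumption: these two spellings are what counts as a user-specified
# verbose GC option for the purpose of this sketch.
USER_VERBOSEGC_PREFIXES = ("-verbose:gc", "-Xverbosegclog")


def user_configured_verbosegc(jvm_options: Iterable[str]) -> bool:
    """Return True if any option already configures verbose GC.

    When this returns True, the default verbose GC settings would be
    skipped so that the user's own configuration is left untouched.
    """
    return any(
        opt.strip().startswith(USER_VERBOSEGC_PREFIXES) for opt in jvm_options
    )
```

For example, a jvm.options file containing `-Xverbosegclog:gc.log,5,1000` would be treated as already configured, while one containing only `-Xmx512m` would not.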

NottyCode commented 1 year ago

@kgibm can you add a comment to indicate how the socialization feedback was addressed?

kgibm commented 1 year ago

@NottyCode Sure. How each item was addressed is covered on slides 38-42 of the UFO; copying in:

Add details on environments tested: bare metal, container etc. - clarify the performance characteristics across environments. If they all show the same then state that
- Added details to performance slides
What about volume mounted logs? Will that show different performance behaviors?
- Container environment was tested
Questions about how the VM writes to disk, open/sync/close or writes to an open file that flush over time (async).
- Added to Feature Design slide
Make sure must gather docs for liberty are updated to also gather the GC logs
- Added to Communication slide
Run SOE tests with the option enabled to determine the log file size impact on the build logs gathered during the test runs.
- Added to System Test Impact slide
Consider placing the GC logs under a subfolder in the logs/ directory. Something like logs/gclogs/? Need to check with some team on an opinion there (I forgot to make a note on the team to ask).
- Decided in second meeting that this is unnecessary
Note that server.env is available to set the options for environment values.
- Added to Communication slide
Verbose GC settings must only apply to start and run. Further consideration is needed for the checkpoint action.
- Added to Feature Design slide
Investigate looking in $JAVA_HOME/release to determine JVM variant. This file is standard Java 9+. Has things like JVM_VARIANT="Openj9" that could be quickly scanned to determine what options to use.
- Discussed in second meeting and outcome covered in a subsequent slide
Considerations for what the native launcher on Z needs for this feature.
- Added to Feature Design slide
InstantOn support will need to consider how verbose GC options can change from checkpoint to restore
- Tom will follow up on this as part of InstantOn work
Additional communication is needed for Z with WebSphere Liberty
- Added to Communication slide
How to use log files instead of jobs for verbose GC on Z
- Added to Communication slide
How to beta this? The server scripts (.sh and .bat) need to be updated, and the changes will be the same for GA and beta releases.
- Added to Beta slide
While in beta the option is disabled by default. In the beta zip add an etc/server.env that enables the option for verbose GC.
- Added to Beta slide
Should we do this for HotSpot? It is technically possible to detect HotSpot using the release file.
- L2 and Don both said HotSpot is a low proportion of cases, and we agreed on the call to leave this out of this MVP
Document needed for the JVM options to do verbose GC with Hotspot
- Added to Communication page
Need to coordinate with checkpoint action until we consume EA OpenJ9 build that has fix for InstantOn
- Tom will handle
Should there be a new section in the Logs documentation for gc logs?
- Yes, added to Communication Page
Discussed the fact that verbose GC logs will get put into a checkpoint image (InstantOn).
- Discussed and should be small so not concerned.
Discussed security concerns around jvmargs in verbose GC logs. Reach out to Gary Pitcher for concerns. On call we thought it was "safe", but confirmation should be done with the security team.
- Gary approved by email
Discussed log name, decided not to make it unique per server. Is the gc log name an external now?
- Yes, we likely will consider it an external, the name of the log will be documented in the Logs documentation. Do not call out the log name explicitly as an external (the same as console, messages and trace logs).
User settings for GC log rotations take precedence.
- Added a test
Discussed current customers using -verbose:gc to log to the console. They would now get the better behavior of rotating GC logs, but this is a behavior change. Should detect any user specified verbosegc option and not do anything if they already have it configured.
- Updated slides

NottyCode commented 1 year ago

@kgibm I'm not seeing the following updates:

Run SOE tests with the option enabled to determine the log file size impact on the build logs gathered during the test runs.
- Added to System Test Impact slide
Verbose GC settings must only apply to start and run. Further consideration is needed for the checkpoint action.
- Added to Feature Design slide
How to use log files instead of jobs for verbose GC on Z
- Added to Communication slide

kgibm commented 1 year ago

@NottyCode

I'm not seeing the following updates:

* Run SOE tests with the option enabled to determine the log file size impact on the build logs gathered during the test runs.
  Added to System Test Impact slide

Sorry, that's on the Automated Testing slide 28 instead. I'll update the comment.

* Verbose GC settings must only apply to start and run. Further consideration is needed for the checkpoint action.
  Added to Feature Design slide

On slide 12: "Use SERVER_*_JAVA_OPTIONS so that it only applies to start and run actions, not all actions"

* How to use log files instead of jobs for verbose GC on Z
  Added to Communication slide

On slide 18: "and how to use HFS/ZFS instead with proper tokens if desired"

rsherget commented 5 months ago

@OpenLiberty/demo-approvers Demo scheduled for EOI 24.04

donbourne commented 5 months ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?
Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. L2, test team, or another development team).
a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that L2 should be able to quickly address those problems without need to engage L3?
SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that L2 should be able to quickly address those problems without need to engage L3?
Which L2 / L3 queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective L2/L3 teams know they are supporting it. Ask Don Bourne if you need links or more info.
Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

rsherget commented 4 months ago

@OpenLiberty/serviceability-approvers

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

Yes, the UFO identifies the most likely problems customers will see, as well as how to debug/solve them. The scenarios have also been tested with FAT testing.

Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. L2, test team, or another development team). a) What problem paths were tested and demonstrated?
- Default Server creates verbosegc log
- Adding VERBOSEGC=false to server.env turns off logging.
- Adding custom verbosegc configuration takes precedence over the default.
- Adding custom configuration while VERBOSEGC=false is in server.env still allows user configuration to work.
- Adding VERBOSEGC=true still creates verbosegc log.
- Non-IBM Java versions don't create verbosegc log.
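
The six problem paths above can be summarized as one decision function. The following Python sketch is only a model of the behavior described in this comment, not the implementation. The name `default_verbosegc_applies` is hypothetical, and treating `VERBOSEGC` case-insensitively (with any value other than "false" leaving the default on) is an assumption.

```python
from typing import Optional


def default_verbosegc_applies(
    verbosegc_env: Optional[str],
    user_has_verbosegc: bool,
    jvm_is_ibm_or_semeru: bool,
) -> bool:
    """Model whether the default verbose GC log would be enabled.

    verbosegc_env: value of VERBOSEGC from server.env, or None if unset.
    user_has_verbosegc: the user supplied their own verbosegc configuration.
    jvm_is_ibm_or_semeru: the server is running on IBM Java or Semeru.
    """
    if not jvm_is_ibm_or_semeru:
        return False  # Non-IBM Java: no default verbosegc log.
    if user_has_verbosegc:
        return False  # Custom configuration takes precedence.
    if verbosegc_env is not None and verbosegc_env.strip().lower() == "false":
        return False  # Explicit opt-out in server.env.
    return True  # Unset or VERBOSEGC=true: default log is created.
```

In this model a custom configuration combined with `VERBOSEGC=false` returns False for the default, which matches the tested path where the user's own configuration still works.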

b) Who did you demo to?

Jim Blye

c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that L2 should be able to quickly address those problems without need to engage L3?

Yes

SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature?

No explicit SVT was required, but Brian Hanczaryk is the SVT Feature Focal Point.

b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that L2 should be able to quickly address those problems without need to engage L3?

No explicit SVT was performed for this feature.

Which L2 / L3 queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective L2/L3 teams know they are supporting it. Ask Don Bourne if you need links or more info.

WAS L2: ADM
WAS L3: Kernel

Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

N/A

rsherget commented 4 months ago

@OpenLiberty/ste-approvers The STE Slidedeck has been uploaded to the STE Archive.

rsherget commented 4 months ago

@OpenLiberty/svt-approvers - There are no SVT requirements for this feature. Please let me know if approval can be granted or if anything else is needed.

gnadell commented 4 months ago

WASSDK Support is good with the STE slides. Hence approving.

rsherget commented 4 months ago

@OpenLiberty/performance-approvers Can you please review the Performance approval for this feature? Please let me know if approval can be granted or if anything else is needed.

rsherget commented 4 months ago

@OpenLiberty/instanton-approvers Can you please review the InstantOn approval for this feature? Please let me know if approval can be granted or if anything else is needed.

chirp1 commented 4 months ago

The developer opened the following documentation issue: https://github.com/OpenLiberty/docs/issues/7240 The ID team has incorporated the updates. The developer has approved the updates. Approving.

LifeIsGood524 commented 4 months ago

This PR is merged and in a release build. Consulted with Eric and Harry....closing issue.... @rsherget @hlhoots

LifeIsGood524 commented 4 months ago

see above comment
