OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

Provide a way to send Liberty logs to OpenTelemetry #27711

Open donbourne opened 6 months ago

donbourne commented 6 months ago

Description

We need a way for users to be able to direct their Liberty logs to OpenTelemetry.


Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

General Instructions

The process steps occur roughly in the order as presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the feature owner or a delegate of the feature owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.


Prioritization (Complete Before Development Starts)

The OpenLiberty/chief-architect and area leads are responsible for prioritizing the features and determining which features are being actively worked on.

Prioritization

Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID. Furthermore, each identified item places a blocking requirement on another team so it must be identified early in the process. The feature owner may check-off the item if they know it doesn't apply, but otherwise they should work with the focal point to determine what work, if any, will be necessary and make them aware of it.

Design Preliminaries

Design

No Design

FAT Documentation

A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design focused approach should still be applied to features, and developers should think about the feature design prior to writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the Design Approved or No Design Approved label, along with all other tasks outlined in the GA section.

Feature Development Begins

Legal and Translation

In order to avoid last minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either in design or as early into the development as possible. Similarly, translation is to be done concurrently with development. All items below MUST be completed before beta & GA is requested.

Innovation (Complete 1 week before Beta & GA Feature Complete Date)

Legal (Complete before Beta & GA Feature Complete Date)

Translation (Complete by Beta & GA Feature Complete Date)

In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

Beta Blog (Complete by beta eGA)

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

Design Approved Features

Remove Beta Fencing (Complete by Feature Complete Date)

GA Blog (Complete by Friday after GM)

Post GM (Complete before GA)

Post GA

pgunapal commented 4 months ago

Design [DRAFT] :

Subject to change

High Level User Story: As an Operations engineer, I want to be able to export logs from Open Liberty to an Open Telemetry Exporter.

OpenTelemetry defines a Logs Bridge API for emitting LogRecords. Together with the SDK, it can be used with existing logging libraries to automatically inject the trace context into the emitted logs and to provide an easy way to send the logs via OTLP. Instead of modifying each logging statement, log appenders use the API to bridge logs from existing logging libraries to the OpenTelemetry data model, while the SDK controls how the logs are processed and exported. A typical log SDK configuration installs a log record processor and an exporter.

The LogRecordProcessor from the Logs SDK allows us to process and decorate the LogRecord fields so that they map to the OTel Log Data Model.

BatchLogRecordProcessor and SimpleLogRecordProcessor are paired with LogRecordExporter, which is responsible for sending telemetry data to a particular backend.
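To make the processor/exporter pairing concrete, here is a toy model in plain Java. It is only a sketch: `ImmediateProcessor`, `BatchingProcessor`, and the `Consumer`-based exporter are hypothetical stand-ins, not the OpenTelemetry SDK classes, and the timer-driven flushing that a real batch processor performs is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of the two processor styles; these are NOT the OpenTelemetry SDK classes. */
final class ProcessorModelSketch {

    /** Forwards every record to the exporter as soon as it is emitted. */
    static final class ImmediateProcessor {
        private final Consumer<List<String>> exporter;

        ImmediateProcessor(Consumer<List<String>> exporter) {
            this.exporter = exporter;
        }

        void onEmit(String record) {
            exporter.accept(Collections.singletonList(record));
        }
    }

    /** Queues records and exports them once maxBatchSize records are waiting. */
    static final class BatchingProcessor {
        private final Consumer<List<String>> exporter;
        private final int maxBatchSize;
        private final List<String> queue = new ArrayList<>();

        BatchingProcessor(Consumer<List<String>> exporter, int maxBatchSize) {
            this.exporter = exporter;
            this.maxBatchSize = maxBatchSize;
        }

        void onEmit(String record) {
            queue.add(record);
            if (queue.size() >= maxBatchSize) {
                flush();
            }
        }

        /** Exports whatever is queued; a real processor would also do this on a timer. */
        void flush() {
            if (queue.isEmpty()) {
                return;
            }
            exporter.accept(new ArrayList<>(queue));
            queue.clear();
        }
    }

    public static void main(String[] args) {
        // Hypothetical exporter that just prints each batch it receives.
        Consumer<List<String>> printingExporter = batch -> System.out.println("export " + batch);

        ImmediateProcessor immediate = new ImmediateProcessor(printingExporter);
        immediate.onEmit("record-1");

        BatchingProcessor batching = new BatchingProcessor(printingExporter, 2);
        batching.onEmit("record-2");
        batching.onEmit("record-3");
    }
}
```

The point of the model is only the difference in when the exporter is invoked: once per record for the immediate style, once per full batch for the batching style.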

Feature Design:

In Open Liberty, Open Telemetry is initialized using the SDK autoconfiguration extension, instead of manually creating the OpenTelemetry instance by using the SDK builders directly in the code. This approach allows you to autoconfigure the OpenTelemetry SDK based on a standard set of supported environment variables and system properties. Hence, the logging providers can be configured using environment variables. Ref: https://opentelemetry.io/docs/languages/java/instrumentation/#autoconfiguration

Since the mpTelemetry-2.0 OpenTelemetry instance depends on the application thread context, it is difficult to get the instance when we are not a part of the application thread context, such as during server startup. Hence, we need to explicitly create a new server-level OpenTelemetry instance, which would have its own server-specific OTel configuration. This will also work for multi-app scenarios.

By default, the SimpleLogRecordProcessor will be enabled, where the records will be sent immediately. However, if you want to send the records in batches, you can configure the following Batch LogRecord Processor environment variables, which control how often and in what batch sizes the logs are exported, as well as the log record attribute limits. (https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-logrecord-processor)

OTEL_BLRP_SCHEDULE_DELAY : Delay interval (in milliseconds) between two consecutive exports (Default = 1000)
OTEL_BLRP_EXPORT_TIMEOUT : Maximum allowed time (in milliseconds) to export data (Default = 30000)
OTEL_BLRP_MAX_QUEUE_SIZE : Maximum queue size (Default = 2048)
OTEL_BLRP_MAX_EXPORT_BATCH_SIZE : Maximum batch size (Default = 512)

OTEL_LOGRECORD_ATTRIBUTE_VALUE_LENGTH_LIMIT : Maximum allowed attribute value size (Default = no limit)
OTEL_LOGRECORD_ATTRIBUTE_COUNT_LIMIT : Maximum allowed log record attribute count (Default = 128)
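
As a rough illustration of how the defaults listed above apply, the sketch below resolves each numeric setting from an environment map and falls back to the documented default when the variable is unset. In practice the SDK autoconfiguration extension does this work; the class name `BatchLogConfigSketch`, the `resolve` helper, and the choice to fall back on unparsable values are assumptions made for the example.

```java
import java.util.Map;

/** Sketch only: resolves batch log settings from an environment map using the defaults above. */
final class BatchLogConfigSketch {

    /**
     * Returns the numeric value of the named variable, or defaultValue when it is unset or blank.
     * Falling back on unparsable input is an assumption of this sketch, not documented SDK behavior.
     */
    static long resolve(Map<String, String> env, String name, long defaultValue) {
        String raw = env.get(name);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        System.out.println("schedule delay (ms): " + resolve(env, "OTEL_BLRP_SCHEDULE_DELAY", 1000));
        System.out.println("export timeout (ms): " + resolve(env, "OTEL_BLRP_EXPORT_TIMEOUT", 30000));
        System.out.println("max queue size:      " + resolve(env, "OTEL_BLRP_MAX_QUEUE_SIZE", 2048));
        System.out.println("max batch size:      " + resolve(env, "OTEL_BLRP_MAX_EXPORT_BATCH_SIZE", 512));
        System.out.println("attribute count cap: " + resolve(env, "OTEL_LOGRECORD_ATTRIBUTE_COUNT_LIMIT", 128));
    }
}
```
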
<feature>mpTelemetry-2.0</feature>
…
<mpTelemetry logSources="message, trace, accessLog, ffdc, audit"/>

Mapping Open Liberty Log Record to Open Telemetry Logs Data Model (https://opentelemetry.io/docs/specs/otel/logs/data-model/#log-and-event-record-definition) Note: When formatting the event, JSONify the event, so the event is structured properly.

Open Liberty Log Record | Open Telemetry Logs Data Model
=============================================
ibm_datetime = Timestamp
ext_traceId = TraceId
ext_spanId = SpanId
loglevel = SeverityText
(Refer to table below) = SeverityNumber 
message  = Body *** Should be the entire JSON payload instead?
host = Resource["host.name"]
service.name = Resource["service.name"]
io.openliberty.microprofile.telemetry = InstrumentationScope
Map the rest of the fields as Key:Value pairs = Attributes["Key"]
Throwable example snippet:
throwable.getClass().getName() = Attributes[SemanticAttributes.EXCEPTION_TYPE]
throwable.getMessage() = Attributes[SemanticAttributes.EXCEPTION_MESSAGE]
throwable.printStackTrace() = Attributes[SemanticAttributes.EXCEPTION_STACKTRACE]
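
The field mapping in this table could be sketched roughly as follows, using plain maps instead of the OpenTelemetry API. `OtelLogFields` is a hypothetical holder invented for the example (it is not an OTel type), values are kept as strings for simplicity, and `loglevel` is copied as-is; converting it to an OpenTelemetry severity name is a separate step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: routes Liberty event fields to the data model slots named in the table. */
final class LibertyToOtelFieldsSketch {

    /** Hypothetical holder for a subset of log record fields; not an OpenTelemetry class. */
    static final class OtelLogFields {
        String timestamp;
        String traceId;
        String spanId;
        String severityText;
        String body;
        final Map<String, String> resource = new LinkedHashMap<>();
        final Map<String, String> attributes = new LinkedHashMap<>();
    }

    static OtelLogFields map(Map<String, String> libertyEvent) {
        OtelLogFields out = new OtelLogFields();
        for (Map.Entry<String, String> entry : libertyEvent.entrySet()) {
            String value = entry.getValue();
            switch (entry.getKey()) {
                case "ibm_datetime": out.timestamp = value; break;
                case "ext_traceId":  out.traceId = value; break;
                case "ext_spanId":   out.spanId = value; break;
                case "loglevel":     out.severityText = value; break;
                case "message":      out.body = value; break;
                case "host":         out.resource.put("host.name", value); break;
                case "service.name": out.resource.put("service.name", value); break;
                // Every remaining field becomes a key/value attribute.
                default:             out.attributes.put(entry.getKey(), value);
            }
        }
        return out;
    }
}
```
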

Log Level Mapping to Open Telemetry Severity Text (https://opentelemetry.io/docs/specs/otel/logs/data-model/#severity-fields) (https://opentelemetry.io/docs/specs/otel/logs/data-model-appendix/#appendix-b-severitynumber-example-mappings)

Open Liberty Log Level | Open Telemetry Logs Severity Text / Number
====================================================
FATAL = FATAL / 21
SEVERE = ERROR / 17
WARNING = WARN / 13
AUDIT = INFO2 / 10
INFO = INFO / 9
CONFIG = DEBUG4 / 8
DETAIL = DEBUG3 / 7
FINE = DEBUG2 / 6
FINER = DEBUG / 5
FINEST = TRACE / 1
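
The table above translates directly into a lookup. The following is a minimal sketch in plain Java; the class and method names are hypothetical, and mapping unknown levels to severity number 0 (unspecified) is an assumption of the sketch rather than part of the design.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch only: Liberty log level to OpenTelemetry severity text/number, per the table above. */
final class LibertySeverityMapping {

    /** Pair of OpenTelemetry SeverityText and SeverityNumber. */
    static final class Severity {
        final String text;
        final int number;

        Severity(String text, int number) {
            this.text = text;
            this.number = number;
        }
    }

    /** Assumed fallback for unknown or missing levels. */
    static final Severity UNSPECIFIED = new Severity("UNSPECIFIED", 0);

    private static final Map<String, Severity> BY_LEVEL = new LinkedHashMap<>();

    static {
        BY_LEVEL.put("FATAL", new Severity("FATAL", 21));
        BY_LEVEL.put("SEVERE", new Severity("ERROR", 17));
        BY_LEVEL.put("WARNING", new Severity("WARN", 13));
        BY_LEVEL.put("AUDIT", new Severity("INFO2", 10));
        BY_LEVEL.put("INFO", new Severity("INFO", 9));
        BY_LEVEL.put("CONFIG", new Severity("DEBUG4", 8));
        BY_LEVEL.put("DETAIL", new Severity("DEBUG3", 7));
        BY_LEVEL.put("FINE", new Severity("DEBUG2", 6));
        BY_LEVEL.put("FINER", new Severity("DEBUG", 5));
        BY_LEVEL.put("FINEST", new Severity("TRACE", 1));
    }

    /** Case-insensitive lookup of a Liberty level name. */
    static Severity forLevel(String libertyLevel) {
        if (libertyLevel == null) {
            return UNSPECIFIED;
        }
        return BY_LEVEL.getOrDefault(libertyLevel.trim().toUpperCase(Locale.ROOT), UNSPECIFIED);
    }
}
```

For example, `LibertySeverityMapping.forLevel("AUDIT")` yields text `INFO2` and number 10.
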
pgunapal commented 4 months ago

High-level Implementation Design Details [DRAFT] :

Subject to change

-buildpath: io.openliberty.microprofile.telemetry.internal.common;version=latest,\
  io.openliberty.io.opentelemetry.2.0;version=latest

- Ensure the correct metatype is defined for the server configuration of mpTelemetry-2.0 (e.g. `logSources`)
- Update the [OpenTelemetryVersionedConfigurationImpl](https://github.com/OpenLiberty/open-liberty/blob/integration/dev/io.openliberty.microprofile.telemetry.2.0.internal/src/io/openliberty/microprofile/telemetry20/internal/config/OpenTelemetryVersionedConfigurationImpl.java) class file to remove the following lines, since we should be enabling Logs, as part of this feature.

telemetryProperties.put(OpenTelemetryConstants.CONFIG_LOGS_EXPORTER_PROPERTY, "none");
telemetryProperties.put(OpenTelemetryConstants.ENV_LOGS_EXPORTER_PROPERTY, "none");

- In the `OpenTelemetryHandler.activate()` method, retrieve and set the server-level OpenTelemetryInfo object. It will be using the [OpenTelemetryAccessor](https://github.com/OpenLiberty/open-liberty/blob/integration/dev/io.openliberty.microprofile.telemetry.internal.common/src/io/openliberty/microprofile/telemetry/internal/interfaces/OpenTelemetryAccessor.java) interface from the `io.openliberty.microprofile.telemetry.internal.common` project (TBD - details to follow from MP Telemetry team)
- Get the configured OpenTelemetry LoggerProvider by calling `OpenTelemetryInfo.getOpenTelemetry().getLogsBridge()`. (https://javadoc.io/doc/io.opentelemetry/opentelemetry-api/latest/io/opentelemetry/api/OpenTelemetry.html)
- In the `OpenTelemetryHandler.formatEvents()` method, use the previously retrieved LoggerProvider (https://javadoc.io/doc/io.opentelemetry/opentelemetry-api/latest/io/opentelemetry/api/logs/LoggerProvider.html) to build a Logger with the configured instrumentation name, obtain a LogRecordBuilder from it, and use the builder to map the Open Liberty event records to the appropriate OpenTelemetry Log Data Model fields.
- Once the builder is populated with the corresponding fields, call `builder.emit()` to export the logs to the exporter.

Below is a generic high-level code snippet showing how to retrieve the OpenTelemetry LoggerProvider and LogRecordBuilder, map generic JUL LogRecord fields to the OpenTelemetry Log Data Model, and then export the record to the configured exporter.

Snippet is from the [Open Telemetry JUL Java agent instrumentation](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/java-util-logging/javaagent/src/main/java/io/opentelemetry/javaagent/instrumentation/jul/JavaUtilLoggingHelper.java) :

...
String instrumentationName = logger.getName();
if (instrumentationName == null || instrumentationName.isEmpty()) {
  instrumentationName = "ROOT";
}
LogRecordBuilder builder =
    GlobalOpenTelemetry.get()
        .getLogsBridge()
        .loggerBuilder(instrumentationName)
        .build()
        .logRecordBuilder();
mapLogRecord(builder, logRecord);
builder.emit();

private static void mapLogRecord(LogRecordBuilder builder, LogRecord logRecord) {
// message
String message = FORMATTER.formatMessage(logRecord);
if (message != null) {
  builder.setBody(message);
}

// time
long timestamp = logRecord.getMillis();
builder.setTimestamp(timestamp, TimeUnit.MILLISECONDS);

// level
Level level = logRecord.getLevel();
if (level != null) {
  builder.setSeverity(levelToSeverity(level));
  builder.setSeverityText(logRecord.getLevel().getName());
}

AttributesBuilder attributes = Attributes.builder();

// throwable
Throwable throwable = logRecord.getThrown();
if (throwable != null) {
  attributes.put(SemanticAttributes.EXCEPTION_TYPE, throwable.getClass().getName());
  attributes.put(SemanticAttributes.EXCEPTION_MESSAGE, throwable.getMessage());
  StringWriter writer = new StringWriter();
  throwable.printStackTrace(new PrintWriter(writer));
  attributes.put(SemanticAttributes.EXCEPTION_STACKTRACE, writer.toString());
}

if (captureExperimentalAttributes) {
  Thread currentThread = Thread.currentThread();
  attributes.put(SemanticAttributes.THREAD_NAME, currentThread.getName());
  attributes.put(SemanticAttributes.THREAD_ID, currentThread.getId());
}

builder.setAllAttributes(attributes.build());

// span context
builder.setContext(Context.current());

}
...



Note: OpenTelemetry has implemented log appenders for [Log4J](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/log4j/log4j-appender-2.17/library/src/main/java/io/opentelemetry/instrumentation/log4j/appender/v2_17/internal/LogEventMapper.java#L104) and [Logback](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/logback/logback-appender-1.0/library/src/main/java/io/opentelemetry/instrumentation/logback/appender/v1_0/internal/LoggingEventMapper.java).
benjamin-confino commented 4 months ago

Hello.

In the OpenTelemetryHandler.activate() method, retrieve and set the server-level OpenTelemetryInfo object. It will be using the OpenTelemetryAccessor interface from the io.openliberty.microprofile.telemetry.internal.common project (TBD - details to follow from MP Telemetry team)

This should work fine, but there is one hidden gotcha to be aware of. In OpenTelemetryInfoFactory we have a check when an OpenTelemetryInfo is created: `if (j2EEName.startsWith("io.openliberty") || j2EEName.startsWith("com.ibm.ws")) {`. You will need to make sure this if statement returns false when OpenTelemetryHandler calls us. I suspect that without modification it will be true.
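
To illustrate the concern, the prefix check can be isolated as a small predicate. This is only a sketch: the class and method names are made up, and the sample names passed in `main` are hypothetical, since the name the handler would actually present is not stated here.

```java
/** Sketch only: the prefix test quoted above, isolated so its effect on a given name is easy to see. */
final class InternalNameCheckSketch {

    /** Returns true when the name starts with one of the two internal prefixes from the quoted check. */
    static boolean looksInternal(String j2eeName) {
        return j2eeName.startsWith("io.openliberty") || j2eeName.startsWith("com.ibm.ws");
    }

    public static void main(String[] args) {
        // Hypothetical names, used purely for illustration.
        System.out.println(looksInternal("io.openliberty.example.internal")); // true
        System.out.println(looksInternal("myApp#myApp.war"));                 // false
    }
}
```

Any caller whose name begins with either prefix takes the internal branch, which is why a call originating from server-level code would need an explicit exemption or a different code path.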

pgunapal commented 4 months ago

Thanks @benjamin-confino ! Good point, will make sure that doesn't break for us. Right, getOpenTelemetryInfo() will be called by internal code, so it would be true.

yasmin-aumeeruddy commented 1 month ago

POC notes from July 15th:

Further point at end of UFO: Ensure that negative cases are well tested and that instantOn is considered when testing the feature.

pgunapal commented 1 month ago

@yasmin-aumeeruddy Thank you for the notes, I have updated the UFO with the comments from the socialization.

NottyCode commented 1 month ago

Slide 9 - I think we need an epic for access logs and audit. I understand why mapping these log events is non-trivial and less important, but we should do it, especially for access logs. Slide 12 - It is odd that we don't map anything into the body for FFDC. This is raising my "something wrong" thing.

pgunapal commented 3 weeks ago

@NottyCode For Slide 9, we have opened two epics to address access and audit logs.

For Slide 12, we decided to map the exception message from the triggered event to the body, in addition to the Semantic Convention attribute (exception.message).

I have updated the UFO with the above.

pgunapal commented 2 weeks ago

@OpenLiberty/demo-approvers Demo scheduled for EOI 24.17

pgunapal commented 2 weeks ago

@OpenLiberty/id-approvers ID Doc Issue opened: https://github.com/OpenLiberty/docs/issues/7459

pgunapal commented 2 weeks ago

@OpenLiberty/externals-approvers Can you please review the approval for this feature? There are no exposed public APIs as part of this particular feature.

chirp1 commented 2 weeks ago

Slack with Prashanth, David, Ram, me. Prashanth provided the necessary info in the following doc issue: https://github.com/OpenLiberty/docs/issues/7459. I approved the feature.

tngiang73 commented 1 week ago

@pgunapal: WASWIN is good with the STE slides. STE approved.

donbourne commented 1 week ago

OL:

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

  1. UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

  2. Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team).
    a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

  3. SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

  4. Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

  5. Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

donbourne commented 5 days ago

@clarkek123 will be handling the serviceability approval for this epic.

dave-waddling commented 4 days ago

Thanks for completing the FTS. The results of the mini-SOE look good so adding the FAT Focal approval.

pgunapal commented 4 days ago

@clarkek123 I have filled out the template below. Can you please review the Serviceability approval for this feature?

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation?

A: Yes. For the logging component of mpTelemetry-2.0, the logs should be exported automatically if the mpTelemetry-2.0 feature is enabled in the server.xml, otel.sdk.disabled=false is configured either in Liberty or in the application, and otel.logs.exporter is set to a valid exporter. Below are some of the common user error scenarios for logs not being exported that we tested in FATs, as well as manually.

Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. IBM Support, test team, or another development team). a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

A: a) The problem paths mentioned above were also demoed. b) Local, EOI, SVT, Performance teams. c) Yes

SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that IBM Support should be able to quickly address those problems without need to engage SMEs?

A: a) Daniel Guinan b) Yes

Which IBM Support / SME queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective IBM Support/SME teams know they are supporting it. Ask Don Bourne if you need links or more info.

A: WAS L3: Logging

Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs?

A: No

clarkek123 commented 4 days ago

Based on the Serviceability information provided above, which shows that the common error paths were tested and demoed with approval from the Local, SVT and Performance teams, and that SVT signed off on the paths included for serviceability, I have added the Serviceability approval for this feature.