Jaeger pod showing significant memory increase when executing app load against AcmeAir using mpTelemetry-1.1, microProfile-4.1 and webProfile-8.0 features.

hanczaryk commented 1 year ago

SVT: Jaeger pod showing significant memory increase when executing app load against AcmeAir using mpTelemetry-1.1, microProfile-4.1 and webProfile-8.0 features.

Describe the bug
SVT is testing a 24 hour app load stress run against AcmeAir using mpTelemetry-1.1, microProfile-4.1 and webProfile-8.0 features.

I'm testing using a recent daily openliberty build (wlp-1.0.84.cl231220231115-1101) in which this mpTelemetry-1.1 now tolerates older MP and EE versions.

The jaeger pod memory grows so steadily that it exhausts all memory available on the node. To combat this, I edit the jaeger deployment to specify 8Gi memory limit to avoid the entire node going down.

        resources:
          limits:
            cpu: 768m
            memory: 8Gi
          requests:
            cpu: 768m

With this 8Gi memory limit set, jaeger pod exceeds the memory limit in just under an hour during the Acmeair app load.

Here is a screenshot from the OCP console showing the Metrics for my jaeger pod with a significant memory growth over a 30 minute timeframe.

Steps to Reproduce
On an OpenShift cluster:

Deploy OpenTelemetry Collector Operator (optional)
Deploy AcmeAir with 5 microservices running mpTelemetry-1.1 on Liberty
Deploy Jaeger
Run the applications for a long period of time and ensure -- Spans are received on Jaeger -- No memory leak in Liberty containers

The following are instructions that work using MicroProfile with mpTelemetry.

Deploy OpenTelemetry Collector Operator

To deploy OpenTelemetry Collector Operator (optional), utilize the OpenShift console to install operator using defaults for 'Community OpenTelemetry Operator'. The following is a screenshot.

After the operator is installed, create instances for OpenTelemetry Instrumentation and OpenTelemetry Collector.

Deploy AcmeAir microservice applications running mpTelemetry-1.1 on Liberty

The 5 AcmeAir microservice repos are located at https://github.com/blueperf/ (choosing the microprofile-4.1 branch)

Follow the instructions from README.md at https://github.com/blueperf/acmeair-mainservice-java to install AcmeAir on OpenShift. To enabled verboseGC for this stress test, you can edit the jvm.options for each of the 5 microservices to include your desired values such as

[XXXX logs]# cat ../acmeair-authservice-java/src/main/liberty/config/jvm.options 
-Dhttp.keepalive=true
-Dhttp.maxConnections=100
-verbose:gc
-Xdump:heap
-Xaggressive
-Xverbosegclog:/logs/verbosegc.%seq.log,5,300000

Deploy Jaeger

To deploy Community Jaeger Operator, utilize the OpenShift console to install operator using default. The following is a screenshot.

After the Community Jaeger Operator is installed, use the OpenShift console to create a Jaeger instance. Before creating the Jaeger instance, switch to the 'acmeair' namespace and use 'jaeger-all-in-one-inmemory' for the name.

Run the applications for a long period of time

Use jmeter to drive AcmeAir application load for 24 hours. While the application load is running, access the Jaeger UI using the jaeger route shown in the OpenShift console. Ensure spans are received on Jaeger for the services.

The following are instruction modifications that work using MicroProfile 4.1 with mpTelemetry.

For the 5 microservice Dockerfile's I set the following env info

ENV OTEL_TRACES_EXPORTER=otlp
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-all-in-one-inmemory-collector:4317
ENV OTEL_SERVICE_NAME=XXXservice
ENV OTEL_SDK_DISABLED=false
ENV OTEL_METRICS_EXPORTER=none

Expected behavior
I expect the the jaeger pod memory consumption will at some point plateau and stabilize.

Diagnostic information:

OpenLiberty Version: daily openliberty-all build (wlp-1.0.84.cl231220231115-1101)
Affected feature(s) : mpTelemetry-1.1

Java Version: Acmeair microservices java versions in use are

bash-4.4$ java -version
openjdk version "17.0.8.1" 2023-08-24
IBM Semeru Runtime Open Edition 17.0.8.1 (build 17.0.8.1+1)
Eclipse OpenJ9 VM 17.0.8.1 (build openj9-0.40.0, JRE 17 Linux amd64-64-Bit Compressed References 20230824_549 (JIT enabled, AOT enabled)
OpenJ9   - d12d10c9e
OMR      - e80bff83b
JCL      - 8ecf238a124 based on jdk-17.0.8.1+1)

server.xml configuration (WITHOUT sensitive information like passwords) The following is the server.xml for the acmeair-flightservice. Each of the 5 acmeair microservices are similar with the same features enabled.

<?xml version="1.0" encoding="UTF-8"?>
<server>
<!-- Enable features -->
<featureManager>
<feature>microProfile-4.1</feature>
<feature>webProfile-8.0</feature>
<feature>mpTelemetry-1.1</feature>
</featureManager>

<!-- To access this server from a remote client add a host attribute to the following element, e.g. host="*" -->
<httpEndpoint 
id="defaultHttpEndpoint" 
host="*" 
httpPort="${DEFAULT_HTTP_PORT}" 
httpsPort="${DEFAULT_HTTPS_PORT}">
<accessLogging 
  enabled="${ACCESS_LOGGING_ENABLED}" 
  filepath="${server.output.dir}/logs/http_defaultEndpoint_access.log" 
  logFormat='%h %u %t "%r" %s %b %D %{User-agent}i'>
</accessLogging>
</httpEndpoint>

<quickStartSecurity userName="${env.USERNAME}" userPassword="${env.PASSWORD}" />

<logging 
consoleFormat="${LOGGING_FORMAT}" 
consoleSource="message,trace,accessLog,ffdc,audit" 
messageFormat="${LOGGING_FORMAT}" 
messageSource="message,trace,accessLog,ffdc,audit" 
traceSpecification="${TRACE_SPEC}" />

<webApplication name="acmeair-flightservice" location="acmeair-flightservice-java-4.1.war" contextRoot="/flight">
<!-- enable visibility to third party apis -->
<classloader apiTypeVisibility="api,ibm-api,spec,stable,third-party" />
</webApplication>

<keyStore id="defaultKeyStore" password="secret" />

<cors 
domain="/flight" 
allowedOrigins="*" 
allowedMethods="GET, DELETE, POST, OPTIONS" 
allowedHeaders="*" 
allowCredentials="true" 
maxAge="3600" />
</server>

If it would be useful, upload the messages.log file found in $WLP_OUTPUT_DIR/messages.log The following snippet from acmeair-flightservice's messages.log shows the application starting successfully but eventually logging messages that fail to export spans when the jaeger pod has to restart when it runs out of memory.

[11/15/23, 18:47:24:330 UTC] 00000066 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application acmeair-flightservice started in 3.092 seconds.
[11/15/23, 18:47:24:347 UTC] 0000003a com.ibm.ws.tcpchannel.internal.TCPPort                       I CWWKO0219I: TCP Channel defaultHttpEndpoint has been started and is now listening for requests on host *  (IPv6) port 9080.
[11/15/23, 18:47:24:349 UTC] 0000003a com.ibm.ws.tcpchannel.internal.TCPPort                       I CWWKO0219I: TCP Channel defaultHttpEndpoint-ssl has been started and is now listening for requests on host *  (IPv6) port 9443.
[11/15/23, 18:47:24:359 UTC] 0000003a com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0012I: The server installed the following features: [appSecurity-2.0, appSecurity-3.0, beanValidation-2.0, cdi-2.0, distributedMap-1.0, ejbLite-3.2, el-3.0, jaspic-1.1, jaxrs-2.1, jaxrsClient-2.1, jdbc-4.2, jndi-1.0, jpa-2.2, jpaContainer-2.2, jsf-2.3, json-1.0, jsonb-1.0, jsonp-1.1, jsp-2.3, jwt-1.0, managedBeans-1.0, microProfile-4.1, monitor-1.0, mpConfig-2.0, mpFaultTolerance-3.0, mpHealth-3.1, mpJwt-1.2, mpMetrics-3.0, mpOpenAPI-2.0, mpOpenTracing-2.0, mpRestClient-2.0, mpTelemetry-1.1, opentracing-2.0, servlet-4.0, ssl-1.0, webProfile-8.0, websocket-1.1].
[11/15/23, 18:47:24:359 UTC] 0000003a com.ibm.ws.kernel.feature.internal.FeatureManager            I CWWKF0008I: Feature update completed in 4.681 seconds.
[11/15/23, 18:47:24:359 UTC] 0000003a com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 6.238 seconds.
[11/15/23, 18:47:31:510 UTC] 00000064 SystemOut                                                    O useFlightDataRelatedCaching : true
[11/15/23, 18:47:31:635 UTC] 00000064 org.mongodb.driver.cluster                                   I Cluster created with settings {hosts=[acmeair-flight-db:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
[11/15/23, 18:47:31:683 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I #### Mongo DB Server acmeair-flight-db ####
[11/15/23, 18:47:31:683 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I #### Mongo DB Port 27017 ####
[11/15/23, 18:47:31:683 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I #### Mongo DB is created with DB name acmeair_flightdb ####
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I #### MongoClient Options ####
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I maxConnectionsPerHost : 100
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I minConnectionsPerHost : 0
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I maxWaitTime : 120000
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I connectTimeout : 10000
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I socketTimeout : 0
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I sslEnabled : false
[11/15/23, 18:47:31:684 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I threadsAllowedToBlockForConnectionMultiplier : 5
[11/15/23, 18:47:31:685 UTC] 00000064 com.acmeair.mongo.ConnectionManager                          I Complete List : MongoClientOptions{description='null', applicationName='null', compressors='[]', readPreference=primary, writeConcern=WriteConcern{w=null, wTimeout=null ms, fsync=null, journal=null, retryWrites=false, readConcern=com.mongodb.ReadConcern@0, codecRegistry=org.bson.codecs.configuration.ProvidersCodecRegistry@d113f468, serverSelector=null, clusterListeners=[], commandListeners=[], minConnectionsPerHost=0, maxConnectionsPerHost=100, threadsAllowedToBlockForConnectionMultiplier=5, serverSelectionTimeout=30000, maxWaitTime=120000, maxConnectionIdleTime=0, maxConnectionLifeTime=0, connectTimeout=10000, socketTimeout=0, socketKeepAlive=true, sslEnabled=false, sslInvalidHostNamesAllowed=false, sslContext=null, alwaysUseMBeans=false, heartbeatFrequency=10000, minHeartbeatFrequency=500, heartbeatConnectTimeout=20000, heartbeatSocketTimeout=20000, localThreshold=15, requiredReplicaSetName='null', dbDecoderFactory=com.mongodb.DefaultDBDecoder$1@7f1ecf10, dbEncoderFactory=com.mongodb.DefaultDBEncoder$1@7e475cd5, socketFactory=null, cursorFinalizerEnabled=true, connectionPoolSettings=ConnectionPoolSettings{maxSize=100, minSize=0, maxWaitQueueSize=500, maxWaitTimeMS=120000, maxConnectionLifeTimeMS=0, maxConnectionIdleTimeMS=0, maintenanceInitialDelayMS=0, maintenanceFrequencyMS=60000, connectionPoolListeners=[]}, socketSettings=SocketSettings{connectTimeoutMS=10000, readTimeoutMS=0, keepAlive=true, receiveBufferSize=0, sendBufferSize=0}, serverSettings=ServerSettings{heartbeatFrequencyMS=10000, minHeartbeatFrequencyMS=500, serverListeners='[]', serverMonitorListeners='[]'}, heartbeatSocketSettings=SocketSettings{connectTimeoutMS=20000, readTimeoutMS=20000, keepAlive=true, receiveBufferSize=0, sendBufferSize=0}}
[11/15/23, 18:47:31:703 UTC] 00000064 org.mongodb.driver.cluster                                   I Cluster description not yet available. Waiting for 30000 ms before timing out
[11/15/23, 18:47:31:761 UTC] 0000007a org.mongodb.driver.connection                                I Opened connection [connectionId{localValue:1}] to acmeair-flight-db:27017
[11/15/23, 18:47:31:769 UTC] 0000007a org.mongodb.driver.cluster                                   I Monitor thread successfully connected to server with description ServerDescription{address=acmeair-flight-db:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[7, 0, 3]}, minWireVersion=0, maxWireVersion=21, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=5403786}
[11/15/23, 18:47:31:787 UTC] 00000064 org.mongodb.driver.connection                                I Opened connection [connectionId{localValue:2}] to acmeair-flight-db:27017
[11/15/23, 18:49:31:375 UTC] 00000056 org.mongodb.driver.connection                                I Opened connection [connectionId{localValue:3}] to acmeair-flight-db:27017
[11/15/23, 18:49:37:862 UTC] 0000005a .apache.cxf.cxf.core.3.2:1.0.84.cl231220231115-1101(id=209)] I Setting the server's publish address to be /
[11/15/23, 18:49:38:031 UTC] 0000005a com.ibm.ws.webcontainer.servlet                              I SRVE0242I: [acmeair-flightservice] [/flight] [com.acmeair.web.FlightServiceApp]: Initialization successful.
[11/15/23, 18:49:38:359 UTC] 00000044 io.jaegertracing.Configuration                               I Initialized tracer=JaegerTracer(version=Java-1.3.1, serviceName=acmeair-flightservice, reporter=RemoteReporter(sender=HttpSender(), closeEnqueueTimeout=1000), sampler=ConstSampler(decision=true, tags={sampler.type=const, sampler.param=true}), tags={hostname=acmeair-flightservice-6668884d44-9sq8w, jaeger.version=Java-1.3.1, ip=10.254.15.129}, zipkinSharedRpcSpan=false, expandExceptionLogs=false, useTraceId128Bit=false)
[11/15/23, 18:49:38:359 UTC] 00000044 m.ibm.ws.microprofile.opentracing.jaeger.JaegerTracerFactory I CWMOT1001I: A JaegerTracer instance was created for the acmeair-flightservice application.  Tracing information is sent to http://jaeger-all-in-one-inmemory-collector:14268/api/traces.
[11/15/23, 18:49:38:485 UTC] 0000005a org.mongodb.driver.connection                                I Opened connection [connectionId{localValue:4}] to acmeair-flight-db:27017
[11/15/23, 18:49:45:125 UTC] 00000046 com.acmeair.loader.FlightLoader                              I Start loading flights
[11/15/23, 18:49:47:488 UTC] 00000046 com.acmeair.loader.FlightLoader                              I Finished loading in 2.363 seconds
[11/15/23, 18:50:56:332 UTC] 00000050 org.mongodb.driver.connection                                I Opened connection [connectionId{localValue:5}] to acmeair-flight-db:27017
[11/15/23, 18:50:58:707 UTC] 00000053 org.mongodb.driver.connection                                I Opened connection [connectionId{localValue:6}] to acmeair-flight-db:27017
[11/15/23, 19:09:55:740 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/15/23, 20:03:05:422 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/15/23, 20:54:46:468 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/15/23, 21:45:00:891 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/15/23, 23:25:58:256 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/16/23, 0:16:23:916 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/16/23, 3:40:52:337 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/16/23, 4:32:28:875 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Connection reset
[11/16/23, 5:22:45:476 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/16/23, 6:13:59:453 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Failed to connect to jaeger-all-in-one-inmemory-collector/172.30.60.9:4317
[11/16/23, 7:05:34:211 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Connection reset
[11/16/23, 7:56:14:463 UTC] 00000087 io.opentelemetry.exporter.internal.grpc.GrpcExporter         W Failed to export spans. Server responded with gRPC status code 2. Error message: Connection reset

Additional context
The 5 Acmeair microservices don't show any memory increase and continue to run successfully throughout the 24 hour app load even though the jaeger pod restarts when the 8Gi memory limit is encountered.

I'd run previous AcmeAir app load on previous builds in October using mpTelemetry-1.1 with microProfile-6.1. While jaeger pod showed some memory growth, it wasn't nearly as significant as this test using mpTelemetry-1.1 with microProfile-6.1 and webProfile-8.0.

hanczaryk commented 1 year ago

After the 24 hour run with MP 4.1 and mpTelmetry-1.1 completed, I started a new run with MP 6.1. Using this daily openliberty build from 11/15, the jaeger pod still shows significant growth but MP 4.1 appears to be about double the growth of this recent MP 6.1 run. Here is a OCP console metrics screenshot from the recent MP 6.1 run.

Azquelt commented 1 year ago

How are you deploying Jaeger? In the default configuration, exported spans are stored in memory and in that case we would expect large continual memory growth. I'm not sure if these are the right docs for the Jaeger operator you're using, but if so then we need to use the production deployment type which stores the span data in elasticsearch. (While the default allInOne deployment type is generally used for testing, it's not suitable for a long-running test like this which will generate lots of trace data.)

It is, however, unexpected that you're getting more growth on MP 4.1 vs. MP 6.1. I would suspect that either we're generating more spans, or they have more data in them. I assume the throughput in both configurations is roughly the same? If so then to look further into this, we would need:

a count of the number of spans produced in a given time period for both systems
a sample of the data from Jaeger at the end of that time period (IIRC you can export traces from the Jaeger UI in a JSON format)

Lastly, in a heavily loaded production environment, it would be common to use a sampler to export spans for only a small percentage of requests, rather than for every request. For example, to configure tracing of 1% of requests, you would set the following environment variables:

OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.01

This would result in a much lower rate of span production (and consequently lower memory growth if Jaeger is configured to store spans in memory).

hanczaryk commented 1 year ago

In the issue description, I showed the operations I used to deploy jaeger as detailed in the AcmeAir documentation. I can easily attempt another run using the sampler instructions you pasted above.

I will look at the jaeger production deployment type but I'm currently unaware of how to set that up as I was just following AcmeAir documented instructions. Based on your statements, it sounds like SVT's stress runs should be setup in this manner going forward.

OpenLiberty / open-liberty