hanczaryk opened 1 year ago
After the 24 hour run with MP 4.1 and mpTelemetry-1.1 completed, I started a new run with MP 6.1. Using this daily openliberty build from 11/15, the jaeger pod still shows significant growth, but the MP 4.1 run appears to show about double the growth of this recent MP 6.1 run. Here is an OCP console metrics screenshot from the recent MP 6.1 run.
How are you deploying Jaeger? In the default configuration, exported spans are stored in memory, and in that case we would expect large, continual memory growth. I'm not sure if these are the right docs for the Jaeger operator you're using, but if so, then we need to use the production deployment type, which stores the span data in Elasticsearch. (While the default allInOne deployment type is generally used for testing, it's not suitable for a long-running test like this, which will generate lots of trace data.)
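As a rough sketch, a production-strategy Jaeger instance created through the Jaeger operator might look like the following. The instance name, namespace, and Elasticsearch sizing here are placeholders, and the exact fields available depend on the operator version, so treat this as a starting point rather than a known-good configuration:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production        # placeholder name
  namespace: acmeair
spec:
  strategy: production           # spans persisted to Elasticsearch, not held in memory
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3               # example value; size for your cluster
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
```

With this strategy the collector and query components run separately, so collector memory no longer grows with the total number of spans retained.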
It is, however, unexpected that you're getting more growth on MP 4.1 vs. MP 6.1. I would suspect that either we're generating more spans, or they have more data in them. I assume the throughput in both configurations is roughly the same? If so then to look further into this, we would need:
Lastly, in a heavily loaded production environment, it would be common to use a sampler to export spans for only a small percentage of requests, rather than for every request. For example, to configure tracing of 1% of requests, you would set the following environment variables:
```
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.01
```
This would result in a much lower rate of span production (and consequently lower memory growth if Jaeger is configured to store spans in memory).
In the issue description, I showed the operations I used to deploy Jaeger, as detailed in the AcmeAir documentation. I can easily attempt another run using the sampler instructions you pasted above.
I will look at the Jaeger production deployment type, but I'm currently unaware of how to set that up, as I was just following the AcmeAir documented instructions. Based on your statements, it sounds like SVT's stress runs should be set up in this manner going forward.
SVT: Jaeger pod showing significant memory increase when executing app load against AcmeAir using mpTelemetry-1.1, microProfile-4.1 and webProfile-8.0 features.
Describe the bug
SVT is testing a 24 hour app load stress run against AcmeAir using mpTelemetry-1.1, microProfile-4.1 and webProfile-8.0 features.
I'm testing with a recent daily openliberty build (wlp-1.0.84.cl231220231115-1101) in which mpTelemetry-1.1 now tolerates older MP and EE versions.
The jaeger pod memory grows so steadily that it exhausts all memory available on the node. To combat this, I edit the jaeger deployment to specify an 8Gi memory limit so the entire node doesn't go down.
With this 8Gi memory limit set, the jaeger pod exceeds the limit in just under an hour during the AcmeAir app load.
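For reference, setting the limit can be done with a single `oc` command rather than editing the deployment YAML by hand; the deployment name below assumes the `jaeger-all-in-one-inmemory` instance used in this test:

```shell
# Cap the jaeger pod at 8Gi so it is OOM-killed instead of exhausting the node
oc -n acmeair set resources deployment jaeger-all-in-one-inmemory \
  --limits=memory=8Gi
```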
Here is a screenshot from the OCP console showing the Metrics for my jaeger pod with a significant memory growth over a 30 minute timeframe.
Steps to Reproduce
On an OpenShift cluster:
The following are instructions that work using MicroProfile with mpTelemetry.
Deploy OpenTelemetry Collector Operator
To deploy the OpenTelemetry Collector Operator (optional), use the OpenShift console to install the operator with the defaults for 'Community OpenTelemetry Operator'. The following is a screenshot.
After the operator is installed, create instances for OpenTelemetry Instrumentation and OpenTelemetry Collector.
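A minimal OpenTelemetryCollector instance might look like the sketch below. This assumes the community operator's `opentelemetry.io/v1alpha1` API; the instance name and the Jaeger endpoint are placeholders for whatever your deployment exposes, and the Instrumentation instance is created the same way from its own CR:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel                              # placeholder name
  namespace: acmeair
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      otlp:
        endpoint: jaeger-collector:4317   # placeholder; point at your Jaeger service
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp]
```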
Deploy AcmeAir microservice applications running mpTelemetry-1.1 on Liberty
The 5 AcmeAir microservice repos are located at https://github.com/blueperf/ (choosing the microprofile-4.1 branch)
Follow the instructions from README.md at https://github.com/blueperf/acmeair-mainservice-java to install AcmeAir on OpenShift. To enable verbose GC for this stress test, you can edit the jvm.options for each of the 5 microservices to include your desired values, such as
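For example, assuming the Liberty images run on an OpenJ9 JVM (the log path and rotation values below are illustrative), the jvm.options entries could look like:

```
-verbose:gc
-Xverbosegclog:logs/verbosegc.%seq.log,5,10000
```

On a HotSpot JVM the equivalent would use `-Xlog:gc` instead.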
Deploy Jaeger
To deploy the Community Jaeger Operator, use the OpenShift console to install the operator with the defaults. The following is a screenshot.
After the Community Jaeger Operator is installed, use the OpenShift console to create a Jaeger instance. Before creating the Jaeger instance, switch to the 'acmeair' namespace and use 'jaeger-all-in-one-inmemory' for the name.
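The resulting instance is equivalent to a CR along these lines (a sketch; with no strategy specified, the operator defaults to allInOne with in-memory storage, which is the configuration that exhibits the memory growth in this issue):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-all-in-one-inmemory
  namespace: acmeair
```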
Run the applications for a long period of time
Use jmeter to drive AcmeAir application load for 24 hours. While the application load is running, access the Jaeger UI using the jaeger route shown in the OpenShift console. Ensure spans are received on Jaeger for the services.
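A non-GUI jmeter invocation for a 24 hour run might look like the following; the test plan name, route host, and user count are placeholders, since the exact AcmeAir driver script and parameters vary by setup:

```shell
# 24 hours = 86400 seconds; HOST should be the AcmeAir route from the OpenShift console
jmeter -n -t acmeair-test-plan.jmx \
  -JHOST=acmeair-mainservice-acmeair.apps.example.com -JPORT=80 \
  -JUSERS=500 -JDURATION=86400 -j jmeter.log
```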
The following are instruction modifications that work using MicroProfile 4.1 with mpTelemetry.
For each of the 5 microservice Dockerfiles, I set the following environment variables:
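The actual values are not shown here, but illustrative Dockerfile entries for pointing mpTelemetry at the collector would look something like this (the service name and endpoint are hypothetical; `OTEL_SDK_DISABLED=false` is needed because MP Telemetry is disabled by default):

```dockerfile
# Illustrative only -- not the exact values used in this test
ENV OTEL_SDK_DISABLED=false
ENV OTEL_SERVICE_NAME=acmeair-flightservice
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```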
Expected behavior
I expect that the jaeger pod memory consumption will at some point plateau and stabilize.
Diagnostic information:
OpenLiberty Version: daily openliberty-all build (wlp-1.0.84.cl231220231115-1101)
Affected feature(s) : mpTelemetry-1.1
Java Version: Acmeair microservices java versions in use are
server.xml configuration (WITHOUT sensitive information like passwords): The following is the server.xml for the acmeair-flightservice. Each of the 5 acmeair microservices is similar, with the same features enabled.
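The attached file is not reproduced here, but based on the features named in this issue, the featureManager section would be along these lines (an illustrative sketch, not the actual file):

```xml
<server>
  <featureManager>
    <feature>webProfile-8.0</feature>
    <feature>microProfile-4.1</feature>
    <feature>mpTelemetry-1.1</feature>
  </featureManager>
</server>
```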
If it would be useful, upload the messages.log file found in
$WLP_OUTPUT_DIR/messages.log
The following snippet from acmeair-flightservice's messages.log shows the application starting successfully but eventually logging messages that fail to export spans when the jaeger pod restarts after running out of memory.

Additional context
The 5 Acmeair microservices don't show any memory increase and continue to run successfully throughout the 24 hour app load even though the jaeger pod restarts when the 8Gi memory limit is encountered.
I ran previous AcmeAir app load against earlier builds in October using mpTelemetry-1.1 with microProfile-6.1. While the jaeger pod showed some memory growth, it wasn't nearly as significant as in this test using mpTelemetry-1.1 with microProfile-4.1 and webProfile-8.0.