apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0

The Java Agent may cause frequent GC or OOM in extreme scenarios #6703

Closed nisiyong closed 3 years ago

nisiyong commented 3 years ago

Please answer these questions before submitting your issue.


Requirement or improvement

The SkyWalking Java Agent is a powerful instrumentation tool; it makes building our tracing system much easier.

We have been using SkyWalking with our Java applications in production for several months, and it runs fine most of the time. Recently, we found that some applications suffer from frequent GC and some hit OOM. We dumped the heap and used Memory Analyzer (MAT) to find that there are a lot of TraceSegmentRef objects in the heap. Here are two cases:

Case 1: Frequent GC

In this case, the app has 1000 Dubbo handler threads, and each handler performs a lot of RPCs and DB operations.

image

image

Case 2: OOM

In this case, the app has 20 RocketMQ consumer threads; each consumer thread performs some RPCs and DB operations.

image

image


On the application side, I think there are 3 reasons:

  1. A sudden spike in throughput keeps all threads busy handling requests.
  2. Each request involves a lot of RPCs and DB operations, which creates a lot of spans.
  3. Requests are handled slowly; some take 10s or even more.

On the agent side, I have read the source code and understand some of the design:

I wonder why the segment is put in the ring buffer; could we put the span instead? I am not familiar with the purpose of the Segment design. I know we should also improve our application, but in some scenarios people can tolerate slow request handling. So what can the SkyWalking Java Agent do in such extreme scenarios? Application availability is very important, and none of us want the APM instrumentation to occupy a lot of memory.

liqiangz commented 3 years ago

Maybe you can use agent.sample_n_per_3_secs to limit it.
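
For reference, both sampling-related settings live in the agent's config/agent.config (they can also be overridden via system properties or environment variables). The values below are only illustrative, not recommendations:

# Keep at most N sampled trace segments per 3 seconds; a negative or zero value samples everything.
agent.sample_n_per_3_secs=100
# Upper bound on the number of spans kept in a single segment (300 by default).
agent.span_limit_per_segment=300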

wu-sheng commented 3 years ago

Let's talk about this from a different perspective.

So what can the SkyWalking Java Agent do in such extreme scenarios? Application availability is very important, and none of us want the APM instrumentation to occupy a lot of memory.

The SkyWalking agent does not keep an unlimited number of spans per segment; you can find the configuration in the agent, and it is 300 spans per segment by default.

I wonder why the segment is put in the ring buffer; could we put the span instead? I am not familiar with the purpose of the Segment design.

If you don't have this concept, you can't tell the links (metrics) between the entry span and the exit spans, for example that one HTTP request may need N database operations. Also, your code usually caches much more data than we do for the tracing context. Unless, as in your case, you have enabled parameter collection yourself.
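
To make that relationship concrete, here is a rough sketch (a hypothetical handler, not actual plugin code; the ContextManager methods come from apm-agent-core, and exact signatures can vary by version) of how one entry span and several exit spans end up in the same segment, with the segment's refs linking back to the caller's segment:

import org.apache.skywalking.apm.agent.core.context.ContextCarrier;
import org.apache.skywalking.apm.agent.core.context.ContextManager;
import org.apache.skywalking.apm.agent.core.context.trace.AbstractSpan;

class SegmentSketch {
    void handleRequest(ContextCarrier incomingCarrier) {
        // Entry span: starts (or joins) this thread's segment and records the caller's
        // segment as a TraceSegmentRef extracted from the carrier.
        AbstractSpan entry = ContextManager.createEntrySpan("/orders", incomingCarrier);
        try {
            for (int i = 0; i < 3; i++) {
                // Each DB/RPC call becomes an exit span inside the same segment.
                AbstractSpan exit = ContextManager.createExitSpan("Mysql/JDBC/Query", "db-host:3306");
                try {
                    // ... execute the query ...
                } finally {
                    ContextManager.stopSpan(); // close the exit span
                }
            }
        } finally {
            ContextManager.stopSpan(); // close the entry span; the segment finishes here
        }
    }
}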

wu-sheng commented 3 years ago

image

I want to add a warning about your analysis: this graph shows you used 409M of memory for 8M SegmentRef instances. Unless you can confirm that you really had that many RPCs or MQ messages being consumed at that point, you are facing a plugin bug and a memory leak, rather than any design or code issue.

nisiyong commented 3 years ago

Maybe you can use agent.sample_n_per_3_secs to limit it.

Thank you, I have already set it, and have also set agent.span_limit_per_segment to less than 300.

nisiyong commented 3 years ago

Also, your code usually caches much more data than we do for the tracing context.

But the heap dump shows that most of the cached data is SkyWalking segments.

Unless, as in your case, you have enabled parameter collection yourself.

I do not understand what you mean. Could you give me more information about that?

Unless you can confirm that you really had that many RPCs or MQ messages being consumed at that point, you are facing a plugin bug and a memory leak, rather than any design or code issue.

Thank you for pointing that out; I will do more analysis and check the plugins.

wu-sheng commented 3 years ago

I do not understand what you mean. Could you give me more information about that?

Collect HTTP parameters, SQL parameters, etc.
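
For context, those collections are opt-in agent config switches; a few examples are shown below (assuming an 8.x agent config; exact keys vary by plugin and version):

plugin.springmvc.collect_http_params=true
plugin.tomcat.collect_http_params=true
plugin.mysql.trace_sql_parameters=true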

wu-sheng commented 3 years ago

But the heap dump shows that most of the cached data is SkyWalking segments.

Don't think of it that way. Yes, there are a lot of objects, but is that what you would expect?

WildWolfBang commented 3 years ago

I want to add a warning about your analysis: this graph shows you used 409M of memory for 8M SegmentRef instances.

image image

@wu-sheng Hi, I noticed that the dominator tree graph shows a private List<TraceSegmentRef> refs containing about 511360 entries in one ThreadLocal, so I expanded the "Class Name" column down to java.lang.String / char[] to confirm the real memory usage, at which point "Shallow Heap" equals "Retained Heap". The list actually used about 356M in a single ThreadLocal. The situation is the same for the other Dubbo threads when sorted by percentage. The docs say private List<TraceSegmentRef> refs is used to link multiple parent trace segments; a segment facing hundreds of thousands of parents is unusual unless a loop occurs.

public class TraceSegment {
    private String traceSegmentId;
    /**
     * The refs of parent trace segments, except the primary one. For most RPC call, {@link #refs} contains only one
     * element, but if this segment is a start span of batch process, the segment faces multi parents, at this moment,
     * we use this {@code #refs} to link them.
     * <p>
     * This field will not be serialized. Keeping this field is only for quick accessing.
     */
    private List<TraceSegmentRef> refs;
...
}

https://dzone.com/articles/eclipse-mat-shallow-heap-retained-heap

zifeihan commented 3 years ago

I think the reason for this is that an exception occurred in a plugin, which caused the ThreadLocal not to be cleaned up, resulting in org.apache.skywalking.apm.agent.core.context.trace.TraceSegmentRef instances constantly being added.
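
As a simplified sketch of that failure mode (hypothetical names, not the agent's real classes): the tracing context is held in a ThreadLocal, so if an instrumentation hook throws before the cleanup step runs, the refs list survives on the pooled worker thread and keeps growing with every later request:

import java.util.ArrayList;
import java.util.List;

// Illustrative only: models a ThreadLocal-held context whose cleanup can be skipped.
final class ThreadLocalLeakSketch {
    private static final ThreadLocal<List<String>> SEGMENT_REFS =
            ThreadLocal.withInitial(ArrayList::new);

    static void handleOneRequest(String parentSegmentRef) {
        // "extract": remember a ref to the caller's segment on this worker thread.
        SEGMENT_REFS.get().add(parentSegmentRef);

        doBusinessWork(); // if an instrumentation hook throws here (e.g. NoClassDefFoundError)...

        // ...this cleanup is never reached, so on a pooled Dubbo/RocketMQ thread
        // the refs list is never cleared and simply accumulates.
        SEGMENT_REFS.remove();
    }

    static void doBusinessWork() { /* RPCs, DB calls, ... */ }
}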

nisiyong commented 3 years ago

I think the reason for this is that an exception occurred in a plugin, which caused the ThreadLocal not to be cleaned up, resulting in org.apache.skywalking.apm.agent.core.context.trace.TraceSegmentRef instances constantly being added.

Thanks, I will take a look. But as you can see in the 2 cases above, the 1st one is about Dubbo threads and the 2nd one is about RocketMQ consumer threads. I find it hard to believe that both plugins have the same problem. Anyway, I will do more analysis.

zifeihan commented 3 years ago

Thanks, I will take a look. But as you can see in the 2 cases above, the 1st one is about Dubbo threads and the 2nd one is about RocketMQ consumer threads. I find it hard to believe that both plugins have the same problem. Anyway, I will do more analysis.

Thank you, I will try to analyze and solve the problem.

nisiyong commented 3 years ago

If this problem is about plugins, let me provide some information here: we use 44 official 8.4.0 plugins plus apm-hbase-1.x-2.x-plugin-8.4.0.jar.

We cherry-picked the HBase commit from the current master. Related: #6577

These are all the plugins we use:

apm-dubbo-2.7.x-plugin-8.4.0.jar
apm-dubbo-plugin-8.4.0.jar
apm-elasticsearch-5.x-plugin-8.4.0.jar
apm-elasticsearch-6.x-plugin-8.4.0.jar
apm-hbase-1.x-2.x-plugin-8.4.0.jar
apm-httpasyncclient-4.x-plugin-8.4.0.jar
apm-httpclient-3.x-plugin-8.4.0.jar
apm-httpClient-4.x-plugin-8.4.0.jar
apm-jdbc-commons-8.4.0.jar
apm-jedis-2.x-plugin-8.4.0.jar
apm-jetty-client-9.0-plugin-8.4.0.jar
apm-jetty-client-9.x-plugin-8.4.0.jar
apm-jetty-server-9.x-plugin-8.4.0.jar
apm-kafka-commons-8.4.0.jar
apm-kafka-plugin-8.4.0.jar
apm-lettuce-5.x-plugin-8.4.0.jar
apm-mongodb-3.x-plugin-8.4.0.jar
apm-mongodb-4.x-plugin-8.4.0.jar
apm-mysql-5.x-plugin-8.4.0.jar
apm-mysql-6.x-plugin-8.4.0.jar
apm-mysql-8.x-plugin-8.4.0.jar
apm-mysql-commons-8.4.0.jar
apm-okhttp-3.x-plugin-8.4.0.jar
apm-redisson-3.x-plugin-8.4.0.jar
apm-resttemplate-4.3.x-plugin-8.4.0.jar
apm-rocketmq-4.x-plugin-8.4.0.jar
apm-sharding-sphere-4.1.0-plugin-8.4.0.jar
apm-spring-async-annotation-plugin-8.4.0.jar
apm-spring-cloud-feign-1.x-plugin-8.4.0.jar
apm-spring-cloud-feign-2.x-plugin-8.4.0.jar
apm-spring-concurrent-util-4.x-plugin-8.4.0.jar
apm-spring-core-patch-8.4.0.jar
apm-spring-kafka-1.x-plugin-8.4.0.jar
apm-spring-kafka-2.x-plugin-8.4.0.jar
apm-springmvc-annotation-3.x-plugin-8.4.0.jar
apm-springmvc-annotation-4.x-plugin-8.4.0.jar
apm-springmvc-annotation-5.x-plugin-8.4.0.jar
apm-springmvc-annotation-commons-8.4.0.jar
apm-spring-scheduled-annotation-plugin-8.4.0.jar
apm-vertx-core-3.x-plugin-8.4.0.jar
dubbo-2.7.x-conflict-patch-8.4.0.jar
dubbo-conflict-patch-8.4.0.jar
spring-commons-8.4.0.jar
spring-webflux-5.x-webclient-plugin-8.4.0.jar
tomcat-7.x-8.x-plugin-8.4.0.jar

nisiyong commented 3 years ago

@wu-sheng I found that no production source code reads TraceSegmentRef; only the test code reads it. The agent adds a TraceSegmentRef in the JVM for each request, so do we still need TraceSegmentRef now? Is it only for the test code?

image

wu-sheng commented 3 years ago

@wu-sheng I found that no production source code reads TraceSegmentRef; only the test code reads it. The agent adds a TraceSegmentRef in the JVM for each request, so do we still need TraceSegmentRef now? Is it only for the test code?

image

Why are we beginning to discuss source code here?

nisiyong commented 3 years ago

Take it easy. Do you mean I should open a new issue to ask another question?

wu-sheng commented 3 years ago

Take it easy. Do you mean I should open a new issue to ask another question?

Yes. If you want to dig deeper purely into the code, such as optimizing or polishing it for clarity or efficiency, let's do that on the code only, unless you find out that this code relates to this memory leak.

nisiyong commented 3 years ago

Okay, got it.

nisiyong commented 3 years ago

Thanks to @zifeihan for the help. We found ERROR entries in some agent logs, and this ERROR is due to the missing apm-httpclient-commons-8.4.0.jar in the agent plugins folder.

ERROR 2021-04-02 18:16:42:833 ConsumeMessageThread_4 InstMethodsInter : class[class org.apache.http.impl.client.InternalHttpClient] after method[doExecute] intercept failure
java.lang.NoClassDefFoundError: org/apache/skywalking/apm/plugin/httpclient/HttpClientPluginConfig$Plugin$HttpClient
        at org.apache.skywalking.apm.plugin.httpClient.v4.HttpClientExecuteInterceptor.afterMethod(HttpClientExecuteInterceptor.java:98)
        at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:97)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:136)
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:70)
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:63)

We build the agent ourselves with a Maven profile, and the following steps led to this ERROR:

We updated the version to 8.4.0.1 and the build failed; after adding the apm-httpclient-commons module, it built successfully. We will deploy the new agent this week and dump the business application heap again. Let's see how it performs with the new agent. I will record the final result here in a few days.

wu-sheng commented 3 years ago

Thanks for the update.

wu-sheng commented 3 years ago

@nisiyong This issue is going to be closed once #6715 gets merged; it already has 2 approvals.

To other people reading this issue: that PR is a precautionary measure, rather than a real bug fix or a resolution of this particular issue.

nisiyong commented 3 years ago

@nisiyong This issue is going to be closed once #6715 gets merged; it already has 2 approvals.

That is OK; if there is another problem, I will open a new issue and link it to this one.