Closed nisiyong closed 3 years ago
Maybe you can use agent.sample_n_per_3_secs to limit it.
Let's talk about this from a different perspective.
What can the SkyWalking Java Agent do in such extreme scenarios? Application availability is very important, and none of us want the APM instrumentation to occupy a lot of memory.
The SkyWalking agent provides a limited number of spans per segment; you can find the configuration in the agent, and it is 300 spans per segment by default.
I wonder why the segment is put in the ring buffer; could we put spans instead? I am not familiar with the Segment design purpose.
If you don't have this concept, you can't tell the links (metrics) between the entry span and exit spans; for example, one day we may need to see N database operations per HTTP request. Also, your code actually caches much more data than we do for tracing the context. The exception is a case like yours, where you collect the parameters yourself.
I want to add a warning about your analysis: this graph shows you used 409 MB of memory for 8M SegmentRef instances. Unless you can confirm that you have that many RPCs or MQ messages being consumed at this point, you are facing a plugin bug and a memory leak, rather than any design or code issue.
Maybe you can use agent.sample_n_per_3_secs to limit it.
Thank you, I have already set it, and I have also set agent.span_limit_per_segment to less than 300.
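For reference, both settings mentioned here live in the agent's agent.config file; a minimal sketch is below (the values are illustrative, not recommendations):

```properties
# agent/config/agent.config
# Keep at most 1000 sampled segments per 3 seconds; a negative or zero value
# means full sampling.
agent.sample_n_per_3_secs=1000
# Cap the number of spans kept in a single segment below the 300 default.
agent.span_limit_per_segment=150
```

Lowering these trades trace completeness for a smaller per-thread memory footprint, which is the trade-off being discussed in this thread.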
Also, your code actually caches much more data than we do for tracing the context.
But the heap dump shows that most of the cached data is SkyWalking Segments.
The exception is a case like yours, where you collect the parameters yourself.
I do not understand what you mean. Could you give me more information about that?
Unless you can confirm that you have that many RPCs or MQ messages being consumed at this point, you are facing a plugin bug and a memory leak, rather than any design or code issue.
Thank you for your point. I will do more analysis and check the plugins.
I do not understand what you mean. Could you give me more information about that?
Collect HTTP parameters, SQL parameters, etc.
But the heap dump shows that most of the cached data is SkyWalking Segments.
Don't think of it that way. Yes, there are many objects, but think about whether that is what should be expected.
I want to add a warning about your analysis: this graph shows you used 409 MB of memory for 8M SegmentRef instances.
@wu-sheng Hi, I have noticed that the dominator_tree graph shows that private List<TraceSegmentRef> refs contains about 511,360 entries in one ThreadLocal, so I expanded the Class Name column down to java.lang.String or char[] to confirm the real memory usage; at that level, Shallow Heap equals Retained Heap. The list actually used about 356 MB in one ThreadLocal. The situation is the same for the other Dubbo threads, sorted by percentage.
The docs say private List<TraceSegmentRef> refs is used to link multiple parent trace segments; a segment facing hundreds of thousands of parents is unusual unless a loop occurs.
public class TraceSegment {
private String traceSegmentId;
/**
* The refs of parent trace segments, except the primary one. For most RPC call, {@link #refs} contains only one
* element, but if this segment is a start span of batch process, the segment faces multi parents, at this moment,
* we use this {@code #refs} to link them.
* <p>
* This field will not be serialized. Keeping this field is only for quick accessing.
*/
private List<TraceSegmentRef> refs;
...
}
https://dzone.com/articles/eclipse-mat-shallow-heap-retained-heap
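To illustrate the "multi parents" semantics from the Javadoc above, here is a toy sketch (the class and method names are mine, not SkyWalking's) of how a batch consumer's segment legitimately accumulates one ref per consumed message. The point is that a huge refs list is only expected when a comparable number of parent messages was actually consumed:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the batch-process case: each consumed message carries a reference
// to its producer's segment, and all of them are linked into the single segment
// of the batch-processing span.
public class BatchRefsSketch {

    private final List<String> refs = new ArrayList<>();

    public void onMessage(String parentSegmentId) {
        refs.add(parentSegmentId); // one TraceSegmentRef per parent segment
    }

    public int refCount() {
        return refs.size();
    }

    public static void main(String[] args) {
        BatchRefsSketch consumerSegment = new BatchRefsSketch();
        for (int i = 0; i < 500; i++) {
            consumerSegment.onMessage("producer-segment-" + i);
        }
        System.out.println(consumerSegment.refCount()); // prints 500: one per message
    }
}
```

By this model, 511,360 refs on one segment would require 511,360 parent messages on a single request, which supports the suspicion of a loop or a leak rather than normal operation.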
I think the reason for this is that an exception occurred in a plugin, which caused the ThreadLocal not to be cleaned up, resulting in org.apache.skywalking.apm.agent.core.context.trace.TraceSegmentRef instances constantly being added.
Thanks, I will take a look. But you can see the 2 cases above: the 1st one is about Dubbo threads, the 2nd one is about RocketMQ consume threads. I do not believe both plugins have the same problem. Anyway, I will do more analysis.
Thank you, I will try to analyze and solve the problem.
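The suspected failure mode, an interceptor throwing before the context cleanup runs so that a pooled worker thread's ThreadLocal keeps growing, can be sketched as follows (all names are hypothetical, not SkyWalking internals):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the suspected leak: the per-thread tracing context lives in a
// ThreadLocal, and if a plugin throws before the cleanup step runs, a pooled
// worker thread accumulates stale segment refs on every request it handles.
public class ThreadLocalLeakSketch {

    private static final ThreadLocal<List<String>> SEGMENT_REFS =
            ThreadLocal.withInitial(ArrayList::new);

    public static void handleRequest(boolean pluginThrows) {
        // Each incoming request records a ref to its parent segment.
        SEGMENT_REFS.get().add("TraceSegmentRef-" + SEGMENT_REFS.get().size());
        try {
            if (pluginThrows) {
                // e.g. a NoClassDefFoundError from a missing plugin dependency
                throw new NoClassDefFoundError("HttpClientPluginConfig");
            }
            SEGMENT_REFS.get().clear(); // normal path: context cleaned at request end
        } catch (Error ignored) {
            // broken path: the cleanup above is skipped,
            // so the refs stay in the ThreadLocal
        }
    }

    public static int refsOnThisThread() {
        return SEGMENT_REFS.get().size();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            handleRequest(true); // every request fails inside the plugin
        }
        System.out.println(refsOnThisThread()); // prints 1000: nothing was released
    }
}
```

Since Dubbo handlers and RocketMQ consumers both run on long-lived thread pools, this pattern would explain the same symptom appearing in both cases even without both plugins being buggy: any plugin that throws on those threads could leave the shared context uncleaned.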
If this problem is about plugins, let me provide some information here: we use 44 official 8.4.0 plugins plus apm-hbase-1.x-2.x-plugin-8.4.0.jar.
We cherry-picked the HBase commit from the current master. Related: #6577
These are all the plugins we use:
apm-dubbo-2.7.x-plugin-8.4.0.jar
apm-dubbo-plugin-8.4.0.jar
apm-elasticsearch-5.x-plugin-8.4.0.jar
apm-elasticsearch-6.x-plugin-8.4.0.jar
apm-hbase-1.x-2.x-plugin-8.4.0.jar
apm-httpasyncclient-4.x-plugin-8.4.0.jar
apm-httpclient-3.x-plugin-8.4.0.jar
apm-httpClient-4.x-plugin-8.4.0.jar
apm-jdbc-commons-8.4.0.jar
apm-jedis-2.x-plugin-8.4.0.jar
apm-jetty-client-9.0-plugin-8.4.0.jar
apm-jetty-client-9.x-plugin-8.4.0.jar
apm-jetty-server-9.x-plugin-8.4.0.jar
apm-kafka-commons-8.4.0.jar
apm-kafka-plugin-8.4.0.jar
apm-lettuce-5.x-plugin-8.4.0.jar
apm-mongodb-3.x-plugin-8.4.0.jar
apm-mongodb-4.x-plugin-8.4.0.jar
apm-mysql-5.x-plugin-8.4.0.jar
apm-mysql-6.x-plugin-8.4.0.jar
apm-mysql-8.x-plugin-8.4.0.jar
apm-mysql-commons-8.4.0.jar
apm-okhttp-3.x-plugin-8.4.0.jar
apm-redisson-3.x-plugin-8.4.0.jar
apm-resttemplate-4.3.x-plugin-8.4.0.jar
apm-rocketmq-4.x-plugin-8.4.0.jar
apm-sharding-sphere-4.1.0-plugin-8.4.0.jar
apm-spring-async-annotation-plugin-8.4.0.jar
apm-spring-cloud-feign-1.x-plugin-8.4.0.jar
apm-spring-cloud-feign-2.x-plugin-8.4.0.jar
apm-spring-concurrent-util-4.x-plugin-8.4.0.jar
apm-spring-core-patch-8.4.0.jar
apm-spring-kafka-1.x-plugin-8.4.0.jar
apm-spring-kafka-2.x-plugin-8.4.0.jar
apm-springmvc-annotation-3.x-plugin-8.4.0.jar
apm-springmvc-annotation-4.x-plugin-8.4.0.jar
apm-springmvc-annotation-5.x-plugin-8.4.0.jar
apm-springmvc-annotation-commons-8.4.0.jar
apm-spring-scheduled-annotation-plugin-8.4.0.jar
apm-vertx-core-3.x-plugin-8.4.0.jar
dubbo-2.7.x-conflict-patch-8.4.0.jar
dubbo-conflict-patch-8.4.0.jar
spring-commons-8.4.0.jar
spring-webflux-5.x-webclient-plugin-8.4.0.jar
tomcat-7.x-8.x-plugin-8.4.0.jar
@wu-sheng I found that no source code reads TraceSegmentRef; only the test code reads it. The agent adds a TraceSegmentRef in the JVM for each request. Do we still need TraceSegmentRef now, only for test code?
Why are we beginning to discuss source code here?
Take it easy. Do you mean I should open a new issue to ask another question?
Yes. If you want to dig deeper purely into the code, such as optimizing or polishing it for clarity or efficiency, let's do that on the code only, unless you find that this code relates to this memory leak.
Okay, got it.
Thanks to @zifeihan for the help. We found ERROR entries in some agent logs, and this ERROR is due to the missing apm-httpclient-commons-8.4.0.jar in the agent plugins folder.
ERROR 2021-04-02 18:16:42:833 ConsumeMessageThread_4 InstMethodsInter : class[class org.apache.http.impl.client.InternalHttpClient] after method[doExecute] intercept failure
java.lang.NoClassDefFoundError: org/apache/skywalking/apm/plugin/httpclient/HttpClientPluginConfig$Plugin$HttpClient
at org.apache.skywalking.apm.plugin.httpClient.v4.HttpClientExecuteInterceptor.afterMethod(HttpClientExecuteInterceptor.java:98)
at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:97)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:136)
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:70)
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:63)
We build the agent ourselves using a Maven profile, and the following things caused this ERROR:
1. The apm-httpclient-commons module was missing from the Maven build reactor, because I was not aware that there is a dependency relationship between plugins.
2. Version 8.4.0 built successfully anyway, because Maven pulls the apm-httpclient-commons dependency from the central repo.
3. After updating the version to 8.4.0.1, the build failed; after adding the apm-httpclient-commons module, it built successfully.
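In other words, when building a trimmed agent from a custom Maven profile, the shared commons module has to stay in the reactor alongside the plugins that depend on it. A hypothetical aggregator pom.xml fragment (the module names are inferred from the jar names listed above):

```xml
<!-- Sketch: keep the shared commons module in the reactor so that the
     httpclient plugins' config classes end up on the agent plugin classpath. -->
<modules>
  <module>apm-httpclient-commons</module>
  <module>apm-httpClient-4.x-plugin</module>
  <module>apm-httpasyncclient-4.x-plugin</module>
</modules>
```

For a released version this hides itself, as in point 2 above: the build resolves the missing module from the central repo, so the mistake only surfaces once the version is bumped to one that was never published.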
We will deploy the new agent this week and dump the business application's heap again. Let's see how it performs with the new agent. I will record the final result here in a few days.
Thanks for the update.
@nisiyong This issue is going to be closed once #6715 gets merged, it has 2 approvals already.
To other people reading this issue: that PR is a precautionary measure rather than a real bug fix, and it does not resolve this particular issue.
@nisiyong This issue is going to be closed once #6715 gets merged, it has 2 approvals already.
That is OK. If there is another problem, I will open a new issue and link it to this one.
Please answer these questions before submitting your issue.
Requirement or improvement
SkyWalking Java Agent is a powerful language instrumentation; it lets us build our tracing system more easily.
We have used SkyWalking with our Java applications in production for several months, and it mostly runs fine. Recently, we found some applications experiencing frequent GC and some hitting OOM. We dumped the memory heap and used Memory Analyzer (MAT) to find that there are a lot of TraceSegmentRef objects in the heap. Here are two cases:
Case 1: Frequent GC
In this case, the app has 1000 Dubbo handler threads, and each handler does a lot of RPCs and DB operations.
Case 2: OOM
In this case, the app has 20 RocketMQ consume threads; in a consume thread, it does some RPCs and DB operations.
On the application side, I think there are 3 reasons:
On the agent side, I have read the source code and understand some of the design:
1. A Segment, in the SkyWalking concept, is the object in the RingBuffer on the client side, and SkyWalking has a consumer thread that consumes the RingBuffer data and sends it to the OAP.
2. Before a Segment object goes into the RingBuffer, the agent builds it first. Each request creates some spans, which are put in a stack data structure; the Segment finishes building when the stack is empty, which means the request in the application has finished. This can take some time, and meanwhile the data is kept in the thread-local, so the garbage collector cannot collect it before the request finishes.
I wonder why the segment is put in the ring buffer; could we put spans instead? I am not familiar with the Segment design purpose. And I know we should improve our application at the same time, but in some scenarios people can tolerate it, even handling requests slowly. So what can the SkyWalking Java Agent do in such extreme scenarios? Application availability is very important, and none of us want the APM instrumentation to occupy a lot of memory.
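The span-stack lifecycle described in point 2 can be sketched roughly like this (a toy model with hypothetical names, not the actual agent code):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the segment lifecycle: the spans of one request are stacked,
// and the segment only becomes eligible for the ring buffer (and its
// thread-local reference releasable) once the span stack is empty again.
public class SegmentLifecycleSketch {

    private final Deque<String> activeSpans = new ArrayDeque<>();
    private final List<String> finishedSpans = new ArrayList<>();
    private boolean finished;

    public void startSpan(String operationName) {
        activeSpans.push(operationName);
    }

    public void stopSpan() {
        finishedSpans.add(activeSpans.pop());
        if (activeSpans.isEmpty()) {
            finished = true; // whole request done: segment can be flushed and collected
        }
    }

    public boolean isFinished() {
        return finished;
    }

    public static void main(String[] args) {
        SegmentLifecycleSketch segment = new SegmentLifecycleSketch();
        segment.startSpan("entry:/order");        // HTTP entry span
        segment.startSpan("exit:mysql/query");    // nested DB exit span
        segment.stopSpan();                       // DB call done
        System.out.println(segment.isFinished()); // prints false: request still running
        segment.stopSpan();                       // request done
        System.out.println(segment.isFinished()); // prints true: segment flushable
    }
}
```

This model also shows why slow requests hold memory: every finished span stays referenced by the segment until the last span on the stack closes, so buffering whole segments (rather than individual spans) keeps the entire request's data alive for the request's full duration.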