Closed bolbat closed 9 months ago
If someone would need code to debug this situation I will share my as example Code is very ad hock, just for fast debugging
Add this dependencies to ByteBuddy library
<dependency>
<groupId>net.bytebuddy</groupId>
<artifactId>byte-buddy</artifactId>
</dependency>
<dependency>
<groupId>net.bytebuddy</groupId>
<artifactId>byte-buddy-agent</artifactId>
</dependency>
Instrument related code with something like this Call instrumentForDebug on some early application startup stage
public static void instrumentForDebug() {
ByteBuddyAgent.install();
new ByteBuddy()
.redefine(FastThreadLocal.class)
.visit(Advice.to(FastThreadLocalInstantiationInterceptor.class).on(ElementMatchers.isDefaultConstructor()))
.make()
.load(InternalThreadLocalMap.class.getClassLoader(), ClassReloadingStrategy.fromInstalledAgent())
.getLoaded();
}
public static final class FastThreadLocalInstantiationInterceptor {
public static final Map<String, AtomicInteger> STATS = Maps.mutable.empty();
@OnMethodEnter
public static void intercept() {
final String stackTraceString = ExceptionUtils.getStackTrace(new Throwable());
final AtomicInteger existing = STATS.get(stackTraceString);
if (existing == null) {
STATS.put(stackTraceString, new AtomicInteger(1));
} else {
existing.incrementAndGet();
}
}
}
Prepare some debug information and log it or get by some debug API like I did
private void debugFastThreadLocalInstantiationStats(final StringBuilder sb, final int minInstantiations) {
line(sb.append("FastThreadLocal Instantiation Stats"));
FastThreadLocalInstantiationInterceptor.STATS.forEach((k, v) -> {
final int instantiations = v.get();
if (instantiations >= minInstantiations) {
line(sb.append("Instantiation[").append(instantiations).append("] ")
.append("StackTrace: ").append(k));
line(sb);
}
});
}
private void line(final StringBuilder sb) {
sb.append(System.lineSeparator());
}
And example how OldGen heap looks like with ZGC on Java 21 with this problem
thanks @bolbat
FYI, I think you could the the leak detection of a sync profiler (adding --live to the cmd line options) to detect this or just allocation profiling.
Thanks for the analysis, although by reading this I don't see the problem in the fast thread local, but instead in the high number of combiner executors created: any idea why @vietj ?
Said that, fast thread locals could be "smarter" in resizing the array, if it knows that a combiner is no longer reachable, and could do a better job. This means making it autocloseable and handle thread local dispose somehow and we should have a clean life cycle for it already, just need to use it to enforce thread local disposal
@bolbat And I was very wrong, it seems there's no way to prevent indexes to keep on growing there, so, the fix made by @vietj should be the one to solve AND improve performance in one go (as the original Pr I sent was meant to do, sorry for that :"( )
Confirming that problem was fixed in 4.5.2-SNAPSHOT Heap looks stable, InternalThreadLocalMap map index and size is near 50 elements, WebClient works fine.
Heap OldGen graph for last 2 days from one node:
Version
Vertx: bug since 4.4.x, available on 4.5.1 Netty: version doesn't matter, tested with latest 4.1.106.Final Java: tested on 17 and 21, but could be reproduced on any supported version
Context
FastThreadLocal feature designed to be faster than basic ThreadLocal and internally use InternalThreadLocalMap which internally have:
Size of indexedVariables field depends on a static nextIndex field If someone create a lot of FastThreadLocal instance - this global static index will be increased by one for each instance As result for all InternalThreadLocalMap instances each invocation of setIndexedVariable method will resize indexedVariables array if index is bigger than array size (resizing indexedVariables arrays in all instance because of global static index)
Consequently all FastThreadLocalThread instances (including VertxThread) and FastThreadLocal fields will have continuously increasing indexedVariables arrays in their InternalThreadLocalMap instances till JVM will be restarted
Our case
On our production load one instance serving tens thousands requests per second and doing much more outgoing requests Each instance use 64 physical cores and use 128 threads for Netty event-pool For around 8h uptime each of this 128 threads have own indexedVariables array instance with length 8_000_000+ and use 60MB+ memory 99.999% values in each array instance is empty and filled with InternalThreadLocalMap.UNSET object Usually each thread / array store only 3-15 values from features that use FastThreadLocal in a right way
Heap dump details
What caused this?
After some investigation I've found that there some instances of FastThreadLocal stored as not static fields and could be created a lot of times if parent object also created a lot of times
How to find exactly which code does this?
I've implemented debug feature (with ByteBuddy constructor interceptor) for FastThreadLocal and collected StackTraces stats with information about who creating this instances
It looks like this
- Stats collected for threads who instantiated _FastThreadLocal_ 1000+ times - StackTraces trimmed so as not to take up too much space, but still with all required information to see the problem ``` Instantiation[10177] StackTrace: java.lang.Throwable at io.netty.util.concurrent.FastThreadLocal.Now we see that problem is related to HttpClient component and especially to io.vertx.core.net.impl.pool.CombinerExecutor It's generating a lot of instances and as consequence created a lot of FastThreadLocal instances
Cause of this is the next commit: https://github.com/eclipse-vertx/vert.x/commit/f15cf1ca873a33ed5fe153aba5c4cd4327697954 In scope of this PR: https://github.com/eclipse-vertx/vert.x/pull/4750 And this ticket: https://github.com/eclipse-vertx/vert.x/issues/4749
Additional information
I've checked who potentially can also cause this problem if someone change code without understanding FastThreadLocal implementation details
Already problem or high risk:
Can be a problem:
Good examples how to use FastThreadLocal feature in a safe way (just use as static final fields)