You need to see what is holding references to the Http2Connections.
@shawkins Sorry for not replying promptly; I just got back from vacation. It seems that ReaderRunnable is the GC root of the connections.
Are there any other references to the Http2Connection instances besides ReaderRunnable? Are you using an OkHttp ConnectionPool for example?
If there is nothing else obviously holding on to the references, then you'll need to provide more of your code or a reproducer so we can see what code path might be leaving a connection open. We had something like this in the past with http2 https://github.com/fabric8io/kubernetes-client/pull/4665 - but have not encountered anything like that in a while.
Based on the heap, there aren't any other references pointing to Http2Connection, and I'm not using a pool (unless the default KubernetesClient uses one internally).
Here is my code.
1. This is a simple factory for the client.
public class K8sClientFactory {

  private static final int TOS_CLIENT_RETRY_BACKOFF_LIMIT = 3;

  public KubernetesClient get() {
    Config config;
    if (settingService.getBool(SettingItem.K8S_USE_EXISTING)) {
      String configFile = settingService.getString(SettingItem.K8S_CONFIG_FILE);
      config = getK8sConfig(configFile);
    } else {
      config = getTosConfig();
    }
    return new KubernetesClientBuilder().withConfig(config).build();
  }

  public Config getK8sConfig(String configFile) {
    try {
      String content = String.join("\n", Files.readAllLines(new File(configFile).toPath()));
      return Config.fromKubeconfig(content);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  private Config getTosConfig() {
    String tosMasterUrl = settingService.getString(SettingItem.SERVICE_TOS_MASTER_URL);
    if (StringUtils.isBlank(tosMasterUrl)) {
      String haproxyPort =
          svcService
              .getGlobalServiceOpt(ServiceType.TOS)
              .flatMap(
                  tos ->
                      serviceConfigService.getServiceConfigValueOpt(
                          tos.getId(), ConfigItem.TOS_HAPROXY_PORT.getKey()))
              .orElse(ConfigItem.TOS_HAPROXY_PORT.getDefaultValue());
      tosMasterUrl = String.format("https://127.0.0.1:%s/", haproxyPort);
    }
    return new ConfigBuilder()
        .withMasterUrl(tosMasterUrl)
        .withCaCertFile(pathProps.getTosCertDir().resolve(TosWrapper.CA_FILE).toString())
        .withClientCertFile(
            pathProps.getTosCertDir().resolve(TosWrapper.ADMIN_CERT_FILE).toString())
        .withClientKeyFile(pathProps.getTosCertDir().resolve(TosWrapper.ADMIN_KEY_FILE).toString())
        .withRequestRetryBackoffLimit(TOS_CLIENT_RETRY_BACKOFF_LIMIT)
        .build();
  }
}
In this environment, it should be using the `TosConfig`.
2. These are two typical usage scenarios:
Use the client to execute some commands.
public class PodExecutor extends BaseExecutor {

  private final KubernetesClient k8sClient;
  private final String podName;
  private final String namespace;

  protected PodExecutor(KubernetesClient k8sClient, String namespace, String podName) {
    super(false);
    this.k8sClient = k8sClient;
    this.podName = podName;
    this.namespace = namespace;
  }

  public static PodExecutor create(KubernetesClient k8s, PodSummary podSummary) {
    return new PodExecutor(k8s, podSummary.getNamespace(), podSummary.getName());
  }

  @Override
  public int executeWithOutput(Writer stdout, Writer stderr, String cmd, long timeoutInMinutes) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteArrayOutputStream error = new ByteArrayOutputStream();
    try (ExecWatch execWatch =
        k8sClient
            .pods()
            .inNamespace(namespace)
            .withName(podName)
            .writingOutput(out)
            .writingError(error)
            .exec("/bin/bash", "-c", cmd)) {
      int exitCode = execWatch.exitCode().get(timeoutInMinutes, TimeUnit.MINUTES);
      IOUtils.write(out.toByteArray(), stdout, StandardCharsets.UTF_8);
      IOUtils.write(error.toByteArray(), stderr, StandardCharsets.UTF_8);
      return exitCode;
    } catch (InterruptedException e) {
      log.error("Pod executor is interrupted.", e);
      Thread.currentThread().interrupt();
      throw new AppException("Executor is interrupted.", e);
    } catch (ExecutionException | TimeoutException e) {
      throw new AppException(e);
    }
  }
}
`PodExecutor` does not close the client, but the client is closed outside.
try (KubernetesClient tosClient = tosWrapper.getTosClient()) {
List<Callable<ExecutionResult>> tasks =
Seq.seq(fromPods)
.map(
fromPod ->
(Callable<ExecutionResult>)
() -> PodExecutor.create(tosClient, fromPod).execute(renderedCommand))
.toList();
List<Future<ExecutionResult>> futures = defaultExecutor.invokeAll(tasks);
//other code...
} catch (InterruptedException e) {
log.error("Thread is interrupted.", e);
Thread.currentThread().interrupt();
throw new InvalidGroupedRoleCheckException(
HealthCheckMessageKeys.GENERAL_ERROR, "Thread is interrupted.", e);
} catch (Exception e) {
log.error("Failed to check role.", e);
throw new InvalidGroupedRoleCheckException(
HealthCheckMessageKeys.GENERAL_ERROR, "Failed to check.", e);
}
Use the client to query all pods by pagination.
@Cacheable(value = CacheConfig.PODS_CACHE_NAME, key = "#namespace", sync = true)
public List<PodSummary> ...

private Tuple2<String, List<PodSummary>> ... {
  PodList podList = tosClient.pods().inNamespace(namespace).list(listOptions);
  return Tuple.tuple(
      podList.getMetadata().getContinue(),
      Seq.seq(podList.getItems()).map(KubeUtils::toPodSummary).toList());
}
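For reference, here is a minimal sketch of what a full paginated listing loop with the list continue token could look like - the helper class, method name, and page size below are illustrative and not part of the original code:

import io.fabric8.kubernetes.api.model.ListOptions;
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.ArrayList;
import java.util.List;

public class PodPager {

  // Illustrative helper: pages through all pods in a namespace using the
  // list "continue" token, 50 items per request.
  public static List<Pod> listAllPods(KubernetesClient client, String namespace) {
    List<Pod> result = new ArrayList<>();
    String continueToken = null;
    do {
      ListOptions options = new ListOptionsBuilder()
          .withLimit(50L)
          .withContinue(continueToken)
          .build();
      PodList page = client.pods().inNamespace(namespace).list(options);
      result.addAll(page.getItems());
      continueToken = page.getMetadata().getContinue();
    } while (continueToken != null && !continueToken.isEmpty());
    return result;
  }
}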
A couple of thoughts:
Previously, I used a global client and found that after many queries I also encountered OOM errors. After analysis, I discovered that the KubernetesClient was retaining a large number of query results (possibly one per query) without releasing the memory. That's why I decided to close the client after each use. Do you have any suggestions for addressing the OOM issue caused by not closing the client for a long time?
Do you have any suggestions for addressing the OOM issue caused by not closing the client for a long time?
It's impossible to say from just this description. It could range from:
If it's not a usage error, then you can try one of the other client types to see if the behavior changes.
Okay, maybe I should use a client pool in some form. Anyway, thank you very much for your answer. You may close this if you want.
Okay, maybe I should use a client pool in some form.
You don't need a pool of KubernetesClients - just one for a given configuration / cluster. All of the HTTP clients underneath the KubernetesClient use connection pooling.
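For illustration, a minimal sketch of what "just one" can look like - the holder class below is hypothetical; only the reuse pattern matters:

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public final class SharedK8sClient {

  // One long-lived client per configuration / cluster; the HTTP client
  // underneath pools and reuses connections across requests.
  private static volatile KubernetesClient instance;

  private SharedK8sClient() {
  }

  public static KubernetesClient get(Config config) {
    KubernetesClient local = instance;
    if (local == null) {
      synchronized (SharedK8sClient.class) {
        local = instance;
        if (local == null) {
          local = new KubernetesClientBuilder().withConfig(config).build();
          instance = local;
        }
      }
    }
    return local;
  }
}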
Please double check that any InputStreams and Readers you obtain from the KubernetesClient are getting closed.
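For example, streaming calls such as a pod log watch return a Closeable handle that has to be closed to release the underlying stream - a minimal sketch, with namespace and pod name as placeholders:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.OutputStream;

public class LogStreaming {

  // A streaming call holds a connection / stream open until the handle is closed.
  public static void tailLogs(KubernetesClient client, String namespace, String pod, OutputStream out) {
    try (LogWatch watch = client.pods()
        .inNamespace(namespace)
        .withName(pod)
        .watchLog(out)) {
      // ... consume the log output for as long as needed ...
    } // closing the LogWatch releases the underlying stream
  }
}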
Testing this out locally seems to confirm some of what you are observing - these connections survive eviction from the pool because they have active allocations (open streams). However, the ConnectionPool should not get garbage collected and should still have a reference to the connection, because there should be a thread called "OkHttp ConnectionPool" running that holds a reference to it - and it should, at 5-minute intervals, check for orphaned allocations and emit messages like "Did you forget to close a response body?".
One thing we can consider is adding these streams to our internal closure list to ensure they are cleaned up sooner than 5 minutes.
Please double check that any InputStreams and Readers you obtain from the KubernetesClient are getting closed.
When I only used a single global client, the scenarios were limited to querying a list of Pods (without using pagination) and createOrReplace of resources. As far as I know, neither of these scenarios will open a stream. Does this mean I don't need to perform any closing operations?
I will go back to using one global client to handle everything. If I encounter the OOM issue again, I will follow up on this issue.
Thank you once again.
As far as I know, neither of these scenarios will open a stream. Does this mean I don't need to perform any closing operations?
Correct. Neither of those operations maintains an ongoing stream with the API server.
Marking as closed until there is more information.
Hi, I'm experiencing the same error. I will try to use a single client per app, but the examples here are confusing: if the client is closeable, we should use it in a try-with-resources block, as shown in the examples, which means that after every flow completes (exits the try block) the resource is closed and has to be created again next time.
I will try to use a single client per app
@yan-v That should not fully resolve your issue if you are experiencing the same behavior as @tinystorm, and without a further reproducer it's hard to say exactly what is going on. If you are able to provide one that would be great.
but the examples here are confusing: if the client is closeable, we should use it in a try-with-resources block
The examples do not stress client reuse, that is correct - that should be covered in other parts of the docs, and it is certainly handled for you when you use the client as part of a platform like Quarkus.
If you see a place where additional comments / docs would help, please open an issue.
which means that after every flow completes (exits the try block) the resource is closed and has to be created again next time.
The client implements Closeable because it exposes a close method - it does not require you to use it in a try-with-resources block. I'm assuming the examples were written the way they are so that they read as free-standing, rather than showing injection of, or separate lifecycle handling for, the client.
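As an illustration of that separate lifecycle handling, a minimal sketch (not taken from the docs) that builds the client once and closes it only when the application shuts down:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ClientLifecycle {
  public static void main(String[] args) {
    // Build the client once at application startup...
    KubernetesClient client = new KubernetesClientBuilder().build();

    // ...and close it once, when the application shuts down,
    // instead of wrapping every call in try-with-resources.
    Runtime.getRuntime().addShutdownHook(new Thread(client::close));

    // ... use `client` for the lifetime of the application ...
  }
}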
Please double check that any InputStreams and Readers you obtain from the KubernetesClient are getting closed.
@shawkins It suddenly occurred to me that my API server sits behind an HAProxy proxy. Could this potentially cause connections to not be released?
It suddenly occurred to me that my API server sits behind an HAProxy proxy. Could this potentially cause connections to not be released?
I am not sure - from a client perspective I'd hope that the connection at least returns to the pool, and that the job to clean up orphaned allocations / streams works regardless of what is fronting the API server.
The other things to keep in mind: what you are seeing could be HTTP/2-specific - are you able to use HTTP/1.1 instead? And it could be OkHttp-specific - using a different HttpClient implementation with the Kubernetes client might clarify this, or could highlight more clearly what is being left open.
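For the second suggestion, something along these lines swaps the underlying HTTP client implementation - assuming the kubernetes-httpclient-jdk artifact is on the classpath; the factory class and builder method names are as I recall them for 6.x and may differ between versions:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.jdkhttp.JdkHttpClientFactory;

public class AlternateHttpClient {
  public static KubernetesClient build() {
    // Using the JDK HttpClient implementation instead of OkHttp can show
    // whether the retained Http2Connection objects are OkHttp-specific.
    return new KubernetesClientBuilder()
        .withHttpClientFactory(new JdkHttpClientFactory())
        .build();
  }
}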
This is the error I got (a different trace than in the initial post):
callStack="io.fabric8.kubernetes.client.KubernetesClientException: Java heap space
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleGet(BaseOperation.java:792)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.requireFromServer(BaseOperation.java:193)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:149)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.scale(HasMetadataOperation.java:293)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.scale(HasMetadataOperation.java:44)
The simplified code I used before making k8sClient a singleton:
try (KubernetesClient k8sClient = new KubernetesClientBuilder().build()) {
  String k8sNamespace = "MY_NAMESPACE";
  k8sClient.apps().deployments().inNamespace(k8sNamespace).withName(deploymentName).scale(someMyNumber);
} catch (Exception e) {
  ...
}
The error does not appear immediately, but after a few executions and about an hour or two, even if the service was idle during those hours.
Thank you for your support!
This is the error I got (a different trace than in the initial post):
Unfortunately just the stack trace is not enough - at a minimum a heap dump is needed to confirm what is being held in memory, and then, if needed, more reproduction steps.
The error does not appear immediately, but after a few executions and about an hour or two, even if the service was idle during those hours.
Start with a heap dump and see what is being held - if possible also try the alternatives mentioned in https://github.com/fabric8io/kubernetes-client/issues/5970#issuecomment-2110023598 - that should narrow things down as much as possible to where the problem lies.
Luckily, the client as a singleton works for now, so I'll keep it as is. If we experience the issue again, I'll provide all the info.
Thank you again for your quick responses!
Describe the bug
I'm using the Kubernetes client to query and operate on pods, and I'm not sure why I'm experiencing OOM (Out of Memory) errors. I am executing commands and querying the complete list of pods (with caching) inside a pod on a schedule. It seems that the OOM issue is not directly caused by the frequency of queries, as I haven't encountered the problem in larger environments with higher query and operation frequencies. Based on the memory analysis, it appears that a large number of Http2Connection objects are not being released, causing them to occupy a significant portion of the memory. But I am confident that I am closing each client immediately after using it.
Note that my Kubernetes service is proxied through HAProxy and distributed to three API servers.
Fabric8 Kubernetes Client version
6.10.0
Steps to reproduce
The logic can be simplified into a loop.
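For context, a rough sketch of that loop, using the factory and usage shown in the comments above (the interval and the per-iteration calls are placeholders):

import io.fabric8.kubernetes.client.KubernetesClient;

public class ReproLoop {
  public static void main(String[] args) throws InterruptedException {
    K8sClientFactory factory = new K8sClientFactory();
    while (true) {
      // Each scheduled run builds a fresh client, uses it for an exec and/or
      // a paged pod listing, and closes it again.
      try (KubernetesClient client = factory.get()) {
        // e.g. PodExecutor.create(client, pod).execute(command);
        // e.g. client.pods().inNamespace(namespace).list(listOptions);
      }
      Thread.sleep(60_000L); // placeholder scheduling interval
    }
  }
}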
Expected behavior
No OOM
Runtime
Kubernetes (vanilla)
Kubernetes API Server version
other (please specify in additional context)
Environment
Linux
Fabric8 Kubernetes Client Logs
Additional context
The k8s version is 1.16.1. If you need more information, please let me know.