fabric8io / kubernetes-client

Java client for Kubernetes & OpenShift
http://fabric8.io
Apache License 2.0

OOM error when using the Kubernetes client to query and operate the API server. #5970

Closed: tinystorm closed this issue 1 month ago

tinystorm commented 1 month ago

Describe the bug

I'm using the Kubernetes client to query and operate on pods, and I'm not sure why I'm experiencing OOM (Out of Memory) errors. Within a pod, I run scheduled tasks that execute commands and query the complete list of pods (with caching). The OOM issue does not seem to be directly caused by the frequency of queries, as I haven't encountered the problem in larger environments with higher query and operation frequencies. Based on memory analysis, it appears that a large number of Http2Connection objects are not being released, and they occupy a significant portion of the memory. But I am confident that I am closing each client immediately after using it.

Note that my Kubernetes service is proxied through HAProxy and distributed to three API servers.

Fabric8 Kubernetes Client version

6.10.0

Steps to reproduce

The logic can be simplified into a loop.

  1. Creating a client
  2. Using the client (to query or operate)
  3. Closing the client
  4. Do other things
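
The loop above can be sketched in plain Java. Note this is a stdlib-only illustration: `StubClient` is a hypothetical stand-in for `KubernetesClient` (which would really be built with `KubernetesClientBuilder`), used only to make the create-use-close cycle concrete and to show that every created client is closed:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CreateUseCloseLoop {

    // Hypothetical stand-in for KubernetesClient, counting opens and closes
    // so the invariant "every created client is closed" is visible.
    static class StubClient implements AutoCloseable {
        static final AtomicInteger created = new AtomicInteger();
        static final AtomicInteger closed = new AtomicInteger();

        StubClient() { created.incrementAndGet(); }

        void listPods() { /* 2. query or operate via the API server */ }

        @Override
        public void close() { closed.incrementAndGet(); }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            // 1. Create a client; 3. close it via try-with-resources
            try (StubClient client = new StubClient()) {
                client.listPods();
            }
            // 4. Do other things
        }
        System.out.println(StubClient.created.get() + " created, "
                + StubClient.closed.get() + " closed");
    }
}
```

Every client created in the loop is closed by try-with-resources, which is why no connection leak is expected here.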

Expected behavior

No OOM

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

other (please specify in additional context)

Environment

Linux

Fabric8 Kubernetes Client Logs

2024-04-29 11:08:41,760 ERROR [CachedSingleThreadScheduler-2133446020-pool-6016204-thread-1] i.f.k.c.d.internal.ExecWebSocketListener: Exec Failure
java.util.concurrent.TimeoutException: null
 at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$withTimeout$0(AsyncUtils.java:42)
 at io.fabric8.kubernetes.client.utils.Utils.lambda$schedule$6(Utils.java:473)
 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
2024-04-29 11:08:41,794 ERROR [OkHttp https://127.0.0.1:6443/...] i.f.k.c.d.internal.ExecWebSocketListener: Exec Failure
java.util.concurrent.RejectedExecutionException: null
 at io.fabric8.kubernetes.client.utils.internal.SerialExecutor.execute(SerialExecutor.java:48)
 at java.base/java.util.concurrent.CompletableFuture.asyncRunStage(CompletableFuture.java:1750)
 at java.base/java.util.concurrent.CompletableFuture.runAsync(CompletableFuture.java:1959)
 at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener.asyncWrite(ExecWebSocketListener.java:191)
 at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener.lambda$createStream$2(ExecWebSocketListener.java:185)
 at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener$ListenerStream.handle(ExecWebSocketListener.java:113)
 at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener$ListenerStream.access$300(ExecWebSocketListener.java:99)
 at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener.onMessage(ExecWebSocketListener.java:314)
 at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$1.onMessage(OkHttpWebSocketImpl.java:110)
 at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.kt:338)
 at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.kt:247)
 at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.kt:106)
 at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.kt:293)
 at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:195)
 at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
2024-04-29 11:16:51,412 ERROR [scheduling-12] org.quartz.core.JobRunShell: Job HealthCheck.Service:18:VITAL_SIGN_CHECK threw an unhandled Exception: 
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:55,828 ERROR [MessageBroker-5] o.s.s.s.TaskUtils$LoggingErrorHandler: Unexpected error occurred in scheduled task
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:51,415 ERROR [CachedSingleThreadScheduler-2133446020-pool-6016399-thread-2] i.f.k.c.d.internal.ExecWebSocketListener: Exec Failure
java.util.concurrent.TimeoutException: null
 at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$withTimeout$0(AsyncUtils.java:42)
 at io.fabric8.kubernetes.client.utils.Utils.lambda$schedule$6(Utils.java:473)
 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
2024-04-29 11:16:50,638 ERROR [MessageBroker-28] o.s.s.s.TaskUtils$LoggingErrorHandler: Unexpected error occurred in scheduled task
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:50,638 ERROR [scheduling-25] o.s.s.s.TaskUtils$LoggingErrorHandler: Unexpected error occurred in scheduled task
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:55,828 ERROR [Catalina-utility-2] o.apache.coyote.http11.Http11NioProtocol: Error processing async timeouts
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
 at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
 at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
 at org.apache.coyote.AbstractProtocol.startAsyncTimeout(AbstractProtocol.java:681)
 at org.apache.coyote.AbstractProtocol.lambda$start$0(AbstractProtocol.java:667)
 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
 at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
 at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:50,638 ERROR [scheduling-40] o.s.s.s.TaskUtils$LoggingErrorHandler: Unexpected error occurred in scheduled task
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:55,829 ERROR [scheduling-41] org.quartz.core.JobRunShell: Job HealthCheck.Service:7:VITAL_SIGN_CHECK threw an unhandled Exception: 
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:55,829 ERROR [scheduling-20] org.quartz.core.JobRunShell: Job HealthCheck.Service:8:VITAL_SIGN_CHECK threw an unhandled Exception: 
java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:55,829 ERROR [scheduling-41] org.quartz.core.ErrorLogger: Job (HealthCheck.Service:7:VITAL_SIGN_CHECK threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.
 at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
 at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.OutOfMemoryError: Java heap space
2024-04-29 11:16:55,829 ERROR [scheduling-12] org.quartz.core.ErrorLogger: Job (HealthCheck.Service:18:VITAL_SIGN_CHECK threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.
 at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
 at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.OutOfMemoryError: Java heap space

Additional context

The k8s version is 1.16.1. If you need more information, please let me know.

shawkins commented 1 month ago

You need to see what is holding references to the Http2Connections.

tinystorm commented 1 month ago

@shawkins Sorry for the late reply; I just got back from vacation. It seems that ReaderRunnable is the GC root of the connections.

shawkins commented 1 month ago

Are there any other references to the Http2Connection instances besides ReaderRunnable? Are you using an OkHttp ConnectionPool for example?

If there is nothing else obviously holding on to the references, then you'll need to provide more of your code or a reproducer so we can see what code path might be leaving a connection open. We had something like this in the past with http2 https://github.com/fabric8io/kubernetes-client/pull/4665 - but have not encountered anything like that in a while.

tinystorm commented 1 month ago

Based on the heap, there aren't any other references pointing to Http2Connection. And I'm not using a pool (unless the default KubernetesClient uses one internally). Here is my code.

  1. This is a simple factory for the client.

    
    public class K8sClientFactory {

      private static final int TOS_CLIENT_RETRY_BACKOFF_LIMIT = 3;

      public KubernetesClient get() {
        Config config;
        if (settingService.getBool(SettingItem.K8S_USE_EXISTING)) {
          String configFile = settingService.getString(SettingItem.K8S_CONFIG_FILE);
          config = getK8sConfig(configFile);
        } else {
          config = getTosConfig();
        }
        return new KubernetesClientBuilder().withConfig(config).build();
      }

      public Config getK8sConfig(String configFile) {
        try {
          String content = String.join("\n", Files.readAllLines(new File(configFile).toPath()));
          return Config.fromKubeconfig(content);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      private Config getTosConfig() {
        String tosMasterUrl = settingService.getString(SettingItem.SERVICE_TOS_MASTER_URL);
        if (StringUtils.isBlank(tosMasterUrl)) {
          String haproxyPort =
              svcService
                  .getGlobalServiceOpt(ServiceType.TOS)
                  .flatMap(
                      tos ->
                          serviceConfigService.getServiceConfigValueOpt(
                              tos.getId(), ConfigItem.TOS_HAPROXY_PORT.getKey()))
                  .orElse(ConfigItem.TOS_HAPROXY_PORT.getDefaultValue());
          tosMasterUrl = String.format("https://127.0.0.1:%s/", haproxyPort);
        }
        return new ConfigBuilder()
            .withMasterUrl(tosMasterUrl)
            .withCaCertFile(pathProps.getTosCertDir().resolve(TosWrapper.CA_FILE).toString())
            .withClientCertFile(
                pathProps.getTosCertDir().resolve(TosWrapper.ADMIN_CERT_FILE).toString())
            .withClientKeyFile(pathProps.getTosCertDir().resolve(TosWrapper.ADMIN_KEY_FILE).toString())
            .withRequestRetryBackoffLimit(TOS_CLIENT_RETRY_BACKOFF_LIMIT)
            .build();
      }
    }

In this environment, it should be using the `TosConfig`.

2. These are two typical usage scenarios:

Use the client to execute some commands.

public class PodExecutor extends BaseExecutor {
  private final KubernetesClient k8sClient;
  private final String podName;
  private final String namespace;

  protected PodExecutor(KubernetesClient k8sClient, String namespace, String podName) {
    super(false);
    this.k8sClient = k8sClient;
    this.podName = podName;
    this.namespace = namespace;
  }

  public static PodExecutor create(KubernetesClient k8s, PodSummary podSummary) {
    return new PodExecutor(k8s, podSummary.getNamespace(), podSummary.getName());
  }

  @Override
  public int executeWithOutput(Writer stdout, Writer stderr, String cmd, long timeoutInMinutes)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteArrayOutputStream error = new ByteArrayOutputStream();
    try (ExecWatch execWatch =
        k8sClient
            .pods()
            .inNamespace(namespace)
            .withName(podName)
            .writingOutput(out)
            .writingError(error)
            .exec("/bin/bash", "-c", cmd)) {
      int exitCode = execWatch.exitCode().get(timeoutInMinutes, TimeUnit.MINUTES);
      IOUtils.write(out.toByteArray(), stdout, StandardCharsets.UTF_8);
      IOUtils.write(error.toByteArray(), stderr, StandardCharsets.UTF_8);
      return exitCode;
    } catch (InterruptedException e) {
      log.error("Pod executor is interrupted.", e);
      Thread.currentThread().interrupt();
      throw new AppException("Executor is interrupted.", e);
    } catch (ExecutionException | TimeoutException e) {
      throw new AppException(e);
    }
  }
}

`PodExecutor` does not close the client; the client is closed by the caller:
try (KubernetesClient tosClient = tosWrapper.getTosClient()) {
  List<Callable<ExecutionResult>> tasks =
      Seq.seq(fromPods)
          .map(
              fromPod ->
                  (Callable<ExecutionResult>)
                      () -> PodExecutor.create(tosClient, fromPod).execute(renderedCommand))
          .toList();
  List<Future<ExecutionResult>> futures = defaultExecutor.invokeAll(tasks);
  //other code...
} catch (InterruptedException e) {
  log.error("Thread is interrupted.", e);
  Thread.currentThread().interrupt();
  throw new InvalidGroupedRoleCheckException(
      HealthCheckMessageKeys.GENERAL_ERROR, "Thread is interrupted.", e);
} catch (Exception e) {
  log.error("Failed to check role.", e);
  throw new InvalidGroupedRoleCheckException(
      HealthCheckMessageKeys.GENERAL_ERROR, "Failed to check.", e);
}
Use the client to query all pods with pagination.

@Cacheable(value = CacheConfig.PODS_CACHE_NAME, key = "#namespace", sync = true)
public List getPods(String namespace) {
  try (KubernetesClient tosClient = k8sClientFactory.get()) {
    List results = new ArrayList<>();
    long limit = settingService.getInt(SettingItem.SERVICE_K8S_RESOURCE_LIST_LIMIT);
    String listContinue = null;
    log.debug("Start to list pods using pagination.");
    for (int i = 0; ; i++) {
      Tuple2<String, List> pods = getPods(tosClient, namespace, limit, listContinue);
      listContinue = pods.v1;
      log.debug(
          "On the {}th pod query, {} records were retrieved, continuation value is {}.",
          i, pods.v2.size(), listContinue);
      results.addAll(pods.v2);
      if (listContinue == null) {
        break;
      }
    }
    return results;
  }
}

private Tuple2<String, List> getPods(
    KubernetesClient tosClient, String namespace, long limit, String listContinue) {
  ListOptions listOptions =
      new ListOptionsBuilder().withLimit(limit).withContinue(listContinue).build();
  PodList podList = tosClient.pods().inNamespace(namespace).list(listOptions);
  return Tuple.tuple(
      podList.getMetadata().getContinue(),
      Seq.seq(podList.getItems()).map(KubeUtils::toPodSummary).toList());
}

shawkins commented 1 month ago

A couple of thoughts:

tinystorm commented 1 month ago

Previously, I used a global client and found that after many queries I also encountered OOM errors. After analysis, I discovered that the KubernetesClient was retaining a large number of query results (possibly one per query) without releasing the memory. That's why I decided to close the client after each use. Do you have any suggestions for addressing the OOM issue caused by keeping a client open for a long time?

shawkins commented 1 month ago

Do you have any suggestions for addressing the OOM issue caused by keeping a client open for a long time?

It's impossible to say from just this description. It could range from a usage error to a bug in the client or the underlying HTTP client library.

If it's not a usage error, then you can try one of the other client types to see if the behavior changes.

tinystorm commented 1 month ago

Okay, maybe I should use a client pool in some form. Anyway, thank you very much for your answer. Feel free to close this issue.

shawkins commented 1 month ago

Okay, maybe I should use a client pool in some form.

You don't need a pool of KubernetesClients; just one for a given configuration / cluster. All of the HTTP clients underneath the KubernetesClient use connection pooling.
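
As a minimal illustration of that advice (names are hypothetical; `StubClient` stands in for `KubernetesClient`, which in real code would be built once per configuration and closed on application shutdown):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClientHolder {

    // Hypothetical stand-in for KubernetesClient.
    static class StubClient implements AutoCloseable {
        @Override
        public void close() { }
    }

    // One client per configuration key (e.g. kubeconfig path or master URL),
    // created lazily and reused for the life of the application.
    private static final Map<String, StubClient> CLIENTS = new ConcurrentHashMap<>();

    public static StubClient get(String configKey) {
        return CLIENTS.computeIfAbsent(configKey, k -> new StubClient());
    }

    // Call once on application shutdown, not after each use.
    public static void shutdown() {
        CLIENTS.values().forEach(StubClient::close);
        CLIENTS.clear();
    }
}
```

With this shape, repeated calls for the same configuration return the same client, so connection pools are reused instead of rebuilt on every operation.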

Please double check that any InputStreams and Readers you obtain from the KubernetesClient are getting closed.

Testing this out locally seems to confirm some of what you are observing: these connections survive eviction from the pool because they have active allocations (open streams). However, the ConnectionPool should not get garbage collected and should still have a reference to the connection. This is because there should be a thread called "OkHttp ConnectionPool" running that holds a reference to it; at 5-minute intervals it checks for orphaned allocations and emits messages like "Did you forget to close a response body?".

One thing we can consider is adding these streams to our internal closure list to ensure they are cleaned up sooner than 5 minutes.

tinystorm commented 1 month ago

Please double check that any InputStreams and Readers you obtain from the KubernetesClient are getting closed.

When I used only a single global client, the scenarios were limited to querying the list of Pods (without pagination) and createOrReplace on resources. As far as I know, neither of these scenarios opens a stream. Does this mean I don't need to perform any closing operations? I will retry using one global client to handle everything. If I encounter the OOM issue again, I will continue to follow up on this issue. Thank you once again.

shawkins commented 1 month ago

As far as I know, neither of these scenarios will open a stream. Does this mean I don't need to perform any closing operations?

Correct. Neither of those operations maintains an ongoing stream with the API server.

shawkins commented 1 month ago

Marking as closed until there is more information.

yan-v commented 1 month ago

Hi, I'm experiencing the same error. I will try using a single client per app, but the examples here are confusing: if the client is closable, we should use it in a try-with-resources block, as shown in the examples, which means that after every flow completes (exits the try block) the resource is closed and must be created again next time.

shawkins commented 1 month ago

Will try to use single client per app

@yan-v That should not fully resolve your issue if you are experiencing the same behavior as @tinystorm, and without a further reproducer it's hard to say exactly what is going on. If you are able to provide one that would be great.

but the examples here are confusing, if client is closable we should use it in try block with resource

The examples do not stress client reuse, that is correct. That should be covered in other parts of the docs, and it is certainly handled for you when you use the client as part of a platform like Quarkus.

If you see a place where additional comments / docs would help, please open an issue.

which means after every flow is completed(exits the try block) the resource should be closed and created next time.

The client implements Closeable because it exposes a close method; it does not require you to use it in a try-with-resources block. I'm assuming the examples were written the way they are so that they read as free-standing, rather than showing injection or separate lifecycle handling of the client.

tinystorm commented 1 month ago

Please double check that any InputStreams and Readers you obtain from the KubernetesClient are getting closed.

@shawkins I just realized that my API server is fronted by an HAProxy proxy. Could this potentially cause connections to not be released?

shawkins commented 1 month ago

I just realized that my API server is fronted by an HAProxy proxy. Could this potentially cause connections to not be released?

I am not sure. From the client's perspective, I'd hope that the connection at least returns to the pool, and that the job to clean up orphaned allocations / streams works regardless of what is fronting the API server.

The other things to keep in mind: what you are seeing could be HTTP/2 specific (are you able to use HTTP/1.1 instead?), and it could be OkHttp specific (using a different HTTP client with the Kubernetes client might clarify this, or could highlight more clearly what is being left open).
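
For anyone wanting to try those two experiments, both can be expressed through client configuration. This is a sketch, not verified against every release: it assumes the fabric8 `ConfigBuilder` exposes `withHttp2Disable` (present in recent 6.x versions), and that swapping the HTTP client is done by replacing the `kubernetes-httpclient-okhttp` dependency with another `kubernetes-httpclient-*` module (e.g. `kubernetes-httpclient-jdk`) on the classpath:

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class Http1ClientSketch {

    public static KubernetesClient build() {
        // Experiment 1: disable HTTP/2 so connections fall back to HTTP/1.1.
        Config config = new ConfigBuilder()
            .withHttp2Disable(true)
            .build();

        // Experiment 2 needs no code change: put kubernetes-httpclient-jdk
        // (or -vertx, -jetty) on the classpath instead of the OkHttp module;
        // the builder discovers the available HttpClient.Factory automatically.
        return new KubernetesClientBuilder().withConfig(config).build();
    }
}
```

If the leak disappears under HTTP/1.1 or under a different HTTP client, that isolates the problem to the HTTP/2 path or to OkHttp respectively.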

yan-v commented 4 weeks ago

This is the error I got (a different trace than in the initial post):

callStack="io.fabric8.kubernetes.client.KubernetesClientException: Java heap space
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleGet(BaseOperation.java:792)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.requireFromServer(BaseOperation.java:193)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:149)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.scale(HasMetadataOperation.java:293)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.scale(HasMetadataOperation.java:44)

The simplified code I used before making k8sClient a singleton:

try (KubernetesClient k8sClient = new KubernetesClientBuilder().build()) {
    String k8sNamespace = "MY_NAMESPACE";
    k8sClient.apps().deployments().inNamespace(k8sNamespace).withName(deploymentName).scale(someMyNumber);
} catch (Exception e) {
    ...
}

The error does not appear immediately, but after a few executions and about an hour or two, even if the service was idle during those hours.

Thank you for your support!

shawkins commented 4 weeks ago

This is the error I got (a different trace than in the initial post):

Unfortunately, the stack trace alone is not enough; at a minimum we need a heap dump to confirm what is being held in memory, and then, if needed, more reproduction steps.

The error does not appear immediately, but after a few executions and about an hour or two, even if the service was idle during those hours.

Start with a heap dump and see what is being held - if possible also try the alternatives mentioned in https://github.com/fabric8io/kubernetes-client/issues/5970#issuecomment-2110023598 - that should narrow things down as much as possible to where the problem lies.

yan-v commented 4 weeks ago

Luckily, the client as a singleton works for now; I'll keep it as is. If we experience the issue again, I'll provide all the info.

Thank you again for your quick responses!