camunda / camunda-platform-helm

Camunda Platform 8 Self-Managed Helm charts
https://docs.camunda.io/docs/self-managed/overview/
Apache License 2.0

[ISSUE] TEST-Check-Connectors-webhook Integration Test Is Flaky #2349

Open slolatte opened 2 weeks ago

slolatte commented 2 weeks ago

Describe the issue:

I've noticed that the integration test 'TEST-Check-Connectors-webhook' is quite flaky and often causes the Helm chart setup to fail when triggered via GitHub Actions (GHA). Rerunning the workflow usually fixes the issue.

Actual behavior:

An example can be seen in this test run.

Expected behavior:

I would expect the integration test to be more robust, and I suggest refactoring it.

How to reproduce:

Context - please see Slack thread.

Logs:

Environment:

Please note: Without the following info, the issue is hard to resolve and will probably be closed.

hamza-m-masood commented 2 weeks ago

I think this is because the connector pod sometimes fails to start. I normally see this type of error when the connector pod hangs:

2024-09-16T09:45:51.506Z  WARN 1 --- [pool-2-thread-9] i.c.z.client.impl.ZeebeCallCredentials   : The request's security level does not guarantee that the credentials will be confidential.

aabouzaid commented 4 days ago

The integration test itself is not flaky; this is a reported issue in Connectors, which shows up in the logs as follows:

2024-09-27T08:23:09.840Z  WARN 1 --- [lt-executor-141] io.camunda.zeebe.client.job.poller       : Failed to activate jobs for worker HTTP REST and job type io.camunda:http-json:1

io.grpc.StatusRuntimeException: CANCELLED
    at io.grpc.Status.asRuntimeException(Status.java:533)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:564)
    at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:729)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:710)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Failed while requesting access token with status code 401 and message Unauthorized.
    at io.camunda.zeebe.client.impl.oauth.OAuthCredentialsProvider.fetchCredentials(OAuthCredentialsProvider.java:157)
    at io.camunda.zeebe.client.impl.oauth.OAuthCredentialsCache.computeIfMissingOrInvalid(OAuthCredentialsCache.java:100)
    at io.camunda.zeebe.client.impl.oauth.OAuthCredentialsProvider.applyCredentials(OAuthCredentialsProvider.java:79)
    at io.camunda.zeebe.client.impl.ZeebeCallCredentials.lambda$applyRequestMetadata$0(ZeebeCallCredentials.java:49)
    ... 3 common frames omitted
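
To see how widespread this warning is in a given log, a small script along these lines could tally it per worker and job type. This is only a sketch: the regular expression is derived from the single warning line quoted above, and the function name `count_failed_activations` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Pattern derived from the warning line quoted above; other log formats
# would need a different expression.
PATTERN = re.compile(
    r"Failed to activate jobs for worker (?P<worker>.+?) "
    r"and job type (?P<job_type>\S+)"
)


def count_failed_activations(log_text: str) -> Counter:
    """Count 'Failed to activate jobs' warnings per (worker, job type)."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match["worker"], match["job_type"])] += 1
    return counts


if __name__ == "__main__":
    # Read the log from stdin and print the most frequent offenders first.
    totals = count_failed_activations(sys.stdin.read())
    for (worker, job_type), n in totals.most_common():
        print(f"{n:5d}  {worker}  {job_type}")
```

For example, the Pod log could be piped in with `kubectl logs <connectors-pod> | python count_failed_activations.py`, where the Pod name and script file name are placeholders.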

The error is not limited to that Connectors worker; many other workers show the same error in the logs. Restarting the Pod fixes the issue, which makes it more likely that the problem is related to the app's retry logic.
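
To illustrate what "retry logic" could mean here, the sketch below retries a failing token request with exponential backoff instead of giving up after the first 401. It is not the Connectors or Zeebe client implementation. `TokenRequestError`, `fetch_with_backoff`, and all parameters are hypothetical names, and treating a 401 as retryable is an assumption that only makes sense if the identity provider or client credentials are not ready yet at startup.

```python
import time
from typing import Callable


class TokenRequestError(Exception):
    """Hypothetical error carrying the HTTP status of a failed token request."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"status code {status} and message {message}")
        self.status = status


def fetch_with_backoff(
    fetch_token: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_token until it succeeds or max_attempts is reached.

    The delay doubles after every failure and is capped at max_delay.
    The last error is re-raised once the attempts are exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_token()
        except TokenRequestError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("max_attempts must be at least 1")
```

With a loop like this, a worker that starts before its credentials are accepted would recover on its own instead of needing a Pod restart. Whether that matches the actual bug in Connectors is not confirmed in this thread.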

aabouzaid commented 4 days ago

I've disabled the Connectors test for 8.6 chart until the bug is fixed: https://github.com/camunda/camunda-platform-helm/commit/5784bc56fd6162269090f6fea018e142a2c15c9d