camunda-community-hub / zeebe-client-node-js

Node.js client library for Zeebe Microservices Orchestration Engine
https://camunda-community-hub.github.io/zeebe-client-node-js/
Apache License 2.0

Wrong expiry timer in OAuthProvider.ts #336

Closed dostalradim closed 11 months ago

dostalradim commented 1 year ago

There is probably a race condition in your token-deletion mechanism. Our ZBWorker randomly throws an exception with Grpc Stream Error: 16 UNAUTHENTICATED: Failed to parse bearer token, see cause for details, and it always happens at the end of the token's expiration time. It does not happen every time; it is really non-deterministic. Our tokens are valid for 5 minutes, and the application throws this error only ~10 times per day.

What I checked

Expected Behavior

Obtain a new token a little before the current one expires.

Current Behavior

The expiry timer removes the access token while it is still being used in other parts of the code.

Possible Solution

Change the validityPeriod value to something lower than token.expiry - current; in other words, remove the token before it expires. Or add a better validity check?
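
For illustration, a minimal sketch of this suggestion (the names below are hypothetical, not the library's internals): treat the token as expired a little before it really is, so eviction always happens ahead of the gateway rejecting it.

// Hypothetical helper, not part of the zeebe-node API: compute how long the
// cached token should be considered valid, with a safety margin before the
// real expiry (token.expiry and "current" are in seconds, as in the report).
function validityPeriodMs(tokenExpirySeconds: number, nowSeconds: number = Date.now() / 1000): number {
    const safetyWindowSeconds = 10 // hypothetical safety margin
    return Math.max((tokenExpirySeconds - nowSeconds - safetyWindowSeconds) * 1000, 0)
}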

Steps to Reproduce

Install the latest camunda-platform Helm chart and create an application with a ZBWorker like this.

Client settings:

const zeebeClientOptions: ZBClientOptions = {
    hostname: env.ZEEBE_GATEWAY_HOST,
    stdout: logger,
    longPoll: Duration.seconds.of(longPoolTimeoutSeconds),
    useTLS: env.ZEEBE_USETLS,
    oAuth: {
        url: env.ZEEBE_AUTHORIZATION_SERVER_URL,
        audience: env.ZEEBE_TOKEN_AUDIENCE,
        clientId: env.ZEEBE_CLIENT_ID,
        clientSecret: env.ZEEBE_CLIENT_SECRET,
        customRootCert: env.ZEEBE_CUSTOM_CERT_PATH ? fs.readFileSync(env.ZEEBE_CUSTOM_CERT_PATH) : undefined
    }
};

export const zeebe = new ZBClient(zeebeClientOptions);

Worker settings:

zeebe.createWorker({
    taskType: key,
    maxJobsToActivate: 1,
    taskHandler: value,
    onReady: () => logger.debug(`Worker ${key} connected!`),
    onConnectionError: () => logger.error(`Worker ${key} disconnected!`)
});

Errors from logs

2023-10-10T10:01:21.944Z oauth [OAuth] Requesting token from token endpoint...
---TRUNCATED---
2023-10-10T10:06:21.806Z oauth [OAuth] Using cached token from memory...
{"level":"info","message":"{\"timestamp\":\"2023-10-10T10:06:21.807Z\",\"context\":\"/app/node_modules/zeebe-node/dist/lib/StatefulLogInterceptor.js:16\",\"level\":20,\"message\":\"Stream [d6d3f4ae-9c70-44f8-a8eb-c41f0325acc7] opened...\",\"time\":\"2023 Oct-10 12:06:21PM\",\"pollInterval\":30000,\"namespace\":\"ZBWorker\",\"taskType\":\"io.camunda.zeebe:userTask\"}"}
2023-10-10T10:06:21.850Z grpc Error Error: 16 UNAUTHENTICATED: Failed to parse bearer token, see cause for details
    at callErrorFromStatus (/app/node_modules/@grpc/grpc-js/src/call.ts:81:17)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/src/client.ts:597:51)
    at /app/node_modules/@grpc/grpc-js/src/call-interface.ts:149:27
    at Object.onReceiveStatus (/app/node_modules/zeebe-node/src/lib/GrpcClient.ts:668:14)
    at InterceptingListenerImpl.onReceiveStatus (/app/node_modules/@grpc/grpc-js/src/call-interface.ts:145:19)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/src/client-interceptors.ts:417:48)
    at /app/node_modules/@grpc/grpc-js/src/resolving-call.ts:111:24
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
for call at
    at ServiceClientImpl.makeServerStreamRequest (/app/node_modules/@grpc/grpc-js/src/client.ts:580:30)
    at ServiceClientImpl.<anonymous> (/app/node_modules/@grpc/grpc-js/src/make-client.ts:189:15)
    at GrpcClient.<computed> [as activateJobsStream] (/app/node_modules/zeebe-node/src/lib/GrpcClient.ts:287:39)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at ZBWorker.activateJobs (/app/node_modules/zeebe-node/src/lib/ZBWorkerBase.ts:543:13)
    at ZBWorker.poll (/app/node_modules/zeebe-node/src/lib/ZBWorkerBase.ts:454:21) {
  code: 16,
  details: 'Failed to parse bearer token, see cause for details',
  metadata: Metadata {
    internalRepr: Map(1) { 'content-type' => [Array] },
    options: {}
{"level":"error","message":"Worker io.camunda.zeebe:userTask disconnected!"}
  },
  progress: 'PROCESSED'
}
jwulf commented 1 year ago

Let me look into this some more. Thanks for reporting it, and for the comprehensive detail.

dostalradim commented 1 year ago

I did not mention that we have 10 workers/handlers, so the probability of this error is higher. The error is thrown by just one, or at most two, of them at a time. Thank you for your response. Let me know if you need more information.

jwulf commented 12 months ago

A quick update on this:

I've just completed support for multi-tenancy for 8.3.0. I will address this issue at the start of next week and ship a fix for it in the 8.3.0 release (which will drop as soon as I patch this, so Tuesday or Wednesday).

jwulf commented 11 months ago

The good news:

I have reproduced the problem on my machine using Self-Managed, by setting token expiry to 10 seconds, then starting the example/worker.js with longPoll: false and creating ten copies of the worker.

After a minute or two, all ten workers threw this error at the same time.
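
As a rough sketch of that reproduction (assumptions: connection and OAuth settings come from the standard ZEEBE_* environment variables via the zero-conf constructor, the identity provider's token lifetime has been shortened to ~10 seconds, and the task type and handler are illustrative; the original run also had long polling disabled):

import { ZBClient } from 'zeebe-node'

// Zero-conf client: reads ZEEBE_ADDRESS / ZEEBE_CLIENT_ID / ZEEBE_CLIENT_SECRET /
// ZEEBE_AUTHORIZATION_SERVER_URL from the environment.
const zbc = new ZBClient()

// Ten copies of the same worker, so several pollers hit the token cache around
// the moment the expiry timer fires.
for (let i = 0; i < 10; i++) {
    zbc.createWorker({
        taskType: 'demo-task', // illustrative task type
        taskHandler: async (job) => job.complete(),
    })
}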

jwulf commented 11 months ago

Looks like the issue is this:

  1. Check for in-memory cached token, use if exists
  2. Check for cached to disk token, use if exists and not expired
  3. Get new token, start expiry timer
  4. Cache new token to disk
  5. On expiry timer fire, delete in-memory token

The issue seems to be an edge condition where the in-memory token is evicted from the cache while the new token is still in flight. The disk cache is then hit just as the token expiry is about to happen. The disk-cached token is used and rejected; then the new token arrives and populates the in-memory and disk caches.

So I've updated the in-memory cache eviction to also remove the disk-cached token, and set the expiry timer to fire 1 second before the token expiry.
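
In rough terms, the change works like this (a simplified sketch, not the actual OAuthProvider code; the names are illustrative):

interface CachedToken {
    access_token: string
    expires_in: number // seconds
}

let inMemoryToken: CachedToken | undefined

// Sketch of the fixed eviction: the timer fires 1 second before the token
// really expires and clears BOTH caches, so neither the memory nor the disk
// cache can hand out a token that is about to be rejected.
function startExpiryTimer(token: CachedToken, evictDiskCache: () => void) {
    const validityPeriodMs = Math.max(token.expires_in * 1000 - 1000, 0)
    setTimeout(() => {
        inMemoryToken = undefined // previously only this happened...
        evictDiskCache()          // ...now the disk-cached token is removed as well
    }, validityPeriodMs)
}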

jwulf commented 11 months ago

I have published 8.3.0-alpha9 with this fix rolled in.

@dostalradim Could you please test it out and let me know if it fixes the issue for you?

dostalradim commented 11 months ago

I installed the application with the old version, changed the token lifespan to 10 s, and waited for the error to appear. It happened after some time. Then I upgraded the package to 8.3.0-alpha9 and waited for ~24 hours, and the error did not appear.

So I think the error is solved. Thank you for your fast solution.