Hmm, what's going on in your custom version of the `TaskScheduler`? Wondering if there's something not caught there that's causing the backend to crash.

Upgrading to the new Backend System will definitely help you in the long run, but I'm not sure it's going to help with this particular failure - the Backstage version is also pretty outdated, but I think this has been stable in the Scaffolder for some time, so I want to work out why that error is not being caught and where it's coming from in the first place.
> Hmm, what's going on in your custom version of the `TaskScheduler`? Wondering if there's something not caught there that's causing the backend to crash.

The custom `TaskScheduler` is created to pass into each plugin's environment as `env.scheduler`, and it is then used by the Catalog - that is, to run some custom entity providers and cron jobs. I reviewed the code and exceptions are handled so they don't crash the pod/replica.
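Roughly how that wiring looks in our legacy-backend `index.ts` (a simplified sketch, not our exact code; the task id and cron schedule are placeholders):

```ts
// Simplified sketch of the wiring (not the exact code): one root TaskScheduler,
// handed to each plugin environment as env.scheduler via forPlugin().
import { PluginTaskScheduler, TaskScheduler } from '@backstage/backend-tasks';
import { Config } from '@backstage/config';
import { Logger } from 'winston';

export function makeCreateEnv(config: Config, rootLogger: Logger) {
  const taskScheduler = TaskScheduler.fromConfig(config);

  return (plugin: string) => ({
    logger: rootLogger.child({ type: 'plugin', plugin }),
    scheduler: taskScheduler.forPlugin(plugin),
    // ...database, cache, urlReader, etc. omitted
  });
}

// In the catalog plugin setup, the scoped scheduler runs the custom entity
// provider refresh as a cron job; the task id and schedule are placeholders.
export async function scheduleProviderRefresh(
  scheduler: PluginTaskScheduler,
  logger: Logger,
  refresh: () => Promise<void>,
) {
  await scheduler.scheduleTask({
    id: 'refresh-custom-entity-provider',
    frequency: { cron: '*/30 * * * *' },
    timeout: { minutes: 10 },
    fn: async () => {
      try {
        await refresh();
      } catch (error) {
        // Errors are caught and logged so a failing run never crashes the pod.
        logger.error(`custom entity provider refresh failed: ${error}`);
      }
    },
  });
}
```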
Not sure if the search plugin might have anything to do with the unhandled exception.

> Upgrading to the new Backend System will definitely help you in the long run, but I'm not sure it's going to help with this particular failure - the Backstage version is also pretty outdated, but I think this has been stable in the Scaffolder for some time, so I want to work out why that error is not being caught and where it's coming from in the first place.
It seems the scaffolder plugin has its own way to process the tasks - I reviewed the code at a high level in the backstage repo (here), but I am not sure, from a consumer perspective, how to debug why a task is stuck with status = `processing` and never gets to the next stage like `Failed` or `Complete`. When it ends up having a LOT of tasks with status = `processing`, what happens - and how can I clean them up or add some logging to find out the reason they are not getting processed? (I updated the initial post with this point.)
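For context, the kind of check I run against the scaffolder plugin's database looks roughly like this (a hand-written diagnostic script, not a scaffolder API; the default `backstage_plugin_scaffolder` database name and the `created_at`/`last_heartbeat_at` columns are assumptions based on the schema we see):

```ts
// Hand-written diagnostic against the scaffolder plugin's database.
// Not an official API; table/column names are assumptions from the schema.
import knexFactory from 'knex';

const db = knexFactory({
  client: 'pg',
  connection: process.env.SCAFFOLDER_DB_URL ?? '', // placeholder connection string
});

async function main() {
  // How many tasks are currently stuck in "processing"?
  const [{ count }] = await db('tasks')
    .where({ status: 'processing' })
    .count({ count: '*' });
  console.log(`tasks in processing: ${count}`);

  // The oldest ones, to correlate with crash/restart timestamps.
  const stuck = await db('tasks')
    .where({ status: 'processing' })
    .orderBy('created_at', 'asc')
    .limit(20)
    .select('id', 'created_at', 'last_heartbeat_at');
  console.table(stuck);

  // Possible manual cleanup (marking abandoned runs as failed) - a direct DB
  // edit, not an official mechanism, so use with care:
  // await db('tasks').where({ status: 'processing' }).update({ status: 'failed' });

  await db.destroy();
}

main().catch(error => {
  console.error(error);
  process.exit(1);
});
```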
One more item: I have the following `knexConfig` values, and I am now thinking of increasing these values and making `propagateCreateError: true` - any thoughts?
```yaml
knexConfig:
  pool:
    min: 15
    max: 30
    acquireTimeoutMillis: 60000
    createTimeoutMillis: 30000
    destroyTimeoutMillis: 5000
    idleTimeoutMillis: 60000
    reapIntervalMillis: 1000
    createRetryIntervalMillis: 200
    propagateCreateError: false
```
Some history on this `knexConfig`: we introduced it to solve the `KnexTimeoutError` shown below. It was throwing this error at deployment time, when K8S starts a new pod while multiple existing pods are already running; for some reason, introducing the `knexConfig` resolved that issue at deployment time.

```
KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
    at /app/node_modules/@backstage/backend-common/dist/index.cjs.js:1502:17
    at async KeyStores.fromConfig (/app/node_modules/@backstage/plugin-auth-backend/dist/index.cjs.js:2565:35)
    at async Object.createRouter (/app/node_modules/@backstage/plugin-auth-backend/dist/index.cjs.js:2749:20)
    at async createPlugin$h (/app/packages/backend/dist/index.cjs.js:72:10)
    at async main (/app/packages/backend/dist/index.cjs.js:2067:26)
```
So one thing to note is that if you have tasks running in the scaffolder and the backend crashes (which might not be the fault of the scaffolder; it might die for other reasons, some other plugin for example, as the infra is shared), then this will cause the task to get stuck, as right now tasks will not get picked back up again on restart. You might see this issue quite often if the backend is crashing a lot and you have long-running tasks, as this increases the chances that something is underway when it drops the task.
This resuming of tasks is being worked on at the moment, but it's not quite ready yet.
Interesting with the `knexConfig` though. Not sure that the scaffolder backend should create another client for each task or anything, it shouldn't at least.

Do you have the logs showing the reason why it crashes, or do you think that it's `Knex: Timeout acquiring a connection. The pool is probably full.` which is causing the crash of the container?
**Observations**

All we see in the log of the previously crashed pod before it exits is this unhandled exception:

```
node:internal/process/promises:288
          triggerUncaughtException(err, true /* fromPromise */);
          ^

Error: read ETIMEDOUT
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:128:17) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'read'
}

Node.js v18.20.0
```
With the `knexConfig` we were able to resolve the deployment-time crash (`Knex: Timeout acquiring a connection. The pool is probably full.`), but the ongoing intermittent crash continues. I also suspect the `knexConfig` change may be related to the stuck tasks (reason: when I run a query against the database to count the tasks where status = `processing`, the count has increased noticeably only after the release that included the `knexConfig` update to solve the other problem; we also enabled the debug logs but found nothing interesting to blame). Not sure if this is related, but I found similarities with the unhandled exception error!
@laharshah I wanna come back to some of your `knex` config that I've been digging around with recently, as we've seen some of these errors in tests in GitHub Actions for some of our CI pipelines.
One thing I would say is stay away from using `propagateCreateError: false`. `knex` isn't designed to work with it and it will break some other things.
One config that we had some success with in CI builds is something like this:
```yaml
knexConfig:
  acquireConnectionTimeout: 600000
  pool:
    acquireTimeoutMillis: 10000
    createTimeoutMillis: 600000
    destroyTimeoutMillis: 600000
    idleTimeoutMillis: 600000
    acquireTimeoutMillis: 600000
```
The `acquireConnectionTimeout` outside of the `pool` block I think was a key part, but we haven't followed up to see which ones are actually that important. Maybe give those a try and see if it helps? :pray:
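For what it's worth, `acquireConnectionTimeout` is a top-level knex option, while everything under `pool` is passed to the tarn pool, so in plain knex terms the config above works out to roughly this (just a sketch; the connection details are placeholders):

```ts
import knexFactory from 'knex';

// Rough knex equivalent of the knexConfig above. Connection details are
// placeholders; Backstage builds these from backend.database in app-config.
const db = knexFactory({
  client: 'pg',
  connection: process.env.DATABASE_URL ?? '',
  // Top-level knex option: how long knex waits overall to acquire a connection.
  acquireConnectionTimeout: 600000,
  // Everything below is handed to the tarn pool.
  pool: {
    acquireTimeoutMillis: 600000,
    createTimeoutMillis: 600000,
    destroyTimeoutMillis: 600000,
    idleTimeoutMillis: 600000,
  },
});

export default db;
```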
It's also possible that this issue is actually pretty closely related to https://github.com/backstage/backstage/issues/19863
I wonder if this is something to do with the fact that now every backend plugin you have installed will create a new table for the auth keys required for service-to-service auth, and it's possible that the pool is actually flooded because of each one of these plugins. How many plugins do you have installed in the backend?
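If you want to check whether that's what's happening, one way (assuming PostgreSQL) is to count open connections per plugin database via `pg_stat_activity`, something like:

```ts
// Quick sanity check of the "pool is flooded" theory (assuming PostgreSQL):
// count open connections per plugin database via pg_stat_activity.
import knexFactory from 'knex';

const admin = knexFactory({
  client: 'pg',
  connection: process.env.PG_ADMIN_URL ?? '', // placeholder connection string
});

async function connectionCounts() {
  const result = await admin.raw(
    `SELECT datname, count(*) AS connections
       FROM pg_stat_activity
      WHERE datname LIKE 'backstage_plugin_%'
      GROUP BY datname
      ORDER BY connections DESC`,
  );
  console.table(result.rows);
  await admin.destroy();
}

connectionCounts().catch(console.error);
```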
@benjdlambert Thanks for looking into it!

1) `knexConfig` updates
2) Number of Backend plugins (`backstage_plugin_auth` > `signing_keys` table)
@benjdlambert Tried the knex config updates - but it didn't solve the crash-restarts and the tasks-being-stuck problem. The issue was resolved with 2 actions. It has been a month since we did the migration, and since then both actions have, overall, helped to address the underlying data and/or the network issue.
📜 Description
There could be 2 separate issues, or they may have some link to each other:

1. Intermittent crash and restart.
2. Scaffolder template task stuck with the `processing` status (intermittent) - the task page `/create/tasks/{taskId}` tries to fetch the event stream from `/api/scaffolder/v2/tasks/{taskId}/eventstream`; the `tasks` table has an entry with the task in `processing` status, but the `task_events` table has no entry for the task it is trying to fetch the event stream for.

We are not sure if the crash/restart and the scaffolder tasks not processing have some relation or not.
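The check behind that second point looks roughly like this (a hand-rolled diagnostic against the scaffolder plugin's database; the `task_events.task_id` column name is an assumption from the schema):

```ts
// Short diagnostic for the symptom above: tasks in "processing" that have no
// rows at all in task_events.
import knexFactory from 'knex';

const db = knexFactory({
  client: 'pg',
  connection: process.env.SCAFFOLDER_DB_URL ?? '', // placeholder connection string
});

async function main() {
  const rows = await db('tasks')
    .where({ status: 'processing' })
    .whereNotExists(
      db('task_events').select(db.raw('1')).whereRaw('task_events.task_id = tasks.id'),
    )
    .select('id', 'created_at');
  console.table(rows);
  await db.destroy();
}

main().catch(console.error);
```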
👍 Expected behavior
Whatever process is making the pod crash, the exception should be handled gracefully with proper logging instead of crashing the pod. Also, the unhandled exception does not clearly indicate what could have caused this.
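For example, a last-resort guard along these lines in `packages/backend/src/index.ts` would at least log the rejection with some context before anything exits (a sketch of what we would expect, not something we have in place today):

```ts
// Sketch of a last-resort guard in packages/backend/src/index.ts. Node 18
// turns unhandled promise rejections into a crash by default, which matches
// the stack trace above; this at least records what happened first. Whether
// the process should still exit afterwards is a separate decision.
import { getRootLogger } from '@backstage/backend-common';

const logger = getRootLogger();

process.on('unhandledRejection', reason => {
  logger.error(`Unhandled promise rejection: ${reason}`);
});

process.on('uncaughtException', error => {
  logger.error(`Uncaught exception: ${error.stack ?? error}`);
});
```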
👎 Actual Behavior with Screenshots
We are running 2 replicas in AKS for our internal Backstage instance.
Issue 1: All we see in the log of the previously crashed pod is below:
Issue 2: a scaffolder task stays in `processing` (intermittent) - only sometimes, not always, and with no specific template, so it is hard to narrow down the cause.
👟 Reproduction steps
It is quite unique to our deployed Backstage instance. We are not able to reproduce it locally or in a lower environment; of course, production is the most-used instance. We have noticed it happening in UAT but not that often - presumably because traffic is low, as only the dev team is using it.

There are no specific steps - the crash/restart happens 5-6 times every day, with no pattern in time, but one thing we have limited knowledge about is the Task runner.
📃 Provide the context for the Bug.
No response
🖥️ Your Environment
In `index.ts` we have a custom `TaskScheduler`, and I wonder if that, together with the 2 replicas/instances running, has something to do with the scaffolder tasks not running well. Why do tasks stuck with status = `processing` never get to the next stage like `Failed` or `Complete`? When it ends up having a LOT of tasks with status = `processing`, what happens - and how can I clean them up or add some logging to find out the reason they are not getting processed?

👀 Have you spent some time to check if this bug has been raised before?
🏢 Have you read the Code of Conduct?
Are you willing to submit PR?
No, but I'm happy to collaborate on a PR with someone else