backstage / backstage

Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0
28.13k stars 5.96k forks source link

🐛 Bug Report: Intermittent crash/restart not able to figure the root cause #25872

Closed laharshah closed 2 weeks ago

laharshah commented 3 months ago

📜 Description

There could be 2 separate issue or has some link to each other.

  1. Intermittent crash and restart

    • unhandled exception causing the crash of running pod/server and AKS restarts to get back in running state.
    • makes the application unstable, though we have 2 replicas running at a time but have noticed second replica also crash after few minutes with same reason.
    • On average we notice 5 crash restarts to each replica/pod/instance.
  2. Scaffolder template Task stuck with the "processing" status. Intermittent thing -

    • User submits the template form: on submit I assume it creates the task in the backend and has the task ID and frontend it redirects user to this URL /create/tasks/{taskId}
    • the browser making a call to this API to get the task progress /api/scaffolder/v2/tasks/{taskId}/eventstream
    • we saw behind the seen the tasks table has entry with the task in processing status but the task_events has no entry where it is trying to fetch event stream from
    • event stream api fails with 504 server time out of 30 seconds and keep retrying - that seems like handled gracefully.

we are not sure if the crash/restart and scaffolder tasks not processing has some relation or not.

👍 Expected behavior

Whatever the process is making the pod crash, the exception should have handled gracefully with proper logging instead crashing the pod.

Also the unhandled exception is not clearly indicating what could have cause this.

👎 Actual Behavior with Screenshots

We are running 2 replicas in AKS for the internal backstage instance.

Issue 1: All we see in the log of the previously crashed pod is below:

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

Error: read ETIMEDOUT
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:128:17) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'read'
}

Node.js v18.20.0

Issue 2: on scaffolder task stays in processing (Intermittent) sometimes only - not always and also with not specific template, so it is hard to narrow down the cause.

image

👟 Reproduction steps

It is quite unique to our deployed backstage instance We are not able to reproduce in local or some lower environment, of course the production is most used instance, we have noticed it happening in UAT but not that often - assuming traffic is low as only dev team is using it.

There is no specific steps - it happens 5-6 times crash restart everyday, there is no pattern in time but one thing we have limited knowledge about it the Task runner

📃 Provide the context for the Bug.

No response

🖥️ Your Environment

Index.ts

import Router from 'express-promise-router';
import {
  createServiceBuilder,
  loadBackendConfig,
  getRootLogger,
  useHotMemoize,
  notFoundHandler,
  CacheManager,
  DatabaseManager,
  UrlReaders,
  ServerTokenManager,
  HostDiscovery,
} from '@backstage/backend-common';
import { TaskScheduler } from '@backstage/backend-tasks';
import { Config } from '@backstage/config';
import app from './plugins/app';
import auth from './plugins/auth';
import catalog from './plugins/catalog';
import scaffolder from './plugins/scaffolder';
import proxy from './plugins/proxy';
import techdocs from './plugins/techdocs';
import search from './plugins/search';
import apibackend from './plugins/api-backend';
import { PluginEnvironment } from './types';
import { ServerPermissionClient } from '@backstage/plugin-permission-node';
import { DefaultIdentityClient } from '@backstage/plugin-auth-node';
import announcements from './plugins/announcements';
import 'global-agent/bootstrap';
import permission from './plugins/permission';

function makeCreateEnv(config: Config) {
  const root = getRootLogger();
  const reader = UrlReaders.default({ logger: root, config });
  const discovery = HostDiscovery.fromConfig(config);

  const tokenManager = ServerTokenManager.fromConfig(config, { logger: root });
  const permissions = ServerPermissionClient.fromConfig(config, {
    discovery,
    tokenManager,
  });

  const cacheManager = CacheManager.fromConfig(config);
  const databaseManager: any = DatabaseManager.fromConfig(config, {
    logger: root,
  });
  const taskScheduler = TaskScheduler.fromConfig(config, { databaseManager });

  const identity = DefaultIdentityClient.create({
    discovery,
    algorithms: ['RS256', 'ES256'],
  });

  root.info(`Created UrlReader ${reader}`);

  return (plugin: string): PluginEnvironment => {
    const logger = root.child({ type: 'plugin', plugin });
    const database = databaseManager.forPlugin(plugin);
    const cache = cacheManager.forPlugin(plugin);
    const scheduler = taskScheduler.forPlugin(plugin);
    return {
      logger,
      database,
      cache,
      config,
      reader,
      discovery,
      tokenManager,
      scheduler,
      permissions,
      identity,
    };
  };
}

async function main() {
  const logger = getRootLogger();
  const config = await loadBackendConfig({
    argv: process.argv,
    logger: logger,
  });
  const createEnv = makeCreateEnv(config);

  const catalogEnv = useHotMemoize(module, () => createEnv('catalog'));
  const scaffolderEnv = useHotMemoize(module, () => createEnv('scaffolder'));
  const authEnv = useHotMemoize(module, () => createEnv('auth'));
  const proxyEnv = useHotMemoize(module, () => createEnv('proxy'));
  const techdocsEnv = useHotMemoize(module, () => createEnv('techdocs'));
  const searchEnv = useHotMemoize(module, () => createEnv('search'));
  const appEnv = useHotMemoize(module, () => createEnv('app'));
  const apiEnv = useHotMemoize(module, () => createEnv('apibackend'));
  const permissionEnv = useHotMemoize(module, () => createEnv('permission'));
  const announcementsEnv = useHotMemoize(module, () =>
    createEnv('announcements'),
  );

  const apiRouter = Router();
  apiRouter.use('/catalog', await catalog(catalogEnv));

  apiRouter.use('/scaffolder', (req, _res, next) => {
    delete req.headers.authorization;
    next();
  });
  apiRouter.use('/scaffolder', await scaffolder(scaffolderEnv, catalogEnv));

  apiRouter.use('/auth', await auth(authEnv));
  apiRouter.use('/techdocs', await techdocs(techdocsEnv));
  apiRouter.use('/proxy', await proxy(proxyEnv));
  apiRouter.use('/search', await search(searchEnv));
  apiRouter.use('/api', await apibackend(apiEnv));
  apiRouter.use('/announcements', await announcements(announcementsEnv));
  apiRouter.use('/permission', await permission(permissionEnv));

  // Add backends ABOVE this line; this 404 handler is the catch-all fallback
  apiRouter.use(notFoundHandler());

  const service = createServiceBuilder(module)
    .loadConfig(config)
    .addRouter('/api', apiRouter)
    .addRouter('', await app(appEnv))
    .addRouter('/proxy', await proxy(proxyEnv));

  await service.start().catch(err => {
    logger.error(err);
    process.exit(1);
  });
}

module.hot?.accept();
main().catch(error => {
  console.error('Backend failed to start up', error);
  process.exit(1);
});
node: v18.20.0
yarn: 1.22.19
cli:  0.26.10 (installed)
backstage:  1.25.0

Dependencies:
  @backstage/app-defaults                                          1.5.3
  @backstage/backend-app-api                                       0.5.14, 0.6.2, 0.7.5
  @backstage/backend-common                                        0.20.2, 0.21.6, 0.22.0
  @backstage/backend-defaults                                      0.2.16
  @backstage/backend-dev-utils                                     0.1.4
  @backstage/backend-openapi-utils                                 0.1.9
  @backstage/backend-plugin-api                                    0.6.16, 0.6.18
  @backstage/backend-tasks                                         0.5.21, 0.5.23
  @backstage/backend-test-utils                                    0.3.6, 0.3.8
  @backstage/catalog-client                                        1.6.3, 1.6.5
  @backstage/catalog-model                                         1.4.5, 1.5.0
  @backstage/cli-common                                            0.1.13
  @backstage/cli-node                                              0.2.4, 0.2.5
  @backstage/cli                                                   0.26.2
  @backstage/config-loader                                         1.7.0, 1.8.0
  @backstage/config                                                1.2.0
  @backstage/core-app-api                                          1.12.3
  @backstage/core-compat-api                                       0.2.3
  @backstage/core-components                                       0.13.10, 0.14.3, 0.9.5
  @backstage/core-plugin-api                                       1.9.1
  @backstage/dev-utils                                             1.0.30
  @backstage/e2e-test-utils                                        0.1.1
  @backstage/errors                                                1.2.4
  @backstage/eslint-plugin                                         0.1.6
  @backstage/frontend-plugin-api                                   0.6.3
  @backstage/integration-aws-node                                  0.1.12
  @backstage/integration-react                                     1.1.25
  @backstage/integration                                           1.11.0, 1.9.1
  @backstage/plugin-api-docs                                       0.11.3
  @backstage/plugin-app-backend                                    0.3.64
  @backstage/plugin-app-node                                       0.1.16
  @backstage/plugin-auth-backend-module-atlassian-provider         0.1.8
  @backstage/plugin-auth-backend-module-aws-alb-provider           0.1.8
  @backstage/plugin-auth-backend-module-gcp-iap-provider           0.2.11
  @backstage/plugin-auth-backend-module-github-provider            0.1.13
  @backstage/plugin-auth-backend-module-gitlab-provider            0.1.13
  @backstage/plugin-auth-backend-module-google-provider            0.1.13
  @backstage/plugin-auth-backend-module-guest-provider             0.1.2
  @backstage/plugin-auth-backend-module-microsoft-provider         0.1.11
  @backstage/plugin-auth-backend-module-oauth2-provider            0.1.13
  @backstage/plugin-auth-backend-module-oauth2-proxy-provider      0.1.9
  @backstage/plugin-auth-backend-module-oidc-provider              0.1.7
  @backstage/plugin-auth-backend-module-okta-provider              0.0.9
  @backstage/plugin-auth-backend                                   0.22.3
  @backstage/plugin-auth-node                                      0.4.11, 0.4.13
  @backstage/plugin-auth-react                                     0.0.3
  @backstage/plugin-bazaar-backend                                 0.3.14
  @backstage/plugin-bazaar                                         0.2.25
  @backstage/plugin-catalog-backend-module-scaffolder-entity-model 0.1.14
  @backstage/plugin-catalog-backend                                1.21.0
  @backstage/plugin-catalog-common                                 1.0.22
  @backstage/plugin-catalog-graph                                  0.4.3
  @backstage/plugin-catalog-import                                 0.10.9
  @backstage/plugin-catalog-node                                   1.11.0
  @backstage/plugin-catalog-react                                  1.11.2
  @backstage/plugin-catalog                                        1.18.2
  @backstage/plugin-events-node                                    0.3.2, 0.3.4
  @backstage/plugin-org                                            0.6.23
  @backstage/plugin-permission-backend                             0.5.40
  @backstage/plugin-permission-common                              0.7.13
  @backstage/plugin-permission-node                                0.7.27, 0.7.29
  @backstage/plugin-permission-react                               0.4.21
  @backstage/plugin-proxy-backend                                  0.4.14
  @backstage/plugin-scaffolder-backend-module-azure                0.1.8
  @backstage/plugin-scaffolder-backend-module-bitbucket-cloud      0.1.6
  @backstage/plugin-scaffolder-backend-module-bitbucket-server     0.1.6
  @backstage/plugin-scaffolder-backend-module-bitbucket            0.2.6
  @backstage/plugin-scaffolder-backend-module-gerrit               0.1.8
  @backstage/plugin-scaffolder-backend-module-gitea                0.1.6
  @backstage/plugin-scaffolder-backend-module-github               0.2.6
  @backstage/plugin-scaffolder-backend-module-gitlab               0.3.2
  @backstage/plugin-scaffolder-backend                             1.22.3
  @backstage/plugin-scaffolder-common                              1.5.1
  @backstage/plugin-scaffolder-node                                0.4.2
  @backstage/plugin-scaffolder-react                               1.8.3
  @backstage/plugin-scaffolder                                     1.19.2
  @backstage/plugin-search-backend-module-catalog                  0.1.21
  @backstage/plugin-search-backend-module-elasticsearch            1.3.19
  @backstage/plugin-search-backend-module-pg                       0.5.25
  @backstage/plugin-search-backend-module-techdocs                 0.1.21
  @backstage/plugin-search-backend-node                            1.2.20
  @backstage/plugin-search-backend                                 1.5.6
  @backstage/plugin-search-common                                  1.2.11
  @backstage/plugin-search-react                                   1.7.9
  @backstage/plugin-search                                         1.4.9
  @backstage/plugin-tech-radar                                     0.7.2
  @backstage/plugin-techdocs-backend                               1.10.3
  @backstage/plugin-techdocs-module-addons-contrib                 1.1.8
  @backstage/plugin-techdocs-node                                  1.12.2
  @backstage/plugin-techdocs-react                                 1.2.2
  @backstage/plugin-techdocs                                       1.10.3
  @backstage/plugin-user-settings                                  0.8.4
  @backstage/release-manifests                                     0.0.11
  @backstage/test-utils                                            1.5.3
  @backstage/theme                                                 0.2.19, 0.4.4, 0.5.2
  @backstage/types                                                 1.1.1
  @backstage/version-bridge                                        1.0.7
✨  Done in 2.24s.

👀 Have you spent some time to check if this bug has been raised before?

🏢 Have you read the Code of Conduct?

Are you willing to submit PR?

No, but I'm happy to collaborate on a PR with someone else

benjdlambert commented 3 months ago

Hmm, what's going on in your custom version of the TaskScheduler? Wondering if there's some things which are not caught there which is causing the backend to crash.

Upgrading the new Backend System will definitely help you in the long run, but not sure it's going to help with this particular failure - the Backstage version is also pretty outdated, but I think that this has been stable in the Scaffolder for some time, so want to work out why that error is not being caught and where it's coming from in the first place.

laharshah commented 2 months ago

Hmm, what's going on in your custom version of the TaskScheduler? Wondering if there's some things which are not caught there which is causing the backend to crash.

Upgrading the new Backend System will definitely help you in the long run, but not sure it's going to help with this particular failure - the Backstage version is also pretty outdated, but I think that this has been stable in the Scaffolder for some time, so want to work out why that error is not being caught and where it's coming from in the first place.

It seems scaffolder plugin has its own way to process the tasks, - I reviewed the code high level in backstage repo(here), I am not sure if from consumer perspective how do I debug on why the TASK is stuck with status = processing never get to next stage like Failed or Complete? When it end up having a LOT tasks with status = processing what happens - and how can I clean them up or put some logger to find out the reason it is not getting processed? (I updated initial post with this point.)


One more item, I have following knexConfig values, now thinking of increasing these values. and making the propagateCreateError: true - any thoughts?

    knexConfig:
      pool:
        min: 15
        max: 30
        acquireTimeoutMillis: 60000
        createTimeoutMillis: 30000
        destroyTimeoutMillis: 5000
        idleTimeoutMillis: 60000
        reapIntervalMillis: 1000
        createRetryIntervalMillis: 200
        propagateCreateError: false

some history on this

KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
   at /app/node_modules/@backstage/backend-common/dist/index.cjs.js:1502:17
   at async KeyStores.fromConfig (/app/node_modules/@backstage/plugin-auth-backend/dist/index.cjs.js:2565:35)
   at async Object.createRouter (/app/node_modules/@backstage/plugin-auth-backend/dist/index.cjs.js:2749:20)
   at async createPlugin$h (/app/packages/backend/dist/index.cjs.js:72:10)
   at async main (/app/packages/backend/dist/index.cjs.js:2067:26) 
benjdlambert commented 2 months ago

So one thing to note is that if you have tasks running in the scaffolder, and the backend crashes, which might not be the fault of the scaffolder, it might die for another reasons, some other plugin for example as the infra is shared, then this will cause the task to get stuck as right now tasks will not get picked back up again on restart. You might see this issue quite often if the backend is crashing a lot, and you have long running tasks as this increases the chances that something is underway when it drops the task.

This resuming of tasks is being worked on at the moment, but it's not quite ready yet.

Interesting with the knexConfig though. Not sure that the scaffolder backend should create another client for each task or anything, it shouldn't at least.

Do you have the logs at the reason why it crashes, or do you think that it's Knex: Timeout acquiring a connection. The pool is probably full. which is causing the crash of the container?

laharshah commented 2 months ago

Observations

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

Error: read ETIMEDOUT
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:128:17) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'read'
}

Node.js v18.20.0

Knex: Timeout acquiring a connection. The pool is probably full.

laharshah commented 2 months ago

Not sure if this is related, found similarities with the unhandled exception error!

https://github.com/brianc/node-postgres/issues/2764

benjdlambert commented 2 months ago

@laharshah I wanna come back to some of your knex config that I've been digging around with recently as we've seen some of these errors in tests in Github Actions for some of our CI pipeline.

One thing I would say is stay away from using propagateCreateError: false. knex isn't designed to work with it and it will break some other things.

One config that we had some success with in CI builds is something like this:

knexConfig:
  acquireConnectionTimeout: 600000
  pool: 
    acquireTimeoutMillis: 10000
    createTimeoutMillis: 600000
    destroyTimeoutMillis: 600000
    idleTimeoutMillis: 600000
    acquireTimeoutMillis: 600000

The acquireConnectionTimeout outside of the pool block I think was a key part, but we haven't followed up to see which ones are actually that important. Maybe give those a try and see if it helps? :pray:

benjdlambert commented 2 months ago

It's also possible that this issue is actually pretty closely related to https://github.com/backstage/backstage/issues/19863

I wonder if this is something to do with the fact now that every backend plugin that you have installed will create a new table for the auth keys required for service-to-service auth, and it's possible that the pool is actually flooded because of each one of these plugins. How many plugins do you have installed in the backend?

laharshah commented 1 month ago

@benjdlambert Thanks for looking into!

1) knexConfig updates

2) Number of Backend plugins

laharshah commented 2 weeks ago

@benjdlambert Tried the knex config updates - but it didn't solve the crash-restarts and tasks being stuck problem. Issue is resolved with following 2 actions.

  1. Migration of the database from on-prem to the DB to Azure PostgreSQL - Flexible Server. Nodejs app was already running in AKS.
  2. With the DB switch/migration, we did the fresh ingest of all the entities and did fresh start for most of the data tables.

It is been a month we did the migration and since then

Overall both the actions helped to address underlying data and/or the network issue.