Hmm, what's going on in your custom version of the `TaskScheduler`? Wondering if there's something not caught there that's causing the backend to crash.

Upgrading to the new Backend System will definitely help you in the long run, but I'm not sure it's going to help with this particular failure - the Backstage version is also pretty outdated, but I think this has been stable in the Scaffolder for some time, so I want to work out why that error is not being caught and where it's coming from in the first place.
> Hmm, what's going on in your custom version of the `TaskScheduler`? Wondering if there's something not caught there that's causing the backend to crash.

The custom `TaskScheduler` is created to pass into each plugin's environment as `env.scheduler`, and it is then used by the Catalog - that is, to run some custom entity providers and cron jobs. I reviewed the code and exceptions are handled so they don't crash the pod/replica.
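Roughly how that wiring looks in our legacy-backend `index.ts` (a simplified sketch, not our exact code; the task id and cron schedule are placeholders):

```ts
// Simplified sketch of the wiring (not the exact code): one root TaskScheduler,
// handed to each plugin environment as env.scheduler via forPlugin().
import { PluginTaskScheduler, TaskScheduler } from '@backstage/backend-tasks';
import { Config } from '@backstage/config';
import { Logger } from 'winston';

export function makeCreateEnv(config: Config, rootLogger: Logger) {
  const taskScheduler = TaskScheduler.fromConfig(config);

  return (plugin: string) => ({
    logger: rootLogger.child({ type: 'plugin', plugin }),
    scheduler: taskScheduler.forPlugin(plugin),
    // ...database, cache, urlReader, etc. omitted
  });
}

// In the catalog plugin setup, the scoped scheduler runs the custom entity
// provider refresh as a cron job; the task id and schedule are placeholders.
export async function scheduleProviderRefresh(
  scheduler: PluginTaskScheduler,
  logger: Logger,
  refresh: () => Promise<void>,
) {
  await scheduler.scheduleTask({
    id: 'refresh-custom-entity-provider',
    frequency: { cron: '*/30 * * * *' },
    timeout: { minutes: 10 },
    fn: async () => {
      try {
        await refresh();
      } catch (error) {
        // Errors are caught and logged so a failing run never crashes the pod.
        logger.error(`custom entity provider refresh failed: ${error}`);
      }
    },
  });
}
```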
Not sure if the search plugin might have anything to do with the unhandled exception.

> Upgrading to the new Backend System will definitely help you in the long run, but I'm not sure it's going to help with this particular failure - the Backstage version is also pretty outdated, but I think this has been stable in the Scaffolder for some time, so I want to work out why that error is not being caught and where it's coming from in the first place.
It seems the scaffolder plugin has its own way to process the tasks - I reviewed the code at a high level in the backstage repo (here), but I am not sure, from a consumer perspective, how to debug why a task is stuck with status = `processing` and never gets to the next stage like `Failed` or `Complete`. When it ends up having a LOT of tasks with status = `processing`, what happens - and how can I clean them up or add some logging to find out the reason they are not getting processed? (I updated the initial post with this point.)
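For context, the kind of check I run against the scaffolder plugin's database looks roughly like this (a hand-written diagnostic script, not a scaffolder API; the default `backstage_plugin_scaffolder` database name and the `created_at`/`last_heartbeat_at` columns are assumptions based on the schema we see):

```ts
// Hand-written diagnostic against the scaffolder plugin's database.
// Not an official API; table/column names are assumptions from the schema.
import knexFactory from 'knex';

const db = knexFactory({
  client: 'pg',
  connection: process.env.SCAFFOLDER_DB_URL ?? '', // placeholder connection string
});

async function main() {
  // How many tasks are currently stuck in "processing"?
  const [{ count }] = await db('tasks')
    .where({ status: 'processing' })
    .count({ count: '*' });
  console.log(`tasks in processing: ${count}`);

  // The oldest ones, to correlate with crash/restart timestamps.
  const stuck = await db('tasks')
    .where({ status: 'processing' })
    .orderBy('created_at', 'asc')
    .limit(20)
    .select('id', 'created_at', 'last_heartbeat_at');
  console.table(stuck);

  // Possible manual cleanup (marking abandoned runs as failed) - a direct DB
  // edit, not an official mechanism, so use with care:
  // await db('tasks').where({ status: 'processing' }).update({ status: 'failed' });

  await db.destroy();
}

main().catch(error => {
  console.error(error);
  process.exit(1);
});
```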
One more item: I have the following `knexConfig` values, and I am now thinking of increasing these values and making `propagateCreateError: true` - any thoughts?
```yaml
knexConfig:
  pool:
    min: 15
    max: 30
    acquireTimeoutMillis: 60000
    createTimeoutMillis: 30000
    destroyTimeoutMillis: 5000
    idleTimeoutMillis: 60000
    reapIntervalMillis: 1000
    createRetryIntervalMillis: 200
    propagateCreateError: false
```
Some history on this `knexConfig`: we introduced it to solve the `KnexTimeoutError` shown below. It was throwing this error at deployment time, when K8S starts a new pod while multiple existing pods are already running; for some reason, introducing the `knexConfig` resolved that issue at deployment time.

```
KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
    at /app/node_modules/@backstage/backend-common/dist/index.cjs.js:1502:17
    at async KeyStores.fromConfig (/app/node_modules/@backstage/plugin-auth-backend/dist/index.cjs.js:2565:35)
    at async Object.createRouter (/app/node_modules/@backstage/plugin-auth-backend/dist/index.cjs.js:2749:20)
    at async createPlugin$h (/app/packages/backend/dist/index.cjs.js:72:10)
    at async main (/app/packages/backend/dist/index.cjs.js:2067:26)
```
So one thing to note is that if you have tasks running in the scaffolder and the backend crashes (which might not be the fault of the scaffolder; it might die for other reasons, some other plugin for example, as the infra is shared), then this will cause the task to get stuck, as right now tasks will not get picked back up again on restart. You might see this issue quite often if the backend is crashing a lot and you have long-running tasks, as this increases the chances that something is underway when it drops the task.
This resuming of tasks is being worked on at the moment, but it's not quite ready yet.
Interesting with the `knexConfig` though. Not sure that the scaffolder backend should create another client for each task or anything, it shouldn't at least.

Do you have the logs showing the reason why it crashes, or do you think that it's `Knex: Timeout acquiring a connection. The pool is probably full.` which is causing the crash of the container?
**Observations**

All we see in the log of the previously crashed pod before it exits is this unhandled exception:

```
node:internal/process/promises:288
          triggerUncaughtException(err, true /* fromPromise */);
          ^

Error: read ETIMEDOUT
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:128:17) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'read'
}

Node.js v18.20.0
```
With the `knexConfig` we were able to resolve the deployment-time crash (`Knex: Timeout acquiring a connection. The pool is probably full.`), but the ongoing intermittent crash continues. I also suspect the `knexConfig` change may be related to the stuck tasks (reason: when I run a query against the database to count the tasks where status = `processing`, the count has increased noticeably only after the release that included the `knexConfig` update to solve the other problem; we also enabled the debug logs but found nothing interesting to blame). Not sure if this is related, but I found similarities with the unhandled exception error!
@laharshah I wanna come back to some of your `knex` config that I've been digging around with recently, as we've seen some of these errors in tests in GitHub Actions for some of our CI pipelines.
One thing I would say is stay away from using `propagateCreateError: false`. `knex` isn't designed to work with it and it will break some other things.
One config that we had some success with in CI builds is something like this:
```yaml
knexConfig:
  acquireConnectionTimeout: 600000
  pool:
    acquireTimeoutMillis: 10000
    createTimeoutMillis: 600000
    destroyTimeoutMillis: 600000
    idleTimeoutMillis: 600000
    acquireTimeoutMillis: 600000
```
The `acquireConnectionTimeout` outside of the `pool` block I think was a key part, but we haven't followed up to see which ones are actually that important. Maybe give those a try and see if it helps? :pray:
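For what it's worth, `acquireConnectionTimeout` is a top-level knex option, while everything under `pool` is passed to the tarn pool, so in plain knex terms the config above works out to roughly this (just a sketch; the connection details are placeholders):

```ts
import knexFactory from 'knex';

// Rough knex equivalent of the knexConfig above. Connection details are
// placeholders; Backstage builds these from backend.database in app-config.
const db = knexFactory({
  client: 'pg',
  connection: process.env.DATABASE_URL ?? '',
  // Top-level knex option: how long knex waits overall to acquire a connection.
  acquireConnectionTimeout: 600000,
  // Everything below is handed to the tarn pool.
  pool: {
    acquireTimeoutMillis: 600000,
    createTimeoutMillis: 600000,
    destroyTimeoutMillis: 600000,
    idleTimeoutMillis: 600000,
  },
});

export default db;
```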
It's also possible that this issue is actually pretty closely related to https://github.com/backstage/backstage/issues/19863
I wonder if this is something to do with the fact that now every backend plugin you have installed will create a new table for the auth keys required for service-to-service auth, and it's possible that the pool is actually flooded because of each one of these plugins. How many plugins do you have installed in the backend?
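If you want to check whether that's what's happening, one way (assuming PostgreSQL) is to count open connections per plugin database via `pg_stat_activity`, something like:

```ts
// Quick sanity check of the "pool is flooded" theory (assuming PostgreSQL):
// count open connections per plugin database via pg_stat_activity.
import knexFactory from 'knex';

const admin = knexFactory({
  client: 'pg',
  connection: process.env.PG_ADMIN_URL ?? '', // placeholder connection string
});

async function connectionCounts() {
  const result = await admin.raw(
    `SELECT datname, count(*) AS connections
       FROM pg_stat_activity
      WHERE datname LIKE 'backstage_plugin_%'
      GROUP BY datname
      ORDER BY connections DESC`,
  );
  console.table(result.rows);
  await admin.destroy();
}

connectionCounts().catch(console.error);
```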
@benjdlambert Thanks for looking into it!

1) `knexConfig` updates
2) Number of Backend plugins (`backstage_plugin_auth` > `signing_keys` table)
@benjdlambert Tried the knex config updates - but it didn't solve the crash-restarts and the tasks-being-stuck problem. The issue was resolved with 2 actions. It has been a month since we did the migration, and since then both actions have, overall, helped to address the underlying data and/or the network issue.
📜 Description
There could be 2 separate issues, or they may have some link to each other:

1. Intermittent crash and restart.
2. Scaffolder template task stuck with the `processing` status (intermittent) - the task page `/create/tasks/{taskId}` tries to fetch the event stream from `/api/scaffolder/v2/tasks/{taskId}/eventstream`; the `tasks` table has an entry with the task in `processing` status, but the `task_events` table has no entry for the task it is trying to fetch the event stream for.

We are not sure if the crash/restart and the scaffolder tasks not processing have some relation or not.
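The check behind that second point looks roughly like this (a hand-rolled diagnostic against the scaffolder plugin's database; the `task_events.task_id` column name is an assumption from the schema):

```ts
// Short diagnostic for the symptom above: tasks in "processing" that have no
// rows at all in task_events.
import knexFactory from 'knex';

const db = knexFactory({
  client: 'pg',
  connection: process.env.SCAFFOLDER_DB_URL ?? '', // placeholder connection string
});

async function main() {
  const rows = await db('tasks')
    .where({ status: 'processing' })
    .whereNotExists(
      db('task_events').select(db.raw('1')).whereRaw('task_events.task_id = tasks.id'),
    )
    .select('id', 'created_at');
  console.table(rows);
  await db.destroy();
}

main().catch(console.error);
```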
👍 Expected behavior
Whatever process is making the pod crash, the exception should be handled gracefully with proper logging instead of crashing the pod. Also, the unhandled exception does not clearly indicate what could have caused this.
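For example, a last-resort guard along these lines in `packages/backend/src/index.ts` would at least log the rejection with some context before anything exits (a sketch of what we would expect, not something we have in place today):

```ts
// Sketch of a last-resort guard in packages/backend/src/index.ts. Node 18
// turns unhandled promise rejections into a crash by default, which matches
// the stack trace above; this at least records what happened first. Whether
// the process should still exit afterwards is a separate decision.
import { getRootLogger } from '@backstage/backend-common';

const logger = getRootLogger();

process.on('unhandledRejection', reason => {
  logger.error(`Unhandled promise rejection: ${reason}`);
});

process.on('uncaughtException', error => {
  logger.error(`Uncaught exception: ${error.stack ?? error}`);
});
```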
👎 Actual Behavior with Screenshots
We are running 2 replicas in AKS for our internal Backstage instance.
Issue 1: All we see in the log of the previously crashed pod is below:
Issue 2: a scaffolder task stays in `processing` (intermittent) - only sometimes, not always, and with no specific template, so it is hard to narrow down the cause.
👟 Reproduction steps
It is quite unique to our deployed Backstage instance. We are not able to reproduce it locally or in a lower environment; of course, production is the most-used instance. We have noticed it happening in UAT but not that often - presumably because traffic is low, as only the dev team is using it.

There are no specific steps - the crash/restart happens 5-6 times every day, with no pattern in time, but one thing we have limited knowledge about is the Task runner.
📃 Provide the context for the Bug.
No response
🖥️ Your Environment
In `index.ts` we have a custom `TaskScheduler`, and I wonder if that, together with the 2 replicas/instances running, has something to do with the scaffolder tasks not running well. Why do tasks stuck with status = `processing` never get to the next stage like `Failed` or `Complete`? When it ends up having a LOT of tasks with status = `processing`, what happens - and how can I clean them up or add some logging to find out the reason they are not getting processed?

👀 Have you spent some time to check if this bug has been raised before?
🏢 Have you read the Code of Conduct?
Are you willing to submit PR?
No, but I'm happy to collaborate on a PR with someone else