Flowpack / jobqueue-common

A base for job queue handling in Flow framework applications
MIT License

Pause worker during deployment and cache-warmup possible? #66

Open mrimann opened 1 week ago

mrimann commented 1 week ago

In a larger Neos installation we've seen multiple occasions of the following rough workflow, resulting in a lot of Exception/Stack-Trace files being generated - which were then detected by our monitoring and triggered an alert to check the health of this particular installation due to "too many Exception Files".

So, I think I know how to work around this, and this issue is not meant as a complaint. But while thinking about it, I thought it might be useful to have an option to "pause" the worker.

Something like a pair of CLI commands, queueworker:pause and queueworker:unpause, that would e.g. just write/remove a temporary file; while that file exists, the worker simply takes a break (i.e. does not fetch tasks from the queue).
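To illustrate the proposed semantics, here is a minimal shell sketch of the marker-file idea - the file path and function names are assumptions for illustration, not existing Flow commands:

```shell
# Marker file whose presence tells the worker to idle (path is an assumption)
PAUSE_FILE="${PAUSE_FILE:-/tmp/jobqueue-worker.paused}"

# What queueworker:pause / queueworker:unpause could boil down to
pause_worker()   { touch "$PAUSE_FILE"; }
unpause_worker() { rm -f "$PAUSE_FILE"; }

# Check the worker could perform before fetching tasks from the queue
worker_is_paused() { [ -f "$PAUSE_FILE" ]; }
```

The worker loop would then skip fetching jobs whenever `worker_is_paused` succeeds, and resume normally once the file is removed.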

bwaidelich commented 1 week ago

Thanks for the thorough write-up. Having the worker check for some file sounds a bit error-prone to me. Isn't it almost equally easy (but simpler and more resilient) to stop the worker explicitly during deployments?

mrimann commented 1 week ago

@bwaidelich doing it with a "lock file" is of course just one option - a very rudimentary one. It could also be some other mechanism that achieves the same result, e.g. setting/unsetting a "feature flag" in the database or in a key-value store like Redis. But that would tie the package to external dependencies, which is why I thought of a simple file as a dependency-free approach.

Of course, really stopping the worker would be the cleaner approach. But at least in our setup this would not work as easily:

I see the following two issues that could occur with this approach:

So all in all, I think it's still a good idea to make the application itself (the worker) aware of its pause state, so that it just idles for a moment.

bwaidelich commented 1 week ago

Thanks for the explanation, makes sense to me (admin noob)! Let's see if someone is willing to provide a PR for this feature.

mrimann commented 1 week ago

I think I'll withdraw my idea (or leave the issue open, but don't expect anyone to solve it - unless there's an option I've overlooked; details below).

When testing with the troublesome installation, I flushed the whole job queue at the beginning of the deployment pipeline, intending to reduce the number of things a worker would try to do. I still got some exceptions during the following deployment (while the application was in "lock" mode, warming up the code caches).

This led me to the conclusion that this is probably not solvable as proposed:

Moving the decision of whether to invoke the worker into the application sounded great at first. But starting the worker at all (e.g. running a Flow CLI command from supervisord or a cronjob) already triggers an exception while the application is locked and warming up its caches.

Unless this can be ignored at some low level of the application, I no longer think this is the way to go (letting the application know whether the worker shall work or pause).

Instead, I'm thinking about a solution that works around the cache-warmup locking. An easy option would be to write a (b)lock file and then change the supervisord or cronjob command from

/path/to/flow jobqueue:executeworker

to

if [[ ! -f blockfile ]]; then /path/to/flow jobqueue:executeworker; fi

This could be beautified, of course, but it should do the basic trick: check if the file exists; if it does, do nothing - if it doesn't, run the queue worker and let it do its work.

So we could create that (b)lockfile at the beginning of the deployment, then do everything that's needed without the worker executing any Flow commands, and at the end remove the file to let the worker do its work again.
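Put together, the deployment side of this could look like the following sketch - the blockfile path is an assumption, and the actual deployment steps are only indicated by a comment:

```shell
#!/bin/sh
# Sketch of the deployment sequence described above.
# The guarded cron/supervisord entry checks for this same file before
# invoking "/path/to/flow jobqueue:executeworker".
BLOCKFILE="${BLOCKFILE:-/tmp/jobqueue.block}"   # path is an assumption

touch "$BLOCKFILE"    # start of deployment: guarded worker invocations become no-ops
# ... code update, cache warmup, and other deployment steps run here ...
rm -f "$BLOCKFILE"    # deployment finished: the next scheduled run executes the worker again
```

Since the guard lives entirely in the cron/supervisord command line, no Flow command is executed while the blockfile exists, so the cache-warmup lock never gets a chance to throw.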