mrimann opened this issue 1 week ago
Thanks for the thorough write-up. Having the worker check for some file sounds a bit error-prone to me. Isn't it almost equally easy (but simpler and more resilient) to stop the worker explicitly during deployments?
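With a supervisord-managed worker that could be as simple as wrapping the deployment in stop/start calls - a rough sketch, assuming the worker is registered as a program named `jobqueue-worker`:

```bash
# Before the deployment: stop the worker process
# ("jobqueue-worker" is a placeholder for the supervisord program name)
supervisorctl stop jobqueue-worker

# ... run the deployment steps ...

# After the deployment: start the worker again
supervisorctl start jobqueue-worker
```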
@bwaidelich doing it with a "lock-file" is of course just one option - and a very rudimentary one. It could also be some other kind of mechanism that achieves the same result: e.g. setting/unsetting a "feature-flag" in the database or in a key-value store like Redis would be an alternative, but that would tie the package to external dependencies - that's why I thought of a simple file as a dependency-free way.
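For illustration, a flag in Redis would amount to the same check, just against an external store instead of the filesystem - a rough sketch (the key name `jobqueue:paused` is made up):

```bash
# Pause: set the flag in Redis
redis-cli SET jobqueue:paused 1

# Worker wrapper: only start the worker when the flag is absent
if [ -z "$(redis-cli GET jobqueue:paused)" ]; then
  /path/to/flow jobqueue:executeworker
fi

# Unpause: remove the flag again
redis-cli DEL jobqueue:paused
```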
Of course, really stopping the worker would be the cleaner approach. But at least in our setup this would not work as easily - I see the following two issues that could occur when going this way:
So all in all, I still think it's a good idea to make the application itself (the worker) aware of its pause, and have it just idle for a moment.
Thanks for the explanation, makes sense to me (admin noob)! Let's see if someone is willing to provide a PR for this feature.
I think I'll revoke my idea (or leave it open, but don't expect anyone to solve it - unless there's an option I'm overlooking; details below).
When testing things with the troublesome installation, I went in and flushed the whole Job-Queue at the beginning of the deployment pipeline, with the intention of reducing the number of things a worker would try to do. I still got some exceptions during the subsequent deployment (while the application was in "lock" mode, warming up its code caches).
This led me to the thought that this is probably not solvable as proposed:
Moving the decision whether or not to invoke the worker into the application sounded great at first. But letting the worker start at all (e.g. running a Flow CLI command from supervisord or a cronjob) already triggers an exception while the application is locked and warming up its caches.
Unless this can be ignored on some low level of the application, I no longer think this is the way to go (i.e. to let the application know whether the worker shall work or pause).
Instead, I'm thinking about a solution that works around the cache-warmup locking. An easy solution could be to write a (b)lock file - and then change the supervisord or cronjob command from
```
/path/to/flow jobqueue:executeworker
```

to

```
if [[ ! -f blockfile ]]; then /path/to/flow jobqueue:executeworker; fi
```
This could be beautified of course - but it should do the basic trick: check if the file exists; if it does, do nothing; if it's not there, run the queue-worker and let it do its work.
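A slightly beautified version as a standalone wrapper script (paths are placeholders, adjust to your setup):

```bash
#!/usr/bin/env bash
# run-worker.sh - invoked by supervisord or cron instead of calling flow directly

BLOCKFILE="/var/run/deployment.lock"  # placeholder path for the (b)lock file
FLOW="/path/to/flow"

# A deployment is in progress - skip this run and exit quietly
if [ -f "$BLOCKFILE" ]; then
  exit 0
fi

exec "$FLOW" jobqueue:executeworker
```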
So we could throw that (b)lock file in at the beginning of the deployment - then do everything that's needed without the worker actively executing any Flow commands - and in the end remove the file to let the worker do its work again.
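In the deployment pipeline this would boil down to something like the following (again just a sketch, with a placeholder path):

```bash
# Beginning of the deployment: block the worker
touch /var/run/deployment.lock

# ... flush caches, deploy code, run flow:cache:warmup, etc. ...

# End of the deployment: unblock the worker again
rm -f /var/run/deployment.lock
```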
In a larger Neos installation we've seen multiple occasions of the following rough workflow, with the result of a lot of Exception-/Stack-Trace files being generated - and detected by our monitoring, causing an alert to check the health of this particular installation due to "too many Exception Files":

During a deployment, the code caches are flushed and then warmed up again (via the `flow:cache:warmup` call). While the warmup is running, the application is locked - and any access to it, be it from another CLI call, by an HTTPS request from the outside, or from the queue-worker (also CLI), results in an exception being thrown and logged.

So, I think I know how to work around this. And this issue is not about complaining. But while thinking about this issue, I thought it might be cool to have an option to "pause" the worker.
Something like a pair of CLI commands, `queueworker:pause` and `queueworker:unpause`, that would e.g. just write/remove a temporary file - and while that file is around, the worker just takes a break (e.g. does not fetch tasks from the queue to work on).
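Behaviour-wise, the two commands would just toggle such a file, and the worker would check it before fetching the next task - roughly like this (file location and sleep interval are made up for illustration):

```bash
PAUSEFILE="/tmp/jobqueue.paused"  # hypothetical location of the pause flag

# queueworker:pause would essentially do:
touch "$PAUSEFILE"

# queueworker:unpause would essentially do:
rm -f "$PAUSEFILE"

# and the worker, before fetching the next task from the queue:
if [ -f "$PAUSEFILE" ]; then
  sleep 5  # just take a break instead of fetching a task
fi
```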