mrimann opened this issue 1 week ago
Thanks for the thorough write-up. Having the worker check for some file sounds a bit error-prone to me. Isn't it almost equally easy (but simpler and more resilient) to stop the worker explicitly during deployments?
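With a supervisord-managed worker that could be as simple as wrapping the deployment in stop/start calls - a rough sketch, assuming the worker is registered as a program named `jobqueue-worker`:

```bash
# Before the deployment: stop the worker process
# ("jobqueue-worker" is a placeholder for the supervisord program name)
supervisorctl stop jobqueue-worker

# ... run the deployment steps ...

# After the deployment: start the worker again
supervisorctl start jobqueue-worker
```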
@bwaidelich doing it with a "lock-file" is of course just one option - and a very rudimentary one. It could also be some other kind of mechanism that achieves the same result: e.g. setting/unsetting a "feature-flag" in the database or in a key-value store like Redis would be an alternative, but that would tie the package to external dependencies - that's why I thought of a simple file as a dependency-free way.
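For illustration, a flag in Redis would amount to the same check, just against an external store instead of the filesystem - a rough sketch (the key name `jobqueue:paused` is made up):

```bash
# Pause: set the flag in Redis
redis-cli SET jobqueue:paused 1

# Worker wrapper: only start the worker when the flag is absent
if [ -z "$(redis-cli GET jobqueue:paused)" ]; then
  /path/to/flow jobqueue:executeworker
fi

# Unpause: remove the flag again
redis-cli DEL jobqueue:paused
```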
Of course, really stopping the worker would be the cleaner approach. But at least in our setup this would not work as easily - I see the following two issues that could occur when going this way:
So all in all, I still think it's a good idea to make the application itself (the worker) aware of its pause, and have it just idle for a moment.
Thanks for the explanation, makes sense to me (admin noob)! Let's see if someone is willing to provide a PR for this feature.
I think I'll revoke my idea (or leave it open, but don't expect anyone to solve it - unless there's an option I'm overlooking; details below).
When testing things with the troublesome installation, I went in and flushed the whole Job-Queue at the beginning of the deployment pipeline, with the intention of reducing the number of things a worker would try to do. I still got some exceptions during the subsequent deployment (while the application was in "lock" mode, warming up its code caches).
This led me to the thought that this is probably not solvable as proposed:
Moving the decision whether or not to invoke the worker into the application sounded great at first. But letting the worker start at all (e.g. running a Flow CLI command from supervisord or a cronjob) already triggers an exception while the application is locked and warming up its caches.
Unless this can be ignored on some low level of the application, I no longer think this is the way to go (i.e. to let the application know whether the worker shall work or pause).
Instead, I'm thinking about a solution that works around the cache-warmup locking. An easy solution could be to write a (b)lock file - and then change the supervisord or cronjob command from
```
/path/to/flow jobqueue:executeworker
```

to

```
if [[ ! -f blockfile ]]; then /path/to/flow jobqueue:executeworker; fi
```
This could be beautified of course - but it should do the basic trick: check if the file exists; if it does, do nothing; if it's not there, run the queue-worker and let it do its work.
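A slightly beautified version as a standalone wrapper script (paths are placeholders, adjust to your setup):

```bash
#!/usr/bin/env bash
# run-worker.sh - invoked by supervisord or cron instead of calling flow directly

BLOCKFILE="/var/run/deployment.lock"  # placeholder path for the (b)lock file
FLOW="/path/to/flow"

# A deployment is in progress - skip this run and exit quietly
if [ -f "$BLOCKFILE" ]; then
  exit 0
fi

exec "$FLOW" jobqueue:executeworker
```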
So we could throw that (b)lock file in at the beginning of the deployment - then do everything that's needed without the worker actively executing any Flow commands - and in the end remove the file to let the worker do its work again.
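In the deployment pipeline this would boil down to something like the following (again just a sketch, with a placeholder path):

```bash
# Beginning of the deployment: block the worker
touch /var/run/deployment.lock

# ... flush caches, deploy code, run flow:cache:warmup, etc. ...

# End of the deployment: unblock the worker again
rm -f /var/run/deployment.lock
```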
In a larger Neos installation we've seen multiple occasions of the following rough workflow, with the result of a lot of Exception-/Stack-Trace files being generated - and detected by our monitoring, causing an alert to check the health of this particular installation due to "too many Exception Files":

During a deployment, the code caches are flushed and then warmed up again (via the `flow:cache:warmup` call). While the warmup is running, the application is locked - and any access to it, be it from another CLI call, by an HTTPS request from the outside, or from the queue-worker (also CLI), results in an exception being thrown and logged.

So, I think I know how to work around this. And this issue is not about complaining. But while thinking about this issue, I thought it might be cool to have an option to "pause" the worker.
Something like a pair of CLI commands, `queueworker:pause` and `queueworker:unpause`, that would e.g. just write/remove a temporary file - and while that file is around, the worker just takes a break (e.g. does not fetch tasks from the queue to work on).
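Behaviour-wise, the two commands would just toggle such a file, and the worker would check it before fetching the next task - roughly like this (file location and sleep interval are made up for illustration):

```bash
PAUSEFILE="/tmp/jobqueue.paused"  # hypothetical location of the pause flag

# queueworker:pause would essentially do:
touch "$PAUSEFILE"

# queueworker:unpause would essentially do:
rm -f "$PAUSEFILE"

# and the worker, before fetching the next task from the queue:
if [ -f "$PAUSEFILE" ]; then
  sleep 5  # just take a break instead of fetching a task
fi
```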