Hi @PCatinean, first of all, thanks for your post, it's instructive and explains the context very well.
Now, that's a clever and yet so simple idea, I love it.
Some points to think about:
- shouldn't we set a short idle_in_transaction_session_timeout when opening the connection, to ensure it's closed in case the client (jobrunner) doesn't stop properly?
- I'm not sure it behaves properly regarding multi databases / shared postgres, and maybe needs to be refined in this regard

I elaborate on the second point:
SELECT substring(application_name FROM 'jobrunner_(.*)')
FROM pg_stat_activity
WHERE application_name LIKE 'jobrunner_%'
ORDER BY backend_start
LIMIT 1;
This query will look for all jobrunners, so if you have a shared postgres server with many databases, and several jobrunners for different databases or different sets of databases, AFAIU only one jobrunner will be allowed to run.
Maybe instead of a uuid, the identifier should be a hash of db_name, or... That's something to think about! But if we can solve this properly, that's something that should be in queue_job, if only for the HA!
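For illustration only, here is a minimal sketch of that idea; the naming scheme and the db_names value are assumptions, not part of this PR:

import hashlib
import uuid

db_names = "db1,db2"  # hypothetical: the database(s) this jobrunner serves
db_hash = hashlib.sha1(db_names.encode()).hexdigest()[:12]

# application_name encodes both the database set and a per-process uuid,
# e.g. jobrunner_<db_hash>_<uuid>
app_name = "jobrunner_%s_%s" % (db_hash, uuid.uuid4().hex)

# The election query would then only consider candidates for the same database
# set, so jobrunners serving different databases elect independent leaders:
election_pattern = "jobrunner_%s_%%" % db_hash  # used with LIKE against pg_stat_activity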
Sidenote on:
Granted it will not work with connection pooling because it relies on uninterrupted active connections but then again the design of queue_job does not play well with most connection pooling modes anyway in my understanding.
NOTIFY/LISTEN won't work through a connection pool (unless you use the session pooling mode, but then pooling is a bit useless). However the NOTIFY part is not an issue, only the LISTEN part. So it is totally possible to set up connection pooling: the workers go through the pool, the jobrunner bypasses it. But anyway it's all the same with or without this pg_stat_activity leader election mechanism.
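For context, a minimal sketch of that asymmetry with psycopg2; the connection details are made up and the "queue_job" channel name is assumed from the standard runner:

import select
import psycopg2

# LISTEN requires a session that stays open, so this connection must bypass a
# transaction-pooling pgbouncer; NOTIFY, on the other hand, can be emitted
# through any pooled connection.
conn = psycopg2.connect(dbname="mydb", host="localhost")  # hypothetical connection info
conn.autocommit = True
with conn.cursor() as cr:
    cr.execute("LISTEN queue_job")  # channel name assumed

while True:
    # Wait up to 60s for a notification on the dedicated connection.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop()
        print("new job notification:", notify.payload)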
Hi @guewen, thanks a lot for the review and support, coming from you it's quite gratifying :)
I will address the points one by one:
shouldn't we set a short idle_in_transaction_session_timeout
Absolutely. If I'm not mistaken, the jobrunner has only short-lived and small transactions (mainly updating the individual status of jobs). All other bulky/long ones come from Odoo directly (like setting many jobs to done/cancelled etc). Not sure how low you think we can go without any potential disruption, 1 minute or even less?
The neat part of this design is that even if the process is abruptly terminated (I even did tests with kill -9 on odoo.sh workers), the pg_stat_activity table updates quite quickly. Still, no harm in setting this param as well and adding one more security layer.
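For illustration, a minimal sketch of setting the timeout per connection right after connecting; the 60s value is just an example, not a recommendation:

import psycopg2

conn = psycopg2.connect(dbname="mydb", host="localhost")  # hypothetical connection info
conn.autocommit = True
with conn.cursor() as cr:
    # Abort any transaction left idle for more than 60 seconds, so a jobrunner
    # that dies without closing its connection cannot hold locks forever.
    cr.execute("SET idle_in_transaction_session_timeout = '60s'")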
I'm not sure it behaves properly regarding multi databases / shared postgres, and maybe needs to be refined in this regard
Indeed it will not. I forgot to mention this explicitly here, but I created this module as a workaround to the limitations of odoo.sh, even though it has mainly the same obstacles as a multi-node deployment. I was not sure how stable/robust this approach was, so I did not invest in HA. Now, with a seal of approval, I can definitely work in this direction and try to incorporate the logic directly into queue_job, taking the db_name into account as well. The pg_stat_activity table also has datname as a column and I think we can use that to work around this.
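For example, the election query could be scoped to the current database; a sketch of the idea, not tested code:

# Sketch: restrict the election to connections on the same database, so each
# database gets its own jobrunner leader.
ELECTION_QUERY = """
    SELECT substring(application_name FROM 'jobrunner_(.*)')
    FROM pg_stat_activity
    WHERE application_name LIKE 'jobrunner_%'
      AND datname = current_database()
    ORDER BY backend_start
    LIMIT 1
"""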
Connection pooler
I did not think about the possibility of the jobrunner bypassing the connection pooling. All the connection details are configurable, and in that case everything should indeed be fine (maybe worth mentioning in the README). If the jobrunner goes through the same connection pool, however, I think that depending on the configuration there can be some really unwanted results (application_name not propagated/updated to the active connections, usage of a different open connection from the pooler, etc.). That would break the entire logic, unfortunately.
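To illustrate the bypass idea; the hostnames and ports below are made up for the example:

import psycopg2

# The http workers / ORM would connect through the pooler, e.g.
#   psycopg2.connect(host="pgbouncer.internal", port=6432, dbname="mydb")
# while the jobrunner keeps a direct, uninterrupted connection to Postgres so
# its application_name (and LISTEN session) lives as long as the process does.
runner_conn = psycopg2.connect(
    host="db.internal",            # hypothetical: the real Postgres host
    port=5432,
    dbname="mydb",
    application_name="jobrunner_example",
)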
In conclusion I will try to adapt this functionality to support HA in the standard queue_job instead of a separate module. Then we can run some tests (odoo.sh + multi-node) and see how everything works out :)
Thanks @PCatinean. Was in a blind panic today with queues stopping on Odoo.sh until I found your PR. Works like a charm. Wish there was an OCA-based hosting environment with custom Postgres extensions.
this module saved my life :), many thanks
Hi there, what are the exact steps to make this work?
On my odoo.sh instance, jobs keep resetting in a loop from enqueued to pending:
2023-12-12 08:28:23,105 46319 WARNING ? odoo.addons.queue_job.jobrunner.runner: state of job a828457a-36fa-4839-9099-afef4b52cd44 was reset from enqueued to pending
2023-12-12 08:28:23,106 46319 INFO ? odoo.addons.queue_job.jobrunner.runner: asking Odoo to run job a828457a-36fa-4839-9099-afef4b52cd44 on db <my-odoosh-db>-10838159
Here's my odoo.conf:
[options]
dbfilter=
server_wide_modules = base,web,odoosh_queue_job
xmlrpc_interface=<my-odoosh-domain>.odoo.com
Adding queue_job to server_wide_modules and setting the number of workers, as well as:
[queue_job]
channels=root:2
Changes nothing.
I had to set xmlrpc_interface to my actual odoo.sh domain, otherwise I was getting a connection error.
Both queue_job and odoosh_queue_job are installed.
Thank you 🙏🏻
Hi @FlorianMgs
I actually missed mentioning the queue_job configuration part in the README.md (added via a commit now). I placed it in my article and forgot about the readme.
What you need to add on top are the queue_job configuration parameters.
Remember that if you have multiple environments (staging, dev branches) you need to update the host for each. I also need to publish the module we use that does this switch automatically, so you don't have to manually edit the odoo.conf with every new build.
[queue_job]
scheme=https
host=<your-odoosh-domain>.odoo.com
port=443
You should see the jobs being processed in your logs; you can use tail -f /home/odoo/logs/odoo.log
It's working now. Thank you very much for the quick update @PCatinean ! 🙏🏻
Given this PR might already be used in some instances, and the scope of the final module is different, I made a separate PR -> https://github.com/OCA/queue/pull/607.
When you get the chance @guewen your feedback is greatly appreciated.
Good job! Any plans to merge this module? It works
Any news @guewen, do you think it can be merged soon?
I tried it and it works....
thanks,
cc @moylop260
Btw this module can be superseded by this approach which can handle odoo.sh as well as multi-node scenarios: https://github.com/OCA/queue/pull/607
I also tested that PR and worked well for odoo.sh
There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days. If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.
Closing PR for the reasons explained in https://github.com/OCA/queue/pull/607
This is a topic that has been discussed in multiple threads before (namely https://github.com/OCA/queue/issues/169, https://github.com/OCA/queue/pull/256 and https://github.com/OCA/queue/issues/244), all focused on the problems of running queue_job on odoo.sh, which has a different architecture similar to an HA deployment. As a consequence, the currently accepted method for running queue_job on odoo.sh is via the module queue_job_cron_jobrunner (https://github.com/OCA/queue/pull/415), which works but in a limited fashion.
To spare this PR a very long text reiterating all the issues in detail, I will just refer to a post I wrote on this topic: https://pledra.com/blog/tech-4/running-queue-job-on-odoo-sh-1
In short, this module uses the standard queue_job and the http workers, employing a leader election mechanism that leverages the connection info in pg_stat_activity.
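For illustration only, a minimal sketch of such an election with psycopg2; this is not the module's actual code, and the connection details are assumptions:

import uuid
import psycopg2

my_id = uuid.uuid4().hex

# Keep one dedicated, long-lived connection whose application_name identifies
# this candidate jobrunner in pg_stat_activity.
conn = psycopg2.connect(
    dbname="mydb", host="localhost",                 # hypothetical connection info
    application_name="jobrunner_%s" % my_id,
)
conn.autocommit = True

def i_am_leader():
    """Return True if this process owns the oldest jobrunner connection."""
    with conn.cursor() as cr:
        cr.execute("""
            SELECT substring(application_name FROM 'jobrunner_(.*)')
            FROM pg_stat_activity
            WHERE application_name LIKE 'jobrunner_%'
            ORDER BY backend_start
            LIMIT 1
        """)
        row = cr.fetchone()
    return bool(row) and row[0] == my_id

if i_am_leader():
    print("this worker starts the jobrunner")
else:
    print("another worker is already the leader; stand by")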
I have had this deployed in production for a week now and, from this experience alone, it has worked flawlessly with 2 or more concurrent http workers, through deployments and rebuilds, http worker restarts, and automatic scaling of the worker count based on traffic.
While this module was conceived to run on odoo.sh I could see a potential format to implement this in standard queue_job to support HA / multi-node deployments. Granted it will not work with connection pooling because it relies on uninterrupted active connections but then again the design of queue_job does not play well with most connection pooling modes anyway in my understanding.
The current approach is a tad hacky, so I would be happy to hear your feedback and suggestions on it, and how feasible or safe you think it is. There could be multiple scenarios I did not take into consideration here, so let me know.