Setup process for Gearman

j3nsch commented 2 years ago

What would the setup process for Gearman look like? Right now we have cron scripts that need to be set up separately. Every time a new script is added, it needs to be set up as well. What would be necessary if we used Gearman? This issue is not about every little implementation detail, but about the bigger picture: the concepts, not the code.

We have Gearman, we have an OPUS 4 application, a database and scripts that perform tasks, like extracting the text from a PDF and indexing a document. What is necessary to run a job like that with Gearman? What about adding a second job script?

j3nsch commented 2 years ago

@kaustabhbarman I need a description of this for a meeting next week, so this has the highest priority right now. What needs to be done to make background processing with Gearman available, and how can new types of background processes be added?

kaustabhbarman commented 2 years ago

Let's assume we want to perform only one job with Gearman, say extracting the text from a PDF. The main components of Gearman are:

Servers to receive and store jobs in a queue (also known as a manager)
Clients that submit jobs to the manager
Workers that process these jobs

Some useful facts to know beforehand:

Gearman provides client and worker APIs that our applications call to talk with the Gearman job server (also known as Gearmand) so we don’t need to deal with networking or mapping of jobs [1].
Internally, the gearman client and worker APIs communicate with the job server using TCP sockets [2].
When a worker class is defined and registered for a job, that worker class needs to be run at least once so that it listens for incoming jobs from Gearmand. Usually, the worker class has an infinite loop, so it keeps listening until it is stopped intentionally.
Worker classes can be written in entirely different applications/languages and put on a separate machine (or cluster of machines) that are better suited to do the work. They just need to connect to the server using TCP sockets.

In order to implement the text extraction with Gearman:

The first step would be to install and start a Gearman server. We could put the setup of Gearmand into Application using a shell script (as is done for Solr), or keep it as a separate, manual installation step.
The next step would be to set up Gearman workers. We could create a new OPUS 4 library for Gearman workers and move the existing text extraction script there. We could then define our own structure of workers, start multiple workers for the same job, and add new job scripts. However, this library would need an interface for starting the workers because, as mentioned before, the workers need to be run at least once to start listening.
Lastly, have classes in Application that use the GearmanClient API to send job requests to the running Gearmand.
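
To make the three roles concrete, here is a minimal sketch in Python that models them inside a single process. It uses the standard library's `queue.Queue` as a stand-in for the job server and a thread as the worker. It does not use the Gearman client or worker APIs, and the names `submit_job`, `worker_loop`, `extract_text` and `run_demo` are hypothetical.

```python
import queue
import threading

# In-process stand-in for the job server's queue (NOT the Gearman API).
job_queue = queue.Queue()
results = {}


def extract_text(pdf_path):
    # Placeholder for the real text extraction step.
    return f"text of {pdf_path}"


def worker_loop():
    # A worker blocks waiting for jobs until it is told to stop.
    while True:
        job = job_queue.get()
        if job is None:  # sentinel value: intentional shutdown
            break
        job_id, pdf_path = job
        results[job_id] = extract_text(pdf_path)


def submit_job(job_id, pdf_path):
    # The client only enqueues the job; it does not do the work itself.
    job_queue.put((job_id, pdf_path))


def run_demo():
    worker = threading.Thread(target=worker_loop)
    worker.start()  # the worker has to be started explicitly
    submit_job(1, "doc1.pdf")
    submit_job(2, "doc2.pdf")
    job_queue.put(None)  # ask the worker to stop after the queued jobs
    worker.join()
    return results
```

The point of the sketch is the shape of `worker_loop`: nothing is processed unless that loop is running somewhere, which is the same constraint that applies to real Gearman workers.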

I think the important thing to keep in mind is that we have to start the workers manually at some point (usually before the client sends the job request, but there are also options to do it after the request has been made). I also think we cannot start a worker as part of handling a user request, because the worker runs an infinite loop and the request would never return.

j3nsch commented 2 years ago

Thanks. Unfortunately, that does not sound too promising after all. If we have to run workers (scripts) listening to Gearman, we are not much better off than before. Also, PHP is not a good language for long-running processes. The workers would be like daemon processes, and with PHP those tend to accumulate memory usage.

So how can we make this easy and robust for our example with the background extraction? Isn't there any solution for running scripts in the background with PHP?

j3nsch commented 2 years ago

In the end, I am not committed to Gearman. First, we need a solution for the background extraction. Second, it should be a solution where we can add another background script for a different purpose without it having to be set up separately.

The second part could be done by having a generic worker that picks the proper class for each job. We could then add more job classes without having to set up new workers. The type of job would be part of the information transmitted.
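
A rough sketch of such a generic worker, written in Python for brevity and independent of any queueing system: the job type travels inside the message, and a registry maps it to the handler. The message layout (`type` and `payload` keys) and all names here are assumptions for illustration only.

```python
import json

# Registry mapping a job type name to the function that handles it.
JOB_HANDLERS = {}


def job_handler(job_type):
    # Decorator that registers a handler under the given job type.
    def register(func):
        JOB_HANDLERS[job_type] = func
        return func
    return register


@job_handler("extract_text")
def extract_text(payload):
    # Placeholder for the real text extraction.
    return f"extracted {payload['file']}"


@job_handler("index_document")
def index_document(payload):
    # Placeholder for the real indexing step.
    return f"indexed document {payload['doc_id']}"


def handle_job(raw_message):
    # The single generic entry point: decode, look up the type, dispatch.
    message = json.loads(raw_message)
    handler = JOB_HANDLERS.get(message["type"])
    if handler is None:
        raise ValueError(f"unknown job type: {message['type']}")
    return handler(message["payload"])
```

Adding a new kind of background job then means registering one more handler; the worker process itself and its setup stay unchanged.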

kaustabhbarman commented 2 years ago

I think I should also mention that the Gearman client does have a function that does not wait for a worker to complete a job. It returns immediately and puts the job in a queue, to be executed when the server finds an idle worker. This is what I meant when I said that a worker can be started after the request. But the fact remains that a worker needs to be running like a daemon process to listen for job requests. I can start looking for other solutions for running scripts in the background next week, but I think that would be too late for your meeting.
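
The behaviour described here can be modelled with a small in-process sketch in Python. The `JobServer` class and its methods are hypothetical stand-ins, not the Gearman API: submitting returns a handle at once, and the job stays queued until some worker picks it up.

```python
import queue
import uuid


class JobServer:
    """In-process stand-in for a job server (not the Gearman API)."""

    def __init__(self):
        self._pending = queue.Queue()
        self.done = {}

    def submit_background(self, workload):
        # Returns immediately with a handle; no worker needs to exist yet.
        handle = uuid.uuid4().hex
        self._pending.put((handle, workload))
        return handle

    def pending_count(self):
        return self._pending.qsize()

    def run_worker_until_idle(self, func):
        # Simplification: a real worker would keep waiting for new jobs
        # instead of returning once the queue is empty.
        while True:
            try:
                handle, workload = self._pending.get_nowait()
            except queue.Empty:
                return
            self.done[handle] = func(workload)
```

For example, calling `submit_background("doc1.pdf")` on a fresh `JobServer` leaves one pending job and no result; the result only appears after `run_worker_until_idle` has been called with a processing function.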

j3nsch commented 2 years ago

I have got enough for the meeting. Thank you for the summary. Yes, you should look for another solution. Gearman seems like a good solution for parallelizing tasks or distributing them across systems, but that isn't our current goal.

OPUS4 / application

Setup process for Gearman #475