geopython / pywps

PyWPS is an implementation of the Web Processing Service standard from the Open Geospatial Consortium. PyWPS is written in Python.
https://pywps.org
MIT License
178 stars 117 forks source link

More robust queue handling #493

Open jachym opened 5 years ago

jachym commented 5 years ago

As requested in the issue #455 we need to improve the queue handling.

I'm thinking about creating PyWPS deamon, which would be responsible for starting (spawning?) processes in the background, taking requests from the database table.

cehbrecht commented 4 years ago

We already introduced a service to handle the job queue in #505.

I currently working on removing stalled jobs from the queue. Stalled jobs are identified by a timeout parameter (1 day or so) and I can terminate the jobs and set the status to "failed". In addition one needs to update the status document of the job that is pulled by the client. At this point I'm stuck ... the information needed was in the pywps_stored_requests database ... but it was removed when the process was started.

Here the current work: https://github.com/geopython/pywps/compare/master...cehbrecht:dev-cancel

@jachym @davidcaron what would you recommend to solve this issue? Update the database and keep the stored request?

jachym commented 4 years ago

I would also add one of the recent requests:

Process fail can be caused not only on subjective manner (e.g. fail in the data), but also because of some objective reason (disk full).

At the moment, process request is stored only in the table pywps_requests - once the process is started, it is deleted from this table and there is now way, how to restore it.

Would be cool to change this mechanism, so that we have record of the full request.

cehbrecht commented 4 years ago

At the moment, process request is stored only in the table pywps_requests - once the process is started, it is deleted from this table and there is now way, how to restore it.

Would be cool to change this mechanism, so that we have record of the full request.

What solution would you propose? Copy the request infos to the job table? Or adding a flag to the request table with a flag/status: new/done.

davidcaron commented 4 years ago

When I implemented the function pop_first_stored, it was at a time when there was no worker, and most requests would not happen to get stored in the pywps_stored_requests table.

The problem I fixed was that the failed requests were not removed from the pywps_stored_requests table and they counted when came the time to check how many stored requests there was. The queue was filling quickly when the requests were failing.

I think like you said @cehbrecht we should either add a flag in the stored_requests table for the status (but this feels a bit weird because the jobs table is the one holding the status) or add the requests information in the jobs table for the Execute requests. I think the latter would make more sense, as I feel the purpose of the stored_requests table is to act like a message queue.

What information do you need @cehbrecht in the request, that is not currently stored in the jobs table?

cehbrecht commented 4 years ago

we need to create a wps response object to write the status document: https://github.com/geopython/pywps/blob/d05483d75e753b3cda303f5c0bb778a0f9465393/pywps/queue.py#L75 https://github.com/geopython/pywps/blob/d05483d75e753b3cda303f5c0bb778a0f9465393/pywps/response/execute.py#L43

This is the template used: https://github.com/geopython/pywps/blob/master/pywps/templates/1.0.0/execute/main.xml

davidcaron commented 4 years ago

Ok, I see... maybe without doing a database migration, we could:

This way the stored request would still be available. We could delete the row in stored_requests when the process succeeds or fails eventually.

jachym commented 4 years ago

I suggest, we use the flag and eventually remove the request, when it's successfully done (i hoe, this is n line with @davidcaron )

cehbrecht commented 4 years ago

I suggest, we use the flag and eventually remove the request, when it's successfully done (i hoe, this is n line with @davidcaron )

Currently I also would favor to set a flag in the stored_requests table. Changing the database should not be an issue. I can modify the code in this direction.

jachym commented 4 years ago

This issue is IMHO related (with proposed solution) to #491 which could be done together

huard commented 3 years ago

I'd like to revive this issue. There is a discussion planned: Thursday, April 22 · 9:00 – 10:00am (EST) Video call link: https://meet.google.com/nnv-ujjp-xao

Objective:

Agenda:

I've tried to identify the various issues that are tied to this topic.

Relevant issues:

We also have reports that mixing sync and async processes can lead to server hanging-up.

Relevant PRs:

cehbrecht commented 3 years ago

Some material:

huard commented 3 years ago

Meeting Notes

Present: @cehbrecht, @tlvu, @aulemahal, @dbyrns, @cjauvin, @huard

Action items

gschwind commented 2 years ago

Hello,

Maybe a naive solution would be a pure python daemon that serve HTTP, this daemon can be accessible by proxy. In that case the daemon can have a good control of his sub-process. This daemon can run as any regular user.

This also make pywps easy to test as standalone HTTP server.

Best regards.

gschwind commented 2 years ago

Hello,

I did some successful test regarding having a daemon accessible through a proxy. I did not change the code extensively I just looked if I can transform the current code as http server. The proxy configuration must handle wpsoutput and redirect the WPS request to the server. The code can be found in my repository: gschwind/PyWPS@058271d2ad89522e48d9e8163e4038d36b57c61b and an example of usage: gschwind/PyWPS@6f5fcb9a3606ae8b5b7a23cc68746ffbe29e4a20

As we can see the current modification are minimal, but the code currently does not solve any issue but allow to do it later.

Best regards

geotom commented 2 years ago

Hi @cehbrecht, @tlvu, @aulemahal, @dbyrns, @cjauvin, @huard

I think I have a general understanding problem regarding async processes and pyWPS. I use mainly async processes and have a NGINX/GUNICORN setup, where I configure the maximum parallel processes based on available CPUs. Now in production I also experience issues with jobs that get submitted, and accepted, but then never processed. I assume this is a queue problem. Jobs get added to the queue when I exceed my maximum parallel processes, but then are left in the (default memory sqlite). Seems that finished jobs are not removed from the queue, so pending jobs are never picket up and stay forever on 0% progress but remain accepted? My question is:

Thanks

gschwind commented 2 years ago

Hello @geotom

The memory sqlite is another issue, look at #495

gschwind commented 2 years ago

Hello,

I triggered the case where my WPS does not accept more request due to requests that are killed before their clean termination. In my case this append in out-of-memory situation, but this can be easily triggered by reboot or kill -9. The issue is that request status is updated by the process, and if the process is killed before it write that it have finished, the process will stay in that state forever, counting as running process. I don't think adding flag to the database can solve this issue.

Note that my previous suggest to use daemon does not solve the entire issue, for example reboot may leave unfinished request. I found two heuristic to fix the issue in more or less safe way. In the daemon case, we can generate a daemon session uuid, and at restart of the daemon all un-finished requests that do not belong to the session ID get marked as finished.

The second heuristic is in the case that we do not have daemon. When pywps run out of slot for running a new request he can check if the pid of the request is still alive. If not he can mark this request as failed. This solution is safe but not fully accurate because linux can reuse pid, but at less if the pid is not present, we are sure that the process is finished. To make it more accurate I thought to tag process using /proc/pid/environ, but this require use of execve, and that need change the way that sub-process are spawned. I can also use /proc/pid/cmdline and check if it does not match our /proc/self/cmdline to ensure it's not our process, because fork preserve cmdline. I guess. Notes that this solution may leave status of failled process in running state for a long time because it require a new request without available slot. Idealy this heuristic should be triggered each time the user request the status of the process, which it is not possible in current implementation of the status.

I will try to implement the last method without tag, which should solve 99% of the issue I guess.

gschwind commented 2 years ago

Following my previews comment, I did the following related pull request #659

Best regards