geopython / pywps

PyWPS is an implementation of the Web Processing Service standard from the Open Geospatial Consortium. PyWPS is written in Python.
https://pywps.org
MIT License
176 stars 117 forks source link

Better handle of crashed processes #659

Closed gschwind closed 2 years ago

gschwind commented 2 years ago

Overview

Improve the management of sub-process on linux platform.

As describe in #493 process may crash and the database keep them as running process. At some point the server does not accept new request because it reach the max parallel request limit. This patch series expect to handle this situation more properly on linux platform. The patch series must, as preliminary requirement fix an issue with the MultiProcessing module which can keep zombies processes forever, here an instance of what I can get with ps -f -u apache:

apache     21711   21702  0 14:57 ?        00:00:00 /usr/sbin/apache2 -D INFO -D LANGUAGE -D WSGI -D PROXY -DFOREGROUND
apache     21712   21702  0 14:57 ?        00:00:00 /usr/sbin/apache2 -D INFO -D LANGUAGE -D WSGI -D PROXY -DFOREGROUND
apache     21758   21704  0 14:57 ?        00:00:00 [apache2] <defunct>
apache     21884   21702  0 14:58 ?        00:00:00 /usr/sbin/apache2 -D INFO -D LANGUAGE -D WSGI -D PROXY -DFOREGROUN

I think the issue is not limited to apache. To fix the issue this patch series provide a new DetachProcessing mode that actually detach the processing of request, ensuring that the new process get become a child of pid 1. This ensure that processes will not end up as zombies. This DetachProcessing should be good for any use case on linux, i.e. other server than apache.

Thus to ensure that the patch series work the configuration must use the mode detachprocessing, other wise terminated process cannot be detected. The heuristic used in the patch series is basically to check if a process exist with the stored pid, if the pid is not here anymore, we are sure that the process is not running anymore. In case the pid still there, we are not sure, because linux may reuse the pid for another process. This is why this is a safe heuristic but not 100% accurate.

I tried several more accurate heuristic, but sometime they do not work and other time make things much more complex.

Moreover the patch check and cleanup sub-process only when pywps reach the max parralel process limit, this mean that process may be considered as not finished for a long time, i.e. until we reach the max parralel process limit. It will be better to check it at every status request also, but to implement this we require to address #354 .

Best regards

Related Issue / Discussion

This is related to #493

Additional Information

Contribution Agreement

(as per https://github.com/geopython/pywps/blob/master/CONTRIBUTING.rst#contributions-and-licensing)

coveralls commented 2 years ago

Coverage Status

Coverage decreased (-0.3%) to 81.053% when pulling 57b6301621f45cf64eb409cd5c25305463226402 on gschwind:new-handle-process into 7839f5276ced486a8b7f83379e919fb05273b4e7 on geopython:main.

gschwind commented 2 years ago

Hello,

While the patch improve the situation, the patch leave an issue that the status file of crashed process is not updated.

The issue may be difficult to solve because how storage are currently handled. In my case where status file are stored as file, I may solve the issue, but in general cases the #354 will be the solution.

Best regards