dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Gunicorn preload flag not working with latest XGBoost version (post 1.6.0) #8040

Open anatolec opened 2 years ago

anatolec commented 2 years ago

We have an XGBoost model served via a Flask app on Heroku. This Flask app is launched using gunicorn. We use gunicorn's --preload option so that memory is shared across the 4 workers that are launched.

This setup worked well until we upgraded to XGBoost 1.6.0, when it stopped working: the predict function of our XGBClassifier now hangs forever.
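
Roughly, the setup looks like this (a simplified sketch, not our exact code; the module and model file names are placeholders):

# run.py -- simplified sketch of the app served by gunicorn
import numpy as np
from flask import Flask, jsonify, request
from xgboost import XGBClassifier

app = Flask(__name__)

# With --preload, this import-time code runs once in the gunicorn master
# process, before the worker processes are forked.
model = XGBClassifier()
model.load_model("model.json")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    # This is the call that hangs in the forked workers with xgboost >= 1.6.
    prediction = model.predict(np.asarray([features]))
    return jsonify(prediction.tolist())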

Environment:

Flask==2.1.2
gunicorn==20.1.0
numpy==1.22.4
xgboost==1.6.1

Gunicorn command: web: gunicorn --preload -w 4 run:app

Error:

[2022-06-29 18:18:31 +0200] [17983] [CRITICAL] WORKER TIMEOUT (pid:17984)
[2022-06-29 18:18:32 +0200] [17983] [WARNING] Worker with pid 17984 was terminated due to signal 9
trivialfis commented 2 years ago

I'm not familiar with these packages; it would be great if someone with that expertise could take a look.

josiahkhor commented 2 years ago

@AnatolePledg do you still get the hang if you test with web: gunicorn --preload -w 1 run:app? Or if you use inplace_predict instead?

I started experiencing a similar hang with XGBClassifier after upgrading to 1.6.1 (running inside a worker process on Heroku). My predict calls were inside a multiprocessing.Pool, which I think is surfacing known issues with threading/forked workers and predict:

https://github.com/dmlc/xgboost/issues/4246 https://github.com/dmlc/xgboost/issues/7044

and a couple of others. I'm curious why these issues only started surfacing with 1.6 and not before, but I wonder if gunicorn's model for creating workers is causing a similar conflict for you.
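
For reference, the pattern in my case was roughly the following (a simplified sketch; the model path and data are placeholders). The model is loaded, and OpenMP initialised, in the parent process before the pool forks its workers:

import multiprocessing

import numpy as np
from xgboost import XGBClassifier

model = XGBClassifier()
model.load_model("model.json")

def score(batch):
    # predict() in a forked worker can hang if the OpenMP runtime was
    # already initialised in the parent before the fork.
    return model.predict(np.asarray(batch))

if __name__ == "__main__":
    batches = [[[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6]]]
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(score, batches)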

anatolec commented 2 years ago

Hi @josiahkhor, -w 1 and inplace_predict do not fix the problem. It still hangs forever. It's really the activation of the --preload option that causes the issue.

maroshmka commented 1 year ago

hey, we're experiencing the same issue.

gunicorn uses os.fork() to spawn a new worker - https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L567 - which I suppose is a plain Unix fork if you're running Linux in Docker.

I was able to find this issue - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58378 - which partly explains the problem. It also says it was addressed here - https://bugs.python.org/issue8713 - by adding the "forkserver" start method, unfortunately only in multiprocessing, which gunicorn is not using :( as it calls the "raw" os.fork() API.

I don't know what the fix should be; I don't see a way to ask for forkserver with the os module in Python. In addition, the problem might lie somewhere else as well, as it is still unknown why it only became an issue in >1.6.0.
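
For completeness, forkserver can be requested like this when you control the process creation yourself (a minimal sketch with placeholder paths; as said above, gunicorn calls os.fork() directly, so this doesn't apply to it):

import multiprocessing as mp

import numpy as np
from xgboost import XGBClassifier

def score(path, batch):
    # Each forkserver child is forked from a clean helper process, so it
    # does not inherit the OpenMP state of the main process.
    model = XGBClassifier()
    model.load_model(path)
    return model.predict(np.asarray(batch))

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")
    with ctx.Pool(processes=2) as pool:
        print(pool.starmap(score, [("model.json", [[0.1, 0.2, 0.3]])]))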

lemme know what you think

Raemi commented 1 week ago

I think I found a workaround for this specific problem with Gunicorn and xgboost>2.0.0. OpenMP 5 introduced omp_pause_resource_all, a function that releases all OpenMP runtime resources, which we can call before os.fork(): https://www.openmp.org/spec-html/5.0/openmpsu153.html

In particular, the following pre_fork() hook seems to work. Put it into a hooks.py file and pass the file to gunicorn with -c hooks.py:

import ctypes
from xgboost.libpath import find_lib_path

def pre_fork(server, worker) -> None:  # type: ignore
    # Gunicorn calls this hook in the master process right before forking each worker.
    lib_xgboost = find_lib_path()
    print(f"Found libxgboost: {lib_xgboost}")

    if not lib_xgboost:
        print("Cannot release OpenMP resources before fork.")
    else:
        # Load the xgboost shared library and call omp_pause_resource_all
        # (OpenMP 5.0) so the runtime releases its threads and other
        # resources before the fork.
        lib = ctypes.CDLL(lib_xgboost[0])
        OMP_PAUSE_SOFT = 1  # omp_pause_soft from the omp_pause_resource_t enum
        lib.omp_pause_resource_all.restype = ctypes.c_int
        lib.omp_pause_resource_all.argtypes = [ctypes.c_int]
        kind = ctypes.c_int(OMP_PAUSE_SOFT)
        result = lib.omp_pause_resource_all(kind)
        print(f"Called omp_pause_resource_all with kind={kind.value}, result={result}")

In fact, instead of doing it in a pre_fork() hook, it works fine for me if I do the above directly after loading the xgboost model. Loading the model is the last operation related to xgboost in the main process before the worker processes are created by os.fork().
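
In code, that looks roughly like this (a sketch, assuming the body of the pre_fork() hook above is factored into a helper; the names and the model path are mine):

# module imported by the gunicorn master before the fork (--preload)
import ctypes

from xgboost import XGBClassifier
from xgboost.libpath import find_lib_path

def release_openmp_resources() -> None:
    # Same logic as the pre_fork() hook above: ask the OpenMP runtime to
    # release its threads and other resources (omp_pause_soft) before fork.
    lib = ctypes.CDLL(find_lib_path()[0])
    lib.omp_pause_resource_all.restype = ctypes.c_int
    lib.omp_pause_resource_all.argtypes = [ctypes.c_int]
    lib.omp_pause_resource_all(ctypes.c_int(1))  # 1 == omp_pause_soft

model = XGBClassifier()
model.load_model("model.json")
release_openmp_resources()  # last xgboost-related call before the workers are forked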

Maybe a function like prepare_for_fork() could be added to the sklearn interface, guarded on OpenMP 5 availability (omp_pause_resource_all requires an OpenMP 5 runtime).

More generically (for other libraries that use OpenMP and have the same problem), find_lib_path() can be replaced with something like this function (for Linux):

import psutil

def find_lib_gomp() -> list[str]:
    proc = psutil.Process()

    # Paths of the loaded shared libraries (memory mappings) that belong to
    # GCC's OpenMP runtime, libgomp
    return [lib.path for lib in proc.memory_maps() if "libgomp" in lib.path]
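
The pre_fork() hook can then target the OpenMP runtime directly (again just a sketch, assuming a GCC/libgomp build on Linux and the find_lib_gomp() helper above):

import ctypes

def pre_fork(server, worker) -> None:  # type: ignore
    paths = find_lib_gomp()
    if not paths:
        print("libgomp not found, cannot release OpenMP resources before fork.")
        return
    gomp = ctypes.CDLL(paths[0])
    gomp.omp_pause_resource_all.restype = ctypes.c_int
    gomp.omp_pause_resource_all.argtypes = [ctypes.c_int]
    gomp.omp_pause_resource_all(ctypes.c_int(1))  # 1 == omp_pause_soft
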
trivialfis commented 4 days ago

Thank you for sharing. Yes, fork() can be a problem for both OpenMP and CUDA, as described in the related issues linked in the comment by @josiahkhor. The best way to work around it is simply to use a process-pool library like loky, or the forkserver approach shared by @maroshmka.
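
With loky, for example, the predictions run in freshly spawned worker processes instead of forked copies of the master, roughly like this (a minimal sketch; the model path and data are placeholders):

import numpy as np
from loky import get_reusable_executor
from xgboost import XGBClassifier

def score(path, batch):
    # loky workers are started with a spawn-like method, so they do not
    # inherit the parent's OpenMP state the way os.fork() children do.
    model = XGBClassifier()
    model.load_model(path)
    return model.predict(np.asarray(batch))

if __name__ == "__main__":
    executor = get_reusable_executor(max_workers=4)
    future = executor.submit(score, "model.json", [[0.1, 0.2, 0.3]])
    print(future.result())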

The workaround by @Raemi is interesting, but I suspect that it's not quite robust for the long term.

This issue is a duplicate of https://github.com/dmlc/xgboost/issues/7044#issuecomment-1039912899.