benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

Gunicorn preload flag not working with PyTorch library #2478

Closed hsilveiro closed 3 years ago

hsilveiro commented 3 years ago

Hello, we have been developing a FastAPI application in which we use some external libraries to perform NLP tasks, such as tokenization. On top of this, we are launching the service with Gunicorn so that we can parallelize requests. However, we are having difficulties using Stanza with Gunicorn's preload flag active. Using this flag is a requirement for us: since Stanza models can be large, we want the models to be loaded only once, in the Gunicorn master process. This way, Gunicorn workers can access the models that were previously loaded in the master process.

The difficulty we are facing comes down to the fact that Gunicorn workers hang when trying to run inference with a given model (that was initially loaded by the master process).

We’ve done some research and debugging, but we weren’t able to find a solution. We did notice that the worker hangs when the code reaches the prediction step in PyTorch. Although we are talking about Stanza here, the same problem also occurred with the Sentence Transformers library, and both of them use PyTorch.

Following, I’ll present more details:

Environment:

FastAPI version: 0.54.2
Gunicorn version: 20.0.4
Uvicorn version: 0.12.3
Python version: 3.7
Stanza version: 1.1.1
OS: macOS Catalina 10.15.6

Steps executed:


We launch Gunicorn with the --preload flag; this way, the model is loaded only once, in the master process (a minimal sketch of this setup is shown below).
Once the required workers are forked, they should have access to that model without having to load it themselves (saving computational resources).
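A minimal sketch of this kind of setup (the file names, route, and processor list here are illustrative, not our exact code):

```python
# app.py -- illustrative sketch of the setup described above
import stanza
from fastapi import FastAPI

# With --preload, this module is imported once in the Gunicorn master
# process, so the Stanza pipeline below is built before the workers fork.
nlp = stanza.Pipeline("en", processors="tokenize")

app = FastAPI()

@app.post("/tokenize")
def tokenize(text: str):
    # The worker hangs inside this call when --preload is active.
    doc = nlp(text)
    return {"tokens": [token.text for sentence in doc.sentences for token in sentence.tokens]}
```

```python
# gunicorn.conf.py -- illustrative; equivalent to passing --preload on the command line
preload_app = True
workers = 2
worker_class = "uvicorn.workers.UvicornWorker"
```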

The problem happens when we receive a request that makes use of the model that was initially loaded. The worker responsible for handling the request is not able to use the model for inference; it hangs until the timeout occurs.

After analyzing the code and debugging it, we traced the following call path to the point where the code stops working:

1. Our code has a call to the `process()` method, `class Pipeline`, on the core.py file of Stanza.
2. That line calls the specific `process()` method, in this case from the tokenize_process.py, `class TokenizeProcessor`
3. Which calls the PyTorch code, `output_predictions()` method, from the utils.py
4. After some steps, it reaches the model.py file, still in PyTorch, `class Tokenizer(nn.Module)`, `forward(self, x, feats)` method, at the following line: `nontok = F.logsigmoid(-tok0)`. It seems that this line calls into some C++ code, which we didn’t investigate any further.

Of course, if we remove the --preload flag, everything runs smoothly. Removing it is something we want to avoid because of the extra computational resources that would be required (the models would be duplicated in every worker).

We looked through several other issues that could be related to this one, such as:
https://github.com/benoitc/gunicorn/issues/2157
https://github.com/tiangolo/fastapi/issues/2425
https://github.com/tiangolo/fastapi/issues/596
https://github.com/benoitc/gunicorn/issues/2124
and others...

After trying multiple solutions, it wasn’t possible to solve the issue. Do you have any suggestions to handle this? Or other tests that I can perform to give you more information?

Thanks in advance.

P.S.: I also opened issues on the Stanza and PyTorch github pages:
- https://github.com/stanfordnlp/stanza/issues/570
- https://github.com/pytorch/pytorch/issues/49555
jamadden commented 3 years ago

Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes so they will simply stop, unable to acquire the lock.

On macOS, many system libraries are not fork safe. There's at least one environment variable you can set (export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) that helps with some of that, but IIRC what that does is change from an outright crash into a maybe-it-works-maybe-it-doesn't situation, depending on the state of the process. (Google that variable for more.)

Getting a native stack trace of a hung worker could provide insight. On most platforms, I would suggest using py-spy, but py-spy can't get native stack traces on macOS. You could try using Activity Monitor to "Sample Process" the hung worker; that might reveal something.

jamadden commented 3 years ago

I'll add that some libraries offer APIs to call after a fork to fix up the process state (e.g., gevent has some after_fork or reinit functions—it mostly arranges to call those automatically but sometimes they have to be called manually). You might look for one of those and if you find it call it in a gunicorn hook.
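A sketch of what calling such an API from a Gunicorn hook could look like (gevent here is only the example the comment mentions; any library-specific re-init call would go in the same place):

```python
# gunicorn.conf.py -- illustrative only
import gevent

def post_fork(server, worker):
    # Re-initialize library state in the freshly forked worker. gevent mostly
    # arranges to do this itself; the same pattern applies to any library that
    # exposes an explicit after-fork / re-init API.
    gevent.reinit()
```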

hsilveiro commented 3 years ago

> Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes so they will simply stop, unable to acquire the lock.
>
> On macOS, many system libraries are not fork safe. There's at least one environment variable you can set (export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) that helps with some of that, but IIRC what that does is change from an outright crash into a maybe-it-works-maybe-it-doesn't situation, depending on the state of the process. (Google that variable for more.)
>
> Getting a native stack trace of a hung worker could provide insight. On most platforms, I would suggest using py-spy, but py-spy can't get native stack traces on macOS. You could try using Activity Monitor to "Sample Process" the hung worker; that might reveal something.

Hi again, I tested a few things based on what you said:

I already tried using gevent and the problem remains. However, I'll look into the functions you indicated (after_fork / reinit) and see if I get better results.

If you have any more suggestions, let me know.

Thanks!

jamadden commented 3 years ago

Posting the stack sample you observed would help others take a look.

hsilveiro commented 3 years ago

Sure, here is the exported file from the "Sample process" on the hung worker: sample_worker_process_preload.txt

jamadden commented 3 years ago

The last lines of the stack trace are very enlightening:

1 _PyMethodDef_RawFastCallKeywords  (in Python) + 685  [0x1033275ed]
2 torch::autograd::THPVariable_log_sigmoid(_object*, _object*, _object*)  (in libtorch_python.dylib) + 299  [0x114a3ec1b]
...
3 at::native::(anonymous namespace)::log_sigmoid_cpu_kernel(at::Tensor&, at::Tensor&, at::Tensor const&)  (in libtorch_cpu.dylib) + 994  [0x118f31f52]
4 at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)  (in libtorch_cpu.dylib) + 1160  [0x11586b358]
5 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)  (in libc++.1.dylib) + 18  [0x7fff6a323592]
6 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698 [0x7fff6d255425]
7 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff6d194882]

The first two lines tell us that the Python code has called THPVariable_log_sigmoid(). The internal details of that function wind up at line 4, parallel_run; the name is highly suggestive that this will want to do something with threads.

Sure enough, lines 5 through 7 show this thread trying to wait for a low-level threading primitive to become available. If that never happens, this thread never proceeds. The process hangs.

This goes back to what I suggested initially:

> Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes so they will simply stop, unable to acquire the lock.

It looks like that's exactly what has happened. The master gunicorn process used some API in libtorch that acquired a lock; when the process forked, that lock was still held, and there is no way to unlock it in the child.

If libtorch is meant to be fork-safe, there should be some way either to avoid taking that lock in the master process or to reset the state in the child process. You can look for APIs that do that.

Otherwise, you may have to experiment to find out exactly how much it is safe to do in the master process while still avoiding this problem. I recommend making sure that there are no extra threads running at the time of the fork; be sure to shut down or clean up all uses of this library before the fork.
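For PyTorch specifically, one speculative way to reduce the thread state created in the master before the fork (not something verified in this thread) is to pin the intra-op thread pool to a single thread before any model work happens, e.g. from the Gunicorn config file:

```python
# gunicorn.conf.py -- speculative mitigation, not confirmed to fix this issue
import torch

# Keep PyTorch from spinning up its intra-op worker threads in the master
# process, so less threading/lock state exists at the time of the fork.
torch.set_num_threads(1)
```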

tilgovi commented 3 years ago

Thanks for opening this issue. It should help others who have similar problems.

Thank you, Jason, for the clear diagnosis and helpful links.

At this time, I think there's nothing for Gunicorn to do here, and I will close this issue. Please let us know if that is a mistake.

reuben commented 2 years ago

FWIW there's a workaround for this if your goal is to prevent worst-case latency on the first request to a worker: "preload" manually in your application factory. Do a forward pass yourself with some example data after you create the app instance. You won't be able to share memory pages across processes, but it's an effective way to warm up the worker pool. Note that worker boot is subject to the same timeout as request handling, so you might need to bump --timeout.
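A rough sketch of that idea (the factory shape, model, and warm-up input are illustrative, not from the comment above):

```python
# app.py -- per-worker warm-up sketch, run *without* --preload
import stanza
from fastapi import FastAPI

def create_app() -> FastAPI:
    app = FastAPI()

    # Each worker builds its own copy of the model...
    nlp = stanza.Pipeline("en", processors="tokenize")

    # ...and immediately runs one dummy forward pass, so the first real
    # request does not pay the warm-up cost. Worker boot counts against
    # --timeout, so that may need to be raised.
    nlp("warm-up text")

    app.state.nlp = nlp
    return app

# Gunicorn still needs a module-level callable to serve.
app = create_app()
```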

ibraheem-tuffaha commented 2 years ago

I had a similar problem with PyTorch when running more than one worker. A simple workaround was to increase the number of threads to more than one.
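Assuming this refers to Gunicorn's own threads setting (rather than PyTorch's intra-op threads), a config sketch would be:

```python
# gunicorn.conf.py -- illustrative
workers = 2
# With the default sync worker, setting threads above 1 makes Gunicorn use
# the "gthread" worker class under the hood.
threads = 2
```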

lsmith77 commented 1 year ago

it seems like PyTorch shouldn't be run this way: https://github.com/benoitc/gunicorn/issues/2608#issuecomment-895776876

mmathys commented 1 year ago

Thanks for the analysis, I had the same issue.

ciliamadani commented 10 months ago

@hsilveiro I'm in the same situation, how did you solve this?

mmathys commented 10 months ago

@ciliamadani Maybe I can comment as well. I ended up initializing PyTorch after the workers have been forked. I used the post-fork hook.
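A sketch of what that might look like (the module holding the model and the pipeline arguments are hypothetical; the actual code is not shown in the comment):

```python
# gunicorn.conf.py -- sketch of "initialize PyTorch after fork"
def post_fork(server, worker):
    # Build the model inside the worker process itself, so no PyTorch
    # state created in the master is inherited across fork().
    import stanza
    import myservice.models  # hypothetical module with a module-level `nlp` slot
    myservice.models.nlp = stanza.Pipeline("en", processors="tokenize")
```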

lsmith77 commented 10 months ago

> @ciliamadani Maybe I can comment as well. I ended up initializing PyTorch after the workers have been forked. I used the post-fork hook.

So, in other words, you bypassed the preload behavior for your PyTorch models.

loretoparisi commented 8 months ago

> @ciliamadani Maybe I can comment as well. I ended up initializing PyTorch after the workers have been forked. I used the post-fork hook.

@mmathys How did you achieve that?

I have tried

def post_fork(server, worker):
    if not hasattr( worker.app, 'backend'):
        my_shared_backend = Backend(config)
        worker.app.backend = my_shared_backend

I can see that `worker.app` is a shared `<gunicorn.app.wsgiapp.WSGIApplication object at 0x7f84f00ca460>` instance, but the hasattr check always fails.

mmathys commented 8 months ago

Actually it ended up not working, @loretoparisi. I don't remember why.

loretoparisi commented 8 months ago

> Actually it ended up not working, @loretoparisi. I don't remember why.

Yes, thanks. In my understanding there's no other (or better) way than the preload flag with Gunicorn. Using Tornado + asyncio instead, it works fine.

mmathys commented 8 months ago

Thanks @loretoparisi for the hint! We still have this issue; we will try out Tornado (or another alternative framework).