Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes so they will simply stop, unable to acquire the lock.
On macOS, many system libraries are not fork safe. There's at least one environment variable you can set (export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) that helps with some of that, but IIRC what that does is change from an outright crash into a maybe-it-works-maybe-it-doesn't situation, depending on the state of the process. (Google that variable for more.)
Getting a native stack trace of a hung worker could provide insight. On most platforms, I would suggest using py-spy, but py-spy can't get native stack traces on macOS. You could try using Activity Monitor to "Sample Process" the hung worker; that might reveal something.
I'll add that some libraries offer APIs to call after a fork to fix up the process state (e.g., gevent has some after_fork or reinit functions; it mostly arranges to call those automatically, but sometimes they have to be called manually). You might look for one of those and, if you find one, call it in a gunicorn hook.
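If the library in question did expose such a fixup call, a gunicorn hook in the config file could invoke it in each worker right after the fork. This is only a sketch: some_library and reinit_after_fork() are made-up names standing in for whatever your library actually provides.

```python
# gunicorn.conf.py -- sketch only; some_library / reinit_after_fork are hypothetical names.
def post_fork(server, worker):
    # Runs in the worker process just after it has been forked.
    import some_library
    some_library.reinit_after_fork()  # the library's "fix my state after fork" call, if it has one
```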
Hi again, I tested the following things based on what you said:
- Set OBJC_DISABLE_INITIALIZE_FORK_SAFETY to YES -> the problem remained.
- I had already tried using gevent and the problem remains. However, I'll look at the functions you indicated (after_fork or reinit) and see if I get better results.
If you have any more suggestions, let me know.
Thanks!
Posting the stack sample you observed would help others take a look.
Sure, here is the exported file from the "Sample process" on the hung worker: sample_worker_process_preload.txt
The last lines of the stack trace are very enlightening:
1 _PyMethodDef_RawFastCallKeywords (in Python) + 685 [0x1033275ed]
2 torch::autograd::THPVariable_log_sigmoid(_object*, _object*, _object*) (in libtorch_python.dylib) + 299 [0x114a3ec1b]
...
3 at::native::(anonymous namespace)::log_sigmoid_cpu_kernel(at::Tensor&, at::Tensor&, at::Tensor const&) (in libtorch_cpu.dylib) + 994 [0x118f31f52]
4 at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&) (in libtorch_cpu.dylib) + 1160 [0x11586b358]
5 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) (in libc++.1.dylib) + 18 [0x7fff6a323592]
6 _pthread_cond_wait (in libsystem_pthread.dylib) + 698 [0x7fff6d255425]
7 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff6d194882]
The first two lines tell us that the Python code has called THPVariable_log_sigmoid(). The internal details of that function wind up at line 4, parallel_run; the name is highly suggestive that this will want to do something with threads.
Sure enough, lines 5 through 7 show this thread trying to wait for a low-level threading primitive to become available. If that never happens, this thread never proceeds. The process hangs.
This goes back to what I suggested initially:
Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes, so they will simply stop, unable to acquire the lock.
It looks like that's exactly what has happened. The master gunicorn process used some API in libtorch that acquired a lock; when the process forked, that lock was still locked, and there is no way to unlock it.
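To make that failure mode concrete, here is a small, self-contained Python illustration (POSIX only, and nothing to do with libtorch itself): a lock held at fork time is copied into the child in the locked state, and the child blocks forever trying to take it.

```python
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    lock.acquire()   # a background thread takes the lock...
    time.sleep(60)   # ...and is still holding it when the fork happens

threading.Thread(target=hold_lock, daemon=True).start()
time.sleep(0.1)      # give the thread time to grab the lock

pid = os.fork()
if pid == 0:
    # Child: the lock's memory was copied in the "locked" state, but the
    # thread that owns it does not exist here, so this blocks forever.
    lock.acquire()
    print("the child never reaches this line")
    os._exit(0)
else:
    os.waitpid(pid, 0)   # the parent also waits forever on the hung child
```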
If libtorch is meant to be fork-safe, there would be some way either to avoid taking that lock in the master process or to reset the state in the child process. You can look for APIs that do that.
Otherwise, you may have to experiment to find out exactly how much it is safe to do in the master process while still avoiding this problem. I recommend making sure that there are no extra threads running at the time of the fork: be sure to shutdown/cleanup/whatever all uses of this library before the fork.
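One cheap way to check the "no extra threads" part is a pre_fork hook that warns if anything besides the main thread is alive in the master right before each fork. Note this only sees Python-level threads, not native threads a C library may have started, so it is a sanity check rather than a guarantee.

```python
# gunicorn.conf.py -- sketch of a sanity check before each fork.
import threading

def pre_fork(server, worker):
    # Only Python-level threads show up here; native threads created by
    # C libraries (e.g. a libtorch thread pool) will not be listed.
    extra = [t for t in threading.enumerate() if t is not threading.main_thread()]
    if extra:
        server.log.warning("forking with extra threads alive: %r", extra)
```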
Thanks for opening this issue. It should help others who have similar problems.
Thank you, Jason, for the clear diagnosis and helpful links.
At this time, I think there's nothing for Gunicorn to do here, and I will close this issue. Please let us know if that is a mistake.
FWIW there's a workaround for this if your goal is to prevent worst-case latency on the first request to a worker: "preload" manually in your application factory. Do a forward pass yourself with some example data after you create the app instance. You won't be able to re-use pages across processes, but it's an effective way to warm up the worker pool. Note that worker boot is subject to the same timeout as request handling, so you might need to bump --timeout.
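A rough sketch of that idea follows; the model file, input shape, and framework wiring are made up for illustration, not taken from the original poster's project. The point is simply to load the model and run one throwaway forward pass inside the application factory, so each worker pays the warm-up cost at boot rather than on its first request.

```python
# Sketch of "manual preload" in the application factory; names and shapes are illustrative.
import torch
from fastapi import FastAPI

def create_app() -> FastAPI:
    app = FastAPI()

    model = torch.load("model.pt")   # hypothetical serialized model
    model.eval()
    app.state.model = model

    # Warm-up: one dummy forward pass so the first real request does not
    # pay the one-time initialization cost (thread pools, lazy kernels, ...).
    with torch.no_grad():
        model(torch.zeros(1, 128))   # dummy input; the shape depends on your model

    return app
```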
I had a similar problem with PyTorch when running more than one worker. A simple workaround was to increase the number of threads to more than one.
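If that workaround is applied through a Gunicorn config file, it amounts to setting threads above one; the values below are illustrative only.

```python
# gunicorn.conf.py -- illustrative values only.
workers = 2
threads = 2   # with the default sync worker, setting threads > 1 switches to the gthread worker
```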
It seems like PyTorch shouldn't be run this way: https://github.com/benoitc/gunicorn/issues/2608#issuecomment-895776876
Thanks for the analysis, I had the same issue.
@hsilveiro I'm in the same situation, how did you solve this?
@ciliamadani Maybe I can comment as well. I ended up initializing PyTorch after the workers have been forked. I used the post-fork hook.
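A minimal sketch of that approach, with hypothetical module and function names (this is not the commenter's actual code): the config's post_fork hook creates the torch objects, so none of that state exists in the master at fork time.

```python
# gunicorn.conf.py -- sketch; myapp.models and load_models() are hypothetical names.
def post_fork(server, worker):
    # Runs in each worker after the fork: the PyTorch model is created here,
    # so the master process never touches libtorch before forking.
    from myapp import models
    models.load_models()   # populates a module-level registry that the app reads from
```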
So in other words, you bypassed the preload behavior for your PyTorch models?
@mmathys How did you achieve that?
I have tried
def post_fork(server, worker):
    if not hasattr(worker.app, 'backend'):
        # attach the backend to the app object once per worker
        my_shared_backend = Backend(config)
        worker.app.backend = my_shared_backend
I can see that worker.app is a shared <gunicorn.app.wsgiapp.WSGIApplication object at 0x7f84f00ca460> instance, but the hasattr check always fails.
Actually it ended up not working @loretoparisi. Don't remember why
Yes, thanks. In my understanding there's no other (or better) way than the preload flag with Gunicorn. Using Tornado + asyncio instead, it works fine.
Thanks @loretoparisi for the hint! We still have this issue, will try out Tornado (or another alternative framework)
Hello, we have been developing a FastAPI application where we use some external libraries to perform NLP tasks, such as tokenization. On top of this, we are launching the service with Gunicorn so that we can parallelize the requests. However, we are having difficulties using Stanza with Gunicorn's preload flag active. Using this flag is a requirement because Stanza models can be large and we want them to be loaded only once, in the Gunicorn master process. This way, the Gunicorn workers can access the models that were previously loaded in the master process.
The difficulty we are facing comes down to the fact that the Gunicorn workers hang when trying to run inference with a given model (that was initially loaded by the master process).
We've done some research and debugging but we weren't able to find a solution. However, we noticed that the worker hangs when the code reaches the prediction step in PyTorch. Although we are talking about Stanza here, this problem also occurred with the Sentence Transformers library, and both of them use PyTorch.
Following, I’ll present more details:
Environment:
Steps executed:
Gunicorn command:
The code that will run before launching the workers