AlexiaChen / AlexiaChen.github.io

My Blog https://github.com/AlexiaChen/AlexiaChen.github.io/issues

Fixing the WORKER TIMEOUT error with Flask and gunicorn #174

Open AlexiaChen opened 1 year ago

AlexiaChen commented 1 year ago

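The service architecture is a WSGI server started by gunicorn, with Nginx in front of it as a reverse proxy. This is the Nginx + gunicorn + Flask stack commonly described online.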

The error log:

[2023-04-28 01:58:09 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:15)
[2023-04-28 01:58:09,717] INFO in client: Got keepalive def03be1-9193-4219-be32-5c3caf806f6e in 10.36s
Exception ignored in: <function _ChannelCallState.__del__ at 0x7fd37fb905e0>
Traceback (most recent call last):
  File "/app/__pypackages__/3.8/lib/grpc/_channel.py", line 1247, in __del__
    self.channel.close(cygrpc.StatusCode.cancelled,
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 513, in grpc._cython.cygrpc.Channel.close
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 399, in grpc._cython.cygrpc._close
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 420, in grpc._cython.cygrpc._close
  File "/usr/local/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
  File "/app/__pypackages__/3.8/lib/gevent/thread.py", line 121, in acquire
    acquired = BoundedSemaphore.acquire(self, blocking, timeout)
  File "src/gevent/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 259, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 249, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
        Hub: <Hub '' at 0x7fd38477b220 epoll default pending=0 ref=0 fileno=6 resolver=<gevent.resolver.thread.Resolver at 0x7fd383daf100 pool=<ThreadPool at 0x7fd380a6c740 tasks=0 size=0 maxsize=10 hub=<Hub at 0x7fd38477b220 thread_ident=0x7fd385f4b740>>> threadpool=<ThreadPool at 0x7fd380a6c740 tasks=0 size=0 maxsize=10 hub=<Hub at 0x7fd38477b220 thread_ident=0x7fd385f4b740>> thread_ident=0x7fd385f4b740>
        Handles:
[]

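In production I noticed that an HTTP request to a REST API service written in Python Flask was blocked for a long time, and increasing gunicorn's timeout setting did not help. I tried the various approaches from https://stackoverflow.com/questions/10855197/frequent-worker-timeout, including setting preload to True, and none of them worked either.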

The gunicorn configuration:

import multiprocessing

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "gevent"
loglevel = "info"

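Thinking it over more carefully, I asked myself why none of the other HTTP REST API endpoints were blocked and timing out like this. The difference is that this endpoint in turn calls Stable Diffusion's gRPC API, not Stable Diffusion's REST API. On top of that, my gunicorn worker_class is set to gevent; with the default sync worker the problem does not occur. For my server-side workload, though, an async worker such as gevent is the recommended choice. So I Googled keywords related to grpc, gevent, and gunicorn, and finally found the cause: gevent and grpc are simply not compatible with each other.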
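A plausible explanation for the WORKER TIMEOUT line in the log, not confirmed in the original investigation, is that a cooperative worker sends its liveness heartbeat from the same event loop that serves requests, so a call that blocks without yielding silences the heartbeat as well. The sketch below illustrates only that general effect. It uses the standard library's asyncio as a stand-in for gevent (an assumption, chosen so the example runs without extra packages), and `heartbeat`, `handle_request`, and `run_demo` are hypothetical names, not gunicorn or gevent APIs.

```python
import asyncio
import time


async def heartbeat(ticks: list, interval: float = 0.05) -> None:
    """Stand-in for a worker's periodic "I am alive" notification."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def handle_request(blocking: bool, duration: float = 0.3) -> None:
    """Hypothetical request handler waiting on a downstream call."""
    if blocking:
        time.sleep(duration)           # never yields to the event loop
    else:
        await asyncio.sleep(duration)  # yields, so other tasks keep running


async def count_heartbeats(blocking: bool) -> int:
    """Count how many heartbeats fire while one request is being handled."""
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    await handle_request(blocking)
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return len(ticks)


def run_demo() -> tuple:
    """Return (heartbeats with a blocking call, heartbeats with a yielding call)."""
    blocked = asyncio.run(count_heartbeats(blocking=True))
    cooperative = asyncio.run(count_heartbeats(blocking=False))
    return blocked, cooperative


if __name__ == "__main__":
    print(run_demo())
```

With the blocking call, only the initial heartbeat is recorded; with the yielding call, heartbeats keep arriving for the whole duration of the request. A supervisor that expects regular heartbeats would treat the first case as a hung worker.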

At the very top of your Flask entry program (app.py, for example), before importing any standard-library modules, add the following compatibility patch:

from gevent import monkey
monkey.patch_all()

import grpc.experimental.gevent as grpc_gevent
grpc_gevent.init_gevent()

# import a bunch of standard packages
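Because the position of the patch matters, a small check can guard against a later edit that moves another import above it. The following is a minimal sketch using only the standard library's `ast` module. `patch_comes_first` and its helpers are hypothetical names, not part of gevent, grpc, or gunicorn, and the check only looks at top-level statements and at a call named `patch_all`.

```python
import ast


def _is_patch_all_call(node: ast.stmt) -> bool:
    """True for a top-level statement such as `monkey.patch_all()`."""
    return (
        isinstance(node, ast.Expr)
        and isinstance(node.value, ast.Call)
        and isinstance(node.value.func, ast.Attribute)
        and node.value.func.attr == "patch_all"
    )


def _is_allowed_before_patch(node: ast.stmt) -> bool:
    """True for gevent imports and `__future__` imports, which may precede the patch."""
    if isinstance(node, ast.ImportFrom):
        return node.module in ("gevent", "__future__")
    if isinstance(node, ast.Import):
        return all(alias.name.split(".")[0] == "gevent" for alias in node.names)
    return False


def patch_comes_first(source: str) -> bool:
    """Return True if `patch_all()` is called before any non-gevent import."""
    for node in ast.parse(source).body:
        if _is_patch_all_call(node):
            return True
        is_import = isinstance(node, (ast.Import, ast.ImportFrom))
        if is_import and not _is_allowed_before_patch(node):
            return False
    return False  # no patch call found at all


# Illustrative inputs: the first follows the ordering above, the second does not.
GOOD = "from gevent import monkey\nmonkey.patch_all()\nimport os\n"
BAD = "import os\nfrom gevent import monkey\nmonkey.patch_all()\n"

if __name__ == "__main__":
    print(patch_comes_first(GOOD), patch_comes_first(BAD))
```

In practice the source string would come from reading the entry file, for example `open("app.py").read()`, and the check could run as part of a test suite.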

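I find this code rather ugly, and the related compatibility issues are fairly old. My gevent and grpc versions should not be outdated, so by rights the open-source community should have fixed this long ago. The issue threads even say it has been resolved and that this was only a temporary patch. To my surprise, this patch is still what fixed the problem. Since it is solved, I did not dig into exactly which version was supposed to contain the fix.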

After applying the patch, I recommend changing the gunicorn configuration to the following:

import multiprocessing

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "gevent"
loglevel = "info"
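timeout = 100
graceful_timeout = 100
keepalive = 256
# In a config file the setting is named preload_app (the CLI flag is --preload);
# a bare `preload` name is not a gunicorn setting.
preload_app = True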
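For a sense of scale, the `workers` expression in the configuration evaluates to two workers per CPU core plus one. The sketch below only reproduces that arithmetic; `worker_count` is a hypothetical helper and is not something gunicorn provides.

```python
import multiprocessing


def worker_count(cpus: int) -> int:
    """Same formula as the `workers` line in the config: 2 * cores + 1."""
    return cpus * 2 + 1


if __name__ == "__main__":
    # Show the value for a few core counts, then for the current machine.
    for cpus in (1, 2, 4, 8):
        print(cpus, "cores ->", worker_count(cpus), "workers")
    print("this machine ->", worker_count(multiprocessing.cpu_count()), "workers")
```

On a 4-core host this gives 9 worker processes, each of which is a separate gevent worker that needs the patch described earlier.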

References