benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org
Other
9.86k stars 1.75k forks source link

API calls getting stuck indefinitely possibiliy due to error "gevent.exceptions.LoopExit" #3117

Closed vipinsankhhwar closed 5 months ago

vipinsankhhwar commented 11 months ago

Issue: We have recently converted one of our Flask+Gunicorn based service from Python2 to Python3(Py23 complaint code, minor syntactical changes). Post deployment, we are often observing API calls to this service getting stuck. This is not the issue of loaded service as MEM/CPU utilization remains low. Did some research on the error but not very helpful to resolve the issue. We do monkey patch at very beginning of service main file.(_from gevent import monkey; monkey.patchall())

Setup details

Few observations from the issue

  1. We could not reproduce this in QA environment, happens only in production.
  2. Occurrence is random. We observed this in least peak hours too.
  3. Memory grows over time, probably as a side effect to this issue.
  4. In python2 we never faced this issues.
  5. We do use ThreadPoolExecutor for some of our API implementation for calling downstream services parallely. with ThreadPoolExecutor(max_workers=8) as executor: list(executor.map(lambda args : call_apis(*args), [ for args in argList]))
  6. Most of the time issue happens, there is a Gevent traceback in logs. Have have not seen this traceback in python2.
  7. `{"log":"\tHub: <Hub '' at 0x7f5b90267680 epoll default pending=0 ref=1 fileno=6 resolver=<gevent.resolver.thread.Resolver at 0x7f5b9022b2e0 pool=<ThreadPool at 0x7f5b90264660 tasks=0 size=2 maxsize=10 hub=<Hub at 0x7f5b90267680 thread_ident=0x7f5b9311fb80>>> threadpool=<ThreadPool at 0x7f5b90264660 tasks=0 size=2 maxsize=10 hub=<Hub at 0x7f5b90267680 thread_ident=0x7f5b9311fb80>> thread_ident=0x7f5b9311fb80>"}

Traceback (most recent call last): File "\/usr\/local\/pyenv\/versions\/3.9.16\/lib\/python3.9\/site-packages\/gevent\/monkey.py", line 868, in _shutdown orig_shutdown() File "\/usr\/local\/pyenv\/versions\/3.9.16\/lib\/python3.9\/threading.py", line 1447, in _shutdown atexit_call() File "\/usr\/local\/pyenv\/versions\/3.9.16\/lib\/python3.9\/concurrent\/futures\/thread.py", line 31, in _python_exit t.join() File "\/usr\/local\/pyenv\/versions\/3.9.16\/lib\/python3.9\/threading.py", line 1060, in join self._wait_for_tstate_lock() File "\/usr\/local\/pyenv\/versions\/3.9.16\/lib\/python3.9\/threading.py", line 1080, in _wait_for_tstate_lock if lock.acquire(block, timeout): File "\/usr\/local\/pyenv\/versions\/3.9.16\/lib\/python3.9\/site-packages\/gevent\/thread.py", line 112, in acquire acquired = BoundedSemaphore.acquire(self, blocking, timeout) File "src\/gevent\/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire File "src\/gevent\/_semaphore.py", line 259, in gevent._gevent_c_semaphore.Semaphore.acquire File "src\/gevent\/_semaphore.py", line 249, in gevent._gevent_c_semaphore.Semaphore.acquire File "src\/gevent\/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait File "src\/gevent\/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core File "src\/gevent\/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core File "src\/gevent\/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified File "src\/gevent\/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub File "src\/gevent\/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch File "src\/gevent\/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch File "src\/gevent\/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch gevent.exceptions.LoopExit: This operation would block forever `

vipinsankhhwar commented 11 months ago

Can someone please take a look and help us with possible root causes? In the meantime, we have switched back the service to run in python2 where the issue is not visible. So it concludes that there are certain differences in gunicorn behavior in py3 causing the issue.

pajod commented 11 months ago

Even if you cannot provide a full reproducer, please do add more details: --print_config and the call / type of object that is used/shared by the sub-threads? If it is a locking issue, I would bet on sockets too, so what you are describing as "calling downstream services parallely" may very well be on point - sharing something you were meaning to instantiate separately.

(Also, are you aware that gevent.monkey also has is_module_patched() / is_object_patched()? That may help assert that your import order works as intended.)

vipinsankhhwar commented 11 months ago

Hi @pajod
Thanks for your quick response. Below is the requested config. Also sharing below a sample code on how we use multi threading in the APIs. There is no locking as such between the threads as each of them independently runs.

Also I am aware of is_module_patched(). However since we have done patch_all() at the beginning, have not cross checked/asserted it in the API implementations.

sample multi threaded APIs

from __future__ import absolute_import
from gevent import monkey;monkey.patch_all()
from flask import Flask, jsonify, request
from gunicorn.app.base import BaseApplication
from gunicorn import util
from six import iteritems
import threading, time, os
from concurrent.futures import ThreadPoolExecutor
import random, requests
import builtins
MAX_JOBTHREADS = 5
app = Flask(__name__)

def concurrently_execute_a_function(func, args_list, kwargs=None):
    num_of_workers = min(MAX_JOBTHREADS, len(args_list))
    with ThreadPoolExecutor(max_workers=num_of_workers) as executor:
        # result is list of responses we got from all the workers
        return list(executor.map(lambda args : func(*args, **(kwargs if kwargs else {})), args_list))

def _get_users(customerid, productid):
    apiserver_host = "https://xyz.com/%s/userconfig/v1/users"%productid
    r = requests.get(apiserver_host, params ={"customerid": customerid}).json()
    return r

@app.route('/users')
def get_users():
    customerid = request.args.get('customerid')
    arg_list = [(customerid, productid) for productid in [1,2]]
    result = concurrently_execute_a_function(_get_users, arg_list)
    return result

@app.route('/account')
def get_account_info():
    customerid = request.args.get('customerid')
    r = requests.get("https://xyz.com/plaform/licenseconfig", params ={"customerid": customerid}).json()

if __name__ == '__main__':
    class StandaloneApplication(BaseApplication):
        def __init__(self, app, options=None):
            self.options = options or {}
            self.application = app
            super(StandaloneApplication, self).__init__()

        def load_config(self):
            config = {key: value for key, value in iteritems(self.options) if key in self.cfg.settings and value is not None}
            for key, value in iteritems(config):
                self.cfg.set(key.lower(), value)

        def load(self):
            return self.application

    #https://docs.gunicorn.org/en/stable/settings.html
    options = {
        'bind': '0.0.0.0:10001',  # Adjust the binding address and port as needed
        'workers': 1,
        'worker_class': 'gevent',  # Use 'gevent' worker class for better concurrency,
        'timeout': 30,
        'graceful_timeout': 15,
        'max_requests': 2,
        'max_requests_jitter': 2,
        'keepalive': 70,
        'worker_connections': 10,
        'loglevel': "debug"

    }
    StandaloneApplication(app, options).run()

print_config result


accesslog                         = -
backlog                           = 2048
bind                              = ['0.0.0.0:3317']
ca_certs                          = None
capture_output                    = False
cert_reqs                         = 0
certfile                          = /druva/server.cert
chdir                             = /opt/Druva/One/DruvaOneServer/Deployment
check_config                      = False
child_exit                        = <ChildExit.child_exit()>
ciphers                           = ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA:HIGH:!RC4:!MD5:!aNULL:!3DES:!DES-CBC3-SHA:!EDH:!CAMELLIA:!ECDHE-RSA-AES256-SHA384:!ECDHE-RSA-AES256-GCM-SHA384
config                            = None
daemon                            = False
default_proc_name                 = gunicorn
disable_redirect_access_to_syslog = False
do_handshake_on_connect           = False
enable_stdio_inheritance          = False
errorlog                          = -
forwarded_allow_ips               = ['127.0.0.1']
graceful_timeout                  = 30
group                             = 0
initgroups                        = False
keepalive                         = 70
keyfile                           = /druva/server.pem
limit_request_field_size          = 8190
limit_request_fields              = 100
limit_request_line                = 4094
logconfig                         = None
logconfig_dict                    = {}
logger_class                      = gunicorn.glogging.Logger
loglevel                          = info
max_requests                      = 5000
max_requests_jitter               = 2500
nworkers_changed                  = <NumWorkersChanged.nworkers_changed()>
on_exit                           = <OnExit.on_exit()>
on_reload                         = <OnReload.on_reload()>
on_starting                       = <OnStarting.on_starting()>
paste                             = None
pidfile                           = None
post_fork                         = <post_fork()>
post_request                      = <PostRequest.post_request()>
post_worker_init                  = <post_worker_init()>
pre_exec                          = <PreExec.pre_exec()>
pre_fork                          = <Prefork.pre_fork()>
pre_request                       = <PreRequest.pre_request()>
preload_app                       = False
proc_name                         = None
proxy_allow_ips                   = ['127.0.0.1']
proxy_protocol                    = False
pythonpath                        = None
raw_env                           = []
raw_paste_global_conf             = []
reload                            = False
reload_engine                     = auto
reload_extra_files                = []
reuse_port                        = False
secure_scheme_headers             = {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
sendfile                          = None
spew                              = False
ssl_version                       = 5
statsd_host                       = None
statsd_prefix                     =
suppress_ragged_eofs              = True
syslog                            = False
syslog_addr                       = udp://localhost:514
syslog_facility                   = user
syslog_prefix                     = None
threads                           = 1
timeout                           = 180
tmp_upload_dir                    = None
umask                             = 0
user                              = 0
when_ready                        = <when_ready()>
worker_abort                      = <WorkerAbort.worker_abort()>
worker_class                      = gevent
worker_connections                = 1000
worker_exit                       = <WorkerExit.worker_exit()>
worker_int                        = <WorkerInt.worker_int()>
worker_tmp_dir                    = None
workers                           = 4 <class 'gunicorn.config.Config'>```
pajod commented 10 months ago

Unable to trigger it like that on Python 3.11.2 / gevent 23.9.1 / greenlet 3.0.2 / flask 3.0.0 - it will always give me a nice traceback on unhandled exceptions and occasionally restart workers as instructed.

But since I did produce unrelated gevent AttributeError by overloading the server during the first worker restarts.. give that (high request counts at low max_requests) a shot and see if it helps towards reliable reproduction.

vipinsankhhwar commented 10 months ago

@pajod I tried with max_requests as low as 2 and high load(api requests) but I did not observe Attribute error. Nor did we observed the same in prod. I will try a bit more on this.

Also considering the fact that exact same application code works mostly fine in python2, most likely it should not be a locking issue otherwise py2 should also see the same behavior. Another change I noticed today is that in python3 we had to use "pymysql" driver instead of "mysqldb" for RDS connection. We did this as mysqldb cant be monkey patched in python3. There could be some more such implicit difference b/w py2 and 3 which we would look for.

In the meantime, can you help with below queries

  1. We have good logging around APIs. We log API at entry-point, success and failure. Often in stuck cases, we see logs for entry-point but no failure or completion logs. There are no logs in downstream services too against the stuck API request.
    • Theoretically, based on your expertise in the gunicorn framework, can you suggest what could cause this behavior?
    • When does loopExit error typically comes. Could this error be related to the issue? Also is this error conclusive for monkey patching issue?
    • Is there any enhanced logging that can be enabled to know where exactly is the gevent stuck?
vipinsankhhwar commented 6 months ago

An update: For us, issue looks to be related to Gevent incompatible urllib3 version. We are in process of testing this out with lower version of urllib3. Refer this.

pajod commented 6 months ago

Thanks for that context - though I am not convinced yet that the locking issue in the requests queue is directly related to the locking issue on thread join (btw, the latter is likely going to be deleted in 3.13 anyway; unfortunately testing against the beta fails for unrelated reasons).

vipinsankhhwar commented 5 months ago

We no more see this issue in production after downgrading urllib3 version. LoopExit errors were possibly a false alarm as we still see these errors in prod occasionally however server is responsive and requests are not getting stuck. Closing the ticket.