Yelp / bravado

Bravado is a python client library for Swagger 2.0 services
Other
605 stars 117 forks source link

Severe performance degradation in uWSGI preforking mode starting from v10.1.0 #473

Closed JiriHorky closed 3 years ago

JiriHorky commented 3 years ago

Hello,

since version v10.1.0, the bravado client suffers from 2-3x performance degradation in uWSGI's default preforking execution mode. The problem is rather technical and it took us a while to understand what's happening:

Starting from commit f1ea220730b295aa625683eb23fb081fe893562b, the initialization of the bravado client contains import of twisted.web.client module. This, under the hood on Linux, initializes EPollReactor from twisted.internet.epollreactor module. The problem is that the EpollReactor, in its constructor, creates an epoll socket (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/epollreactor.py#L213).

Now, when using preforking uWSGI mode, uWSGI initializes the application in its master process and then it takes advantage of the Linux's copy-on-write semantic of fork() call and forks its children worker processes which is usually very cheap operation. This way, the application code needs to be initialized just once instead of UWSGI_PROCESSES times. In our case, the start of the application takes 30s, so doing it just once instead of 20 times makes significant difference. The children processes in Linux inherits parents resources, among others, open file descriptors.

So in short, starting from v10.1.0, the children processes forked by uWSGI processes inherit the same epoll socket for asynchronous operations used by the fido http client that bravado uses. This, of course, creates conflicts, because different processes uses the same epoll control socket for adding its own file descriptors. It may very well happen that one uWSGI child instructs epoll control socket to start monitoring fd=15, while other uWSGI child instructs the epoll to monitor fd=14 (which are two very different connections in reality), that doesn't exist in the first child. So the epoll_wait function in the first child can return that it is possible to read from fd=14 which doesn't exist in the process!

Now, to make things even "worse", instead of firing an exception, the EPollReactor class actually maintain its own map of file descriptors that it is interested in. If the epoll_wait call returns a file descriptor that is not in the internal map, the code just silently ignores this (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/epollreactor.py#L376). In this case, the function returns to the mainLoop function (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/base.py#L1172) which modifies the timeout value and calls the epollreactor's doPoll function (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/epollreactor.py#L350) again. Since there is, again, a file descriptor that can be read (the same one as before), but we are not interested in it, this tight loop continues again and again.

This results in tens of thousands of epoll_wait and gettimeofday (in order to recompute the timeout value) system calls a second up until some interested file descriptor is returned from epoll_wait function, or some other process modifies the epoll control socket. These syscalls and tight loop makes bravado around 3x slower than previously.

The workaround is to use lazy-apps option of uWSGI which makes uWSGI to fully initialize the application. But it also significantly slows down the start of the application.

It would be great if the bravado could initialize itself without actually importing the twisted.web.client module in its constructors, so it is still possible to safely use it in the preforking uWSGI mode.

I enclose the stack trace of our application that leads to the creation of the epoll socket:

  File "/usr/lib/python3.8/site-packages/gateway/production.py", line 9, in <module>
    from .app import create_app
  File "/usr/lib/python3.8/site-packages/gateway/app.py", line 27, in <module>
    from .clients import Clients
  File "/usr/lib/python3.8/site-packages/gateway/clients/__init__.py", line 8, in <module>
    from bravado.fido_client import FidoClient
  File "/usr/lib/python3.8/site-packages/bravado/fido_client.py", line 10, in <module>
    import twisted.web.client
  File "/usr/lib/python3.8/site-packages/twisted/web/client.py", line 39, in <module>
    from twisted.internet import defer, protocol, task, reactor
  File "/usr/lib/python3.8/site-packages/twisted/internet/reactor.py", line 39, in <module>
    default.install()
  File "/usr/lib/python3.8/site-packages/twisted/internet/epollreactor.py", line 265, in install
    p = EPollReactor()
  File "/usr/lib/python3.8/site-packages/twisted/internet/epollreactor.py", line 75, in __init__

PS: This issue was quite hard to debug since the performance degradation happened in other than the main thread of the application. If requested, I can provide example profiling data of the affected threads confirming the bug.

krtek commented 3 years ago

Hi.

We would really need this issue to be resolved. Attaching image with response times from out service with and without the fix in PR #474.

Purple line is average response with current version of Bravado, blue line is average response with applied #474 fix.

Screenshot 2021-03-02 at 11 21 42

analogue commented 3 years ago

Fixed in 11.0.3