Closed JiriHorky closed 3 years ago
Hi.
We would really need this issue to be resolved. Attaching an image with response times from our service with and without the fix from PR #474.
The purple line is the average response time with the current version of Bravado; the blue line is the average response time with the #474 fix applied.
Fixed in 11.0.3
Hello,
since version v10.1.0, the bravado client suffers from a 2-3x performance degradation in uWSGI's default preforking execution mode. The problem is rather technical and it took us a while to understand what's happening:
Starting from commit f1ea220730b295aa625683eb23fb081fe893562b, the initialization of the bravado client imports the `twisted.web.client` module. Under the hood on Linux, this initializes `EPollReactor` from the `twisted.internet.epollreactor` module. The problem is that the `EPollReactor`, in its constructor, creates an `epoll` socket (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/epollreactor.py#L213).

Now, when using the preforking uWSGI mode, uWSGI initializes the application in its master process and then takes advantage of Linux's copy-on-write semantics of the `fork()` call to fork its child worker processes, which is usually a very cheap operation. This way, the application code needs to be initialized just once instead of UWSGI_PROCESSES times. In our case, the start of the application takes 30s, so doing it just once instead of 20 times makes a significant difference. Child processes in Linux inherit the parent's resources, among others its open file descriptors.

So in short, starting from v10.1.0, the child processes forked by uWSGI inherit the same `epoll` socket for the asynchronous operations used by the fido HTTP client that bravado uses. This, of course, creates conflicts, because different processes use the same `epoll` control socket to add their own file descriptors. It may very well happen that one uWSGI child instructs the `epoll` control socket to start monitoring fd=15, while another uWSGI child instructs it to monitor fd=14 (two very different connections in reality) which doesn't exist in the first child. So the `epoll_wait` function in the first child can report that it is possible to read from fd=14, a file descriptor that doesn't exist in that process!

Now, to make things even "worse", instead of raising an exception, the `EPollReactor` class maintains its own map of the file descriptors it is interested in. If the `epoll_wait` call returns a file descriptor that is not in this internal map, the code just silently ignores it (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/epollreactor.py#L376). In that case, the function returns to the `mainLoop` function (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/base.py#L1172), which recomputes the timeout value and calls the epollreactor's `doPoll` function (https://github.com/racker/python-twisted-core/blob/master/twisted/internet/epollreactor.py#L350) again. Since there is, again, a readable file descriptor that we are not interested in (the same one as before), this tight loop continues again and again.

This results in tens of thousands of `epoll_wait` and `gettimeofday` (to recompute the timeout value) system calls per second, until `epoll_wait` returns a file descriptor the process is actually interested in, or some other process modifies the `epoll` control socket. These syscalls and the tight loop make bravado around 3x slower than before.

The workaround is to use the `lazy-apps` option of uWSGI, which makes uWSGI fully initialize the application in every worker. But it also significantly slows down the start of the application.

It would be great if bravado could initialize itself without actually importing the `twisted.web.client` module in its constructors, so that it is still possible to safely use it in the preforking uWSGI mode.

I enclose the stack trace of our application that leads to the creation of the `epoll` socket:

PS: This issue was quite hard to debug since the performance degradation happened in a thread other than the main thread of the application. If requested, I can provide example profiling data of the affected threads confirming the bug.
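For anyone who wants to see the underlying mechanism in isolation: the fd-sharing behaviour described above can be reproduced with a few lines of stdlib Python (a minimal sketch, Linux-only; a bare `select.epoll()` created before `fork()` stands in for the reactor's epoll socket):

```python
import os
import select

# An epoll instance created BEFORE fork() -- mimicking a library that opens
# one as an import side effect in the uWSGI master process.
ep = select.epoll()

# A pipe whose read end the *child* will register with the shared epoll.
r, w = os.pipe()

pid = os.fork()
if pid == 0:
    # Child: register a descriptor with the inherited epoll, make it
    # readable, then exit.
    ep.register(r, select.EPOLLIN)
    os.write(w, b"x")
    os._exit(0)

os.waitpid(pid, 0)
# Parent: it registered nothing itself, yet an event is reported, because
# parent and child share the same kernel epoll object after fork().
events = ep.poll(timeout=1.0)
print(events)
```

The parent's `poll()` returns the child's registration, which is exactly the cross-process confusion the reactor runs into.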
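One possible shape for the suggested change is to defer fd creation until first use, so the `epoll` socket is opened in the worker process after the fork rather than in the master at import time. The class below is a hypothetical illustration of that pattern, not bravado's actual API:

```python
import select


class LazyEpollClient:
    """Hypothetical sketch: create the epoll fd lazily, on first use.

    Constructing this object opens no file descriptor, so a preforking
    server can build it in the master and fork safely; each worker gets
    its own epoll instance the first time it actually polls.
    """

    def __init__(self):
        self._epoll = None  # nothing opened yet: safe to fork

    @property
    def epoll(self):
        if self._epoll is None:
            # Created on first use, i.e. in whichever process calls this
            # after fork() -- not shared across workers.
            self._epoll = select.epoll()
        return self._epoll


client = LazyEpollClient()
# No epoll fd exists yet; uWSGI could fork here without sharing one.
assert client._epoll is None
_ = client.epoll  # the fd comes into existence only now
```

The same idea applies to the import itself: moving `import twisted.web.client` from module scope into the function that needs it would delay the reactor's side effects until after the fork.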