Currently, slow running OpenStack API Requests (either stuck connecting or still waiting for the actual response) from the periodic DataGatherer task will block the HTTPServer connections from being processed.
The reverse is also true, a stalled client of the HTTPServer (e.g. opening a telnet session and not sending a request) will also block both the DataGatherer task and processing of other HTTPServer connections.
Observed Symptoms
Slow or failed prometheus requests
Statistics not being updated as often as you would expect
HTTP 500 responses and BrokenPipeError tracebacks being logged due to later trying to respond to prometheus clients which timed out and disconnected the socket
Cause
This happens because in the current code, we are intending to use the eventlet library for asynchronous non-blocking I/O, but, we are not using it correctly.
All code within the main application and all imported dependencies must import the special eventlet "green" versions of many python libraries (e.g. socket, time, threading, SimpleHTTPServer, etc) which yield to other green threads when they would have blocked waiting for I/O or to sleep. Currently this is not always done, as a result we often block other tasks from running.
In the past we also tried to use a threaded/forked model and avoid eventlet, however the python cinderclient library imports the green eventlet.sleep (unknown to us, I believe this is a bug) and thus we would sometimes get the error "greenlet.error: cannot switch to a different thread".
Fix
Fix this by ensuring the entire application is correctly using eventlet and green patched functions by importing eventlet and using eventlet.patcher.monkey_patch() before importing any other modules. This will automatically intercept every other import and always load the green version of a library.
Testing
To test we now have a working solution, you can
Block access to the Nova API (causes connect to hang for 120 seconds) using this firewall command:
iptables -I OUTPUT -p tcp -m state --state NEW --dport 8774 -j DROP
Make many concurrent and repeated requests using siege:
while true; do siege http://172.16.0.30:9183/metrics -t 5s -c 5 -d 0.1; done
When testing with these changes, I never see us block a server or client connection and all requests take a few milliseconds at most, whether or not the client requests are slow or we open a connection to the server that doesn't send a request.
History Lesson
There have been multiple incorrect attempts to solve this and some related problems. To try and avoid any further such problems, I have comprehensively documented the historical issues and why those fixes have not worked below, both for my understanding and yours :)
eventlet implements asynchronous "non-blocking" socket I/O without any code changes to the application and without using real pthreads by using co-operative "green threads" from the greenlet library.
For this to work correctly, greenlet needs to replace many python standard libraries (e.g. socket, time, threading) with an alternative "green" implementation which intentionally yields execution to other green threads anytime it's expected to block such as when reading data from a file/socket or sleeping.
All code both within the application and all imported dependencies must import these special versions, any code that doesn't won't yield cooperatively and will block other green threads whenever such a blocking function is called.
Explicitly importing all relevant modules from eventlet.green (both in the application and all dependencies)
Automatically during a single import with eventlet.patcher.import_patched - this must be used for every import in the main application
Automatically for all future imports by calling eventlet.patcher.monkey_patch before any other imports. This is the most practical option, as the rest of the code in both the application and it's dependencies can remain unmodified and are magically intercepted to import the eventlet.green version
The original Issue #112 found that the process deadlocked with the following error: greenlet.error: cannot switch to a different thread
At the time, we used a native Python Thread for the DataGatherer class and separately used the ForkingHTTPServer to allow both functions to operate simultaneously with real threads/processes.
We did not intend to use eventlet/green threads at all, however, the python-cinderclient library incorrectly imports eventlet.sleep which results in sometimes using green threads accidentally, hence the error.
We attempted to fix that in #115 by importing the green version of threading.Thread explicitly. This avoided the "cannot switch to a different thread" issue by only using green threads and not mixing Python threads and green threads in the same process.
After merging #115 it was found that the HTTPServer loop never co-operatively yielded to the DataGatherer's thread and the stats were never updated.
To fix this, #116 imported the green version of socket, asyncore and time and also littered a few sleep(0) calls around to force co-operative yielding at various points.
This solution was not complete, because it only imported the green version of some libraries, in some call paths. Plus hacked in some extra yields here and there.
In #124 we switched from ForkingHTTPServer to the normal HTTPServer because sometimes it would fork too many servers and hit the process or system-wide process limit.
Though not noted elsewhere, when I reproduce this issue by connecting many clients using the tool siege to a server where I firewalled the nova API connections, I can see that all of those processes are defunct and not actually alive. This is most likely because the process is blocked and the calls to waitpid which would reap them never happen.
Since we are not using the eventlet version of http.server.HTTPServer, without the forked model, we now block anytime we are handling a server request.
Additionally, anytime the DataGatherer green thread calls out through the OpenStack API libraries, it uses non-patched versions of socket/requests/urllib3 and also blocks the HTTPServer which is now inside the same process.
Currently, slow running OpenStack API Requests (either stuck connecting or still waiting for the actual response) from the periodic DataGatherer task will block the HTTPServer connections from being processed.
The reverse is also true, a stalled client of the HTTPServer (e.g. opening a telnet session and not sending a request) will also block both the DataGatherer task and processing of other HTTPServer connections.
Observed Symptoms
Cause
This happens because in the current code, we are intending to use the eventlet library for asynchronous non-blocking I/O, but, we are not using it correctly.
All code within the main application and all imported dependencies must import the special eventlet "green" versions of many python libraries (e.g. socket, time, threading, SimpleHTTPServer, etc) which yield to other green threads when they would have blocked waiting for I/O or to sleep. Currently this is not always done, as a result we often block other tasks from running.
In the past we also tried to use a threaded/forked model and avoid eventlet, however the python cinderclient library imports the green eventlet.sleep (unknown to us, I believe this is a bug) and thus we would sometimes get the error "greenlet.error: cannot switch to a different thread".
Fix
Fix this by ensuring the entire application is correctly using eventlet and green patched functions by importing eventlet and using eventlet.patcher.monkey_patch() before importing any other modules. This will automatically intercept every other import and always load the green version of a library.
Testing
To test we now have a working solution, you can
Block access to the Nova API (causes connect to hang for 120 seconds) using this firewall command: iptables -I OUTPUT -p tcp -m state --state NEW --dport 8774 -j DROP
Make many concurrent and repeated requests using siege: while true; do siege http://172.16.0.30:9183/metrics -t 5s -c 5 -d 0.1; done
When testing with these changes, I never see us block a server or client connection and all requests take a few milliseconds at most, whether or not the client requests are slow or we open a connection to the server that doesn't send a request.
History Lesson
There have been multiple incorrect attempts to solve this and some related problems. To try and avoid any further such problems, I have comprehensively documented the historical issues and why those fixes have not worked below, both for my understanding and yours :)
eventlet implements asynchronous "non-blocking" socket I/O without any code changes to the application and without using real pthreads by using co-operative "green threads" from the greenlet library.
For this to work correctly, greenlet needs to replace many python standard libraries (e.g. socket, time, threading) with an alternative "green" implementation which intentionally yields execution to other green threads anytime it's expected to block such as when reading data from a file/socket or sleeping. All code both within the application and all imported dependencies must import these special versions, any code that doesn't won't yield cooperatively and will block other green threads whenever such a blocking function is called.
This does not happen automatically, you can find the full details at https://eventlet.readthedocs.io/en/latest/patching.html but as a brief summary this can be done with 3 different methods:
The original Issue #112 found that the process deadlocked with the following error: greenlet.error: cannot switch to a different thread
At the time, we used a native Python Thread for the DataGatherer class and separately used the ForkingHTTPServer to allow both functions to operate simultaneously with real threads/processes.
We did not intend to use eventlet/green threads at all, however, the python-cinderclient library incorrectly imports eventlet.sleep which results in sometimes using green threads accidentally, hence the error.
We attempted to fix that in #115 by importing the green version of threading.Thread explicitly. This avoided the "cannot switch to a different thread" issue by only using green threads and not mixing Python threads and green threads in the same process.
After merging #115 it was found that the HTTPServer loop never co-operatively yielded to the DataGatherer's thread and the stats were never updated.
To fix this, #116 imported the green version of socket, asyncore and time and also littered a few sleep(0) calls around to force co-operative yielding at various points.
This solution was not complete, because it only imported the green version of some libraries, in some call paths. Plus hacked in some extra yields here and there.
In #124 we switched from ForkingHTTPServer to the normal HTTPServer because sometimes it would fork too many servers and hit the process or system-wide process limit.
Though not noted elsewhere, when I reproduce this issue by connecting many clients using the tool
siege
to a server where I firewalled the nova API connections, I can see that all of those processes are defunct and not actually alive. This is most likely because the process is blocked and the calls to waitpid which would reap them never happen.Since we are not using the eventlet version of http.server.HTTPServer, without the forked model, we now block anytime we are handling a server request.
Additionally, anytime the DataGatherer green thread calls out through the OpenStack API libraries, it uses non-patched versions of socket/requests/urllib3 and also blocks the HTTPServer which is now inside the same process.