Closed: EricHorst closed this issue 4 years ago.
Testing locally with the server and vegeta to generate load shows that gunicorn 19.9.0 is not achieving throughput: only a 13.3% success rate.
I realized that there is a sleep(10) before each call to AlertAPI, which ties up a worker for 10 seconds plus whatever time AlertAPI takes to respond.
We're using eventlet, a cooperative green-thread system. In eventlet, all workers run cooperatively and must yield whenever they block, or the other green threads make no progress. It seems we're not yielding on the sleep, and possibly not yielding at all.
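As a minimal stdlib-only sketch of that failure mode (plain generators standing in for green threads; all names here are illustrative, not eventlet internals): a task that yields at each blocking point interleaves fairly with its peers, while a task that runs to completion before yielding starves them, which is exactly what an un-yielding sleep(10) does to the other workers.

```python
def cooperative(name, log):
    # Yields at each "blocking" point, like a properly patched sleep().
    for i in range(3):
        log.append(f"{name}:{i}")
        yield

def greedy(name, log):
    # Does all its work before ever yielding, like a raw time.sleep(10)
    # when monkey patching is broken: the scheduler never regains control.
    for i in range(3):
        log.append(f"{name}:{i}")
    yield

def scheduler(tasks):
    # Round-robin "hub": run each task until it yields or finishes.
    tasks = list(tasks)
    while tasks:
        for t in tasks[:]:
            try:
                next(t)
            except StopIteration:
                tasks.remove(t)

good, bad = [], []
scheduler([cooperative("a", good), cooperative("b", good)])
scheduler([greedy("a", bad), cooperative("b", bad)])
print(good)  # ['a:0', 'b:0', 'a:1', 'b:1', 'a:2', 'b:2'] -- fair interleaving
print(bad)   # ['a:0', 'a:1', 'a:2', 'b:0', 'b:1', 'b:2'] -- "b" is starved
```

In the second run, "b" cannot make any progress until "a" has finished everything, which is why a handful of 10-second sleeps can stall every other request on the worker.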
=== python 3.6 gunicorn 19.9.0
[root@prospero tests]# echo "POST http://127.0.0.1:3080/" | ~erich/bin/vegeta attack -body=alertmanager-alert-ok.json -workers=5 -duration=60s -rate=1/s > results-p3.6-gu19.9.gob
Requests [total, rate, throughput] 60, 1.02, 0.09
Duration [total, attack, wait] 1m28.999820707s, 58.999427518s, 30.000393189s
Latencies [mean, 50, 95, 99, max] 26.933443191s, 30.000322387s, 30.001100842s, 30.001195633s, 30.001201497s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 8472, 141.20
Success [ratio] 13.33%
Status Codes [code:count] 0:52 202:8
Error Set:
Post http://127.0.0.1:3080/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Post http://127.0.0.1:3080/: EOF
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:56621->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:39516->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:41040->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:32898->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:48273->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:34784->127.0.0.1:3080: read: connection reset by peer
Eventlet "monkey patches" all the potentially blocking calls in Python so that they yield to other green threads. (It is also possible to call eventlet.sleep(10) explicitly, but that shouldn't be necessary.)
In issue https://github.com/benoitc/gunicorn/issues/1847, "gunicorn with eventlet doesn't perform eventlet monkey patching since 19.8.0", it is pointed out that gunicorn 19.8.x (released April 2018) and 19.9.x (released July 2018) broke monkey patching. This means that our sleep(10), and likewise any slow outgoing calls to AlertAPI, will not yield to other green threads.
It was fixed in gunicorn release 20.0 (released Nov 23, 2019).
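For reference, a minimal gunicorn.conf.py sketch for running the eventlet worker class (the bind port matches the local tests below; the worker counts are illustrative assumptions, not our production values):

```python
# gunicorn.conf.py -- illustrative values only
bind = "0.0.0.0:3080"
worker_class = "eventlet"    # gunicorn >= 20.0 monkey patches eventlet again
workers = 2                  # OS processes
worker_connections = 1000    # max concurrent greenlets per worker
```

With the eventlet worker class, concurrency per process is governed by worker_connections rather than by the process count, provided blocking calls actually yield.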
=== I upgraded gunicorn to 20.0.4. However, it would start but not run correctly on python 3.7, generating an exception whenever outbound network communication occurred.
In issue https://github.com/eventlet/eventlet/issues/526, "python 3.7 - wrap_socket() got an unexpected keyword argument '_context'", raised Sept 2018 and still ongoing, there is much discussion of monkey patching not working correctly with socketio. am2alertapi uses the requests module for its outbound calls to AlertAPI, and that is where we hit the same error.
The issue ran on for more than a year with various suggested workarounds.
Finally, on about Dec 21, 2019, it was suggested: "Installing pyopenssl 19.1.0 fixed the issue for me."
I installed pyopenssl (not previously installed) and updated gunicorn via requirements.txt:
gunicorn==20.0.4
pyopenssl==19.1.0
This seems to fix the issue.
Here are results running locally on python 3.6. It works great, as it should, since monkey patching was never broken on 3.6:
100% success
=== python 3.6 gunicorn 20.0.4
[root@prospero tests]# echo "POST http://127.0.0.1:3080/" | ~erich/bin/vegeta attack -body=alertmanager-alert-ok.json -duration=60s -rate=1/s > results-p3.6-e20.gob
[root@prospero tests]# ~erich/bin/vegeta report results-p3.6-e20.gob
Requests [total, rate, throughput] 60, 1.02, 0.93
Duration [total, attack, wait] 1m4.654354675s, 58.99999024s, 5.654364435s
Latencies [mean, 50, 95, 99, max] 5.241673923s, 5.249762151s, 9.38993741s, 9.97807986s, 9.989372664s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 63540, 1059.00
Success [ratio] 100.00%
Status Codes [code:count] 202:60
Error Set:
Here's a load test performed in the cluster: 100% success
-bash-4.2$ kubectl exec -it centos-debug -- /bin/bash
[root@centos-debug /]# echo "POST http://am2alertapi-test.mci-monitoring/" | ./vegeta attack -body=alertmanager-alert-ok.json -workers=5 -duration=60s -rate=1/s > results-p3.7-gu20.0.4.gob
[root@centos-debug /]# ./vegeta report results-p3.7-gu20.0.4.gob
Requests [total, rate, throughput] 60, 1.02, 0.93
Duration [total, attack, wait] 1m4.506399679s, 59.000050208s, 5.506349471s
Latencies [mean, 50, 95, 99, max] 4.81096556s, 4.554851923s, 9.563385573s, 9.894919926s, 9.905004801s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 63600, 1060.00
Success [ratio] 100.00%
Status Codes [code:count] 202:60
Error Set:
Fully deployed and seems okay.
am2alertapi running in MCI starts and passes health checks. However, when any amount of load arrives, more than three or four concurrent connections (alert submissions), it stops responding to health-check connections and kubernetes kills it. Rinse and repeat.