UWIT-UE / am2alertapi

Prometheus alertmanager to UW alertAPI
GNU General Public License v3.0

am2alertapi pod restarts under load #4

Closed EricHorst closed 4 years ago

EricHorst commented 4 years ago

am2alertapi running in MCI starts and answers health checks okay. However, when any real load arrives, more than three or four concurrent connections (alert submissions), it stops responding to health check connections and Kubernetes kills it. Rinse and repeat.

EricHorst commented 4 years ago

Testing locally with the server and vegeta to generate load shows that throughput is not happening with gunicorn 19.9.0: only 13.3% success.

I realized that there is a sleep(10) before each call to alertapi, which ties up a worker for 10 seconds plus whatever time alertapi takes to respond.

We're using eventlet, which is a cooperative green-thread system. In eventlet, all workers co-run and must yield when blocked, or other green threads don't make progress. It seems we're not yielding on the sleep, or maybe not at all.
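A minimal sketch of the failure mode, assuming illustrative handler names and payload rather than the actual am2alertapi code: if the sleep does not yield, one slow request starves every other green thread in the worker, including whatever serves the health check.

# Minimal sketch of the failure mode; names and payload are illustrative,
# not taken from am2alertapi.
import time
import eventlet

def handle_alert(alert):
    # An unpatched time.sleep() blocks the whole eventlet hub for 10 seconds,
    # so no other green thread in this worker can run while it waits.
    time.sleep(10)
    # ... forward the alert to AlertAPI here ...

def health_check():
    # Never gets a turn while handle_alert() is stuck in the blocking sleep,
    # so Kubernetes sees a dead endpoint and restarts the pod.
    return "OK"

pool = eventlet.GreenPool()
pool.spawn(handle_alert, {"status": "firing"})
pool.spawn(health_check)
pool.waitall()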

=== python 3.6 gunicorn 19.9.0

[root@prospero tests]# echo "POST http://127.0.0.1:3080/" | ~erich/bin/vegeta attack -body=alertmanager-alert-ok.json -workers=5 -duration=60s -rate=1/s > results-p3.6-gu19.9.gob
Requests      [total, rate, throughput]  60, 1.02, 0.09
Duration      [total, attack, wait]      1m28.999820707s, 58.999427518s, 30.000393189s
Latencies     [mean, 50, 95, 99, max]    26.933443191s, 30.000322387s, 30.001100842s, 30.001195633s, 30.001201497s
Bytes In      [total, mean]              0, 0.00
Bytes Out     [total, mean]              8472, 141.20
Success       [ratio]                    13.33%
Status Codes  [code:count]               0:52  202:8
Error Set:
Post http://127.0.0.1:3080/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Post http://127.0.0.1:3080/: EOF
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:56621->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:39516->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:41040->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:32898->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:48273->127.0.0.1:3080: read: connection reset by peer
Post http://127.0.0.1:3080/: read tcp 127.0.0.1:34784->127.0.0.1:3080: read: connection reset by peer

EricHorst commented 4 years ago

Eventlet "monkey patches" all the potentially blocking calls in python in order that they will yield to other green threads. (It is also possible to call eventlet.sleep(10) but shouldn't be necessary.)

In gunicorn issue https://github.com/benoitc/gunicorn/issues/1847, "gunicorn with eventlet doesn't perform eventlet monkey patching since 19.8.0", it is pointed out that gunicorn 19.8.x (released April 2018) and 19.9.x (released July 2018) broke monkey patching. This would mean that our sleep(10), and also any outgoing calls to AlertAPI that may be slow, do not yield to other green threads.

It was fixed in gunicorn release 20.0 (released Nov 23, 2019).
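A quick way to confirm what a given gunicorn/eventlet combination actually does is to ask eventlet directly at startup; a sketch only, with the logging call as an assumption about how the app reports things:

# Sketch: log whether the eventlet worker really monkey patched the blocking
# modules. Per the gunicorn issue above, 19.8.x/19.9.x would report False here.
import logging
import eventlet.patcher

def log_patch_status():
    for mod in ("time", "socket", "select", "thread"):
        logging.info("eventlet monkey patched %s: %s",
                     mod, eventlet.patcher.is_monkey_patched(mod))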

===

I upgraded gunicorn to 20.0.4. However, this would start but not run correctly on Python 3.7, generating an exception whenever outbound network communication occurred.

In eventlet issue https://github.com/eventlet/eventlet/issues/526, "python 3.7 - wrap_socket() got an unexpected keyword argument '_context'" (raised Sept 2018 and still ongoing), there is a lot of discussion about monkey patching not working correctly with socketio. am2alertapi uses the requests module for outbound calls to alertapi, and that is where we hit the same error.
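For context, the outbound call is roughly of this shape (the endpoint URL and payload are placeholders, not the real AlertAPI details); with Python 3.7, eventlet's green ssl, and no pyopenssl installed, the HTTPS request is where the TypeError surfaces:

# Hypothetical sketch of the outbound call; URL and payload are placeholders.
import requests

def post_to_alertapi(alert):
    # Under Python 3.7 + eventlet monkey patching without pyopenssl this raised:
    #   TypeError: wrap_socket() got an unexpected keyword argument '_context'
    resp = requests.post("https://alertapi.example.edu/v1/alert", json=alert, timeout=30)
    resp.raise_for_status()
    return resp.status_code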

The issue ran on for more than a year with various suggested workarounds.

Finally, around Dec 21, 2019, someone suggests: "Installing pyopenssl 19.1.0 fixed the issue for me."

I installed pyopenssl (which was not installed before) via requirements.txt and updated gunicorn to 20.0.4:

gunicorn==20.0.4
pyopenssl==19.1.0

This seems to fix the issue.
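For completeness, the worker class still has to be eventlet for monkey patching to be relevant at all; the config below is an assumed gunicorn.conf.py sketch, not taken from the repo (the port just matches the local tests above):

# Hypothetical gunicorn.conf.py; bind address and worker count are assumptions,
# only worker_class = "eventlet" is implied by the discussion above.
bind = "0.0.0.0:3080"
workers = 2
worker_class = "eventlet"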

EricHorst commented 4 years ago

Here are results running locally on Python 3.6. Works great, as it should, since monkey patching is not broken under 3.6.

100% success

=== python 3.6 gunicorn 20.0.4

[root@prospero tests]# echo "POST http://127.0.0.1:3080/" | ~erich/bin/vegeta attack -body=alertmanager-alert-ok.json -duration=60s -rate=1/s > results-p3.6-e20.gob
[root@prospero tests]# ~erich/bin/vegeta report results-p3.6-e20.gob
Requests      [total, rate, throughput]  60, 1.02, 0.93
Duration      [total, attack, wait]      1m4.654354675s, 58.99999024s, 5.654364435s
Latencies     [mean, 50, 95, 99, max]    5.241673923s, 5.249762151s, 9.38993741s, 9.97807986s, 9.989372664s
Bytes In      [total, mean]              0, 0.00
Bytes Out     [total, mean]              63540, 1059.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               202:60
Error Set:

EricHorst commented 4 years ago

Here's a load test performed in the cluster: 100% success.

-bash-4.2$ kubectl exec -it centos-debug -- /bin/bash
[root@centos-debug /]# echo "POST http://am2alertapi-test.mci-monitoring/" | ./vegeta attack -body=alertmanager-alert-ok.json -workers=5 -duration=60s -rate=1/s > results-p3.7-gu20.0.4.gob
[root@centos-debug /]# ./vegeta report results-p3.7-gu20.0.4.gob
Requests      [total, rate, throughput]  60, 1.02, 0.93
Duration      [total, attack, wait]      1m4.506399679s, 59.000050208s, 5.506349471s
Latencies     [mean, 50, 95, 99, max]    4.81096556s, 4.554851923s, 9.563385573s, 9.894919926s, 9.905004801s
Bytes In      [total, mean]              0, 0.00
Bytes Out     [total, mean]              63600, 1060.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               202:60
Error Set:

EricHorst commented 4 years ago

Fully deployed and seems okay.