I believe this is potentially an issue on the nginx side, as I just sent through two deployments that worked correctly while the web UI didn't work during that process. I assume the frontend can't serve requests properly, but the backend deployment system seemed untouched.
I've added uwsgi checks and balances which should have Harvey recycle its workers before it eats up all the system memory - I did confirm that Harvey has a large memory leak and runs away with everything the OS has. Now workers will kill themselves every hour, once they take a gig of memory, or once they've served a few hundred requests.
As for the root problem, I don't believe uwsgi is the bottleneck: I turned on stats and found it was able to serve requests in 10ms or less every time. This is either due to nginx or how the system is set up to bridge back to the macOS host.
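For reference, the worker-recycling and stats settings look roughly like this in a uwsgi ini file (the thresholds below are illustrative, not necessarily the exact values Harvey uses):

```ini
; Sketch of uwsgi worker recycling + stats; values are illustrative.
[uwsgi]
max-requests = 500          ; recycle a worker after it serves a few hundred requests
max-worker-lifetime = 3600  ; recycle a worker every hour (seconds)
reload-on-rss = 1024        ; recycle a worker once it passes ~1 GB of RSS (MB)
stats = 127.0.0.1:9191      ; stats server used to confirm the ~10ms response times
```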
I have tried every configuration option under the sun and am at a complete loss: adjusting timeouts, process/thread counts, and enabling/disabling buffers and keepalives, all to no avail. Ran some benchmarks and found that nginx's upstream_response_time was sitting at ~10 seconds and timing out every few requests.
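The nginx knobs I kept permuting were roughly the following (a sketch only - the upstream name, port, and values are assumptions, not the actual prod config):

```nginx
# Sketch of the proxy settings I was adjusting; values are illustrative.
upstream harvey {
    server host.docker.internal:5000;  # assumption: where uwsgi/Flask is bound on the host
    keepalive 16;                      # tried with and without upstream keepalive
}

server {
    listen 80;

    location / {
        proxy_pass http://harvey;
        proxy_http_version 1.1;
        proxy_set_header Connection "";  # needed for upstream keepalive to apply
        proxy_connect_timeout 10s;
        proxy_read_timeout    60s;
        proxy_buffering off;             # also tried with buffering enabled
    }
}
```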
I've cornered the problem to uWSGI. When Harvey locked up in prod, I restarted the Docker container that serves the nginx service and tried to re-deploy one of my sites. Harvey was still locked up; since the uWSGI service wasn't restarted and the problem persisted, I believe the issue lies there and not necessarily with the nginx service.
Troubleshooting I've tried:
- nginx problem alone (restarted nginx without restarting uwsgi and it's still locked up)
- nginx + uwsgi problem for this one project, but only a problem when in production - potentially SSL related since that appears to be one of the only changes between local and prod?

It's 100% related to the following two lines of the nginx Docker config:
- "traefik.http.routers.harveyapi.tls=true"
- "traefik.http.routers.harveyapi.tls.certresolver=letsencrypt"
Once I commented out these lines, everything started working again; however, this removes SSL support, which defeats the whole purpose of reverse proxying via nginx to secure the API. Need to figure out what about TLS/SSL/Let's Encrypt and nginx/uwsgi is not playing nicely.
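For context, the letsencrypt resolver those labels reference is typically defined on the Traefik service itself, roughly like this (the email, storage path, and challenge type below are placeholders, not the actual setup):

```yaml
# Sketch of a Traefik certresolver definition; values are placeholders.
command:
  - "--entrypoints.websecure.address=:443"
  - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
  - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
  - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
volumes:
  - "./letsencrypt:/letsencrypt"
```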
Interesting find, potentially related: https://github.com/docker/for-win/issues/8861
The suggestions from this thread ^ do not help my case. Downgrading Docker to 4.5 does not fix this problem, suggesting that Docker itself is not the issue either.
Found some new info:
jhammond@jhammond $ curl -k https://harveyapi.justinpaulhammond.com/health -vvv
* Trying xxx.xxx.xxx.xxx:443...
* Connected to harveyapi.justinpaulhammond.com (xxx.xxx.xxx.xxx) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=harveyapi.justinpaulhammond.com
* start date: Oct 16 02:42:13 2022 GMT
* expire date: Jan 14 02:42:12 2023 GMT
* issuer: C=US; O=Let's Encrypt; CN=R3
* SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /health]
* h2h3 [:scheme: https]
* h2h3 [:authority: harveyapi.justinpaulhammond.com]
* h2h3 [user-agent: curl/7.84.0]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7fef9f00bc00)
> GET /health HTTP/2
> Host: harveyapi.justinpaulhammond.com
> user-agent: curl/7.84.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 504
< content-type: text/html
< date: Tue, 15 Nov 2022 05:58:35 GMT
< server: nginx/1.23.1
< content-length: 167
<
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.23.1</center>
</body>
</html>
* Connection #0 to host harveyapi.justinpaulhammond.com left intact
It appears that the TLS handshake, which I initially thought was the bottleneck, is not the problem: it completes, and the request hangs after it. I'm getting the message Connection state changed (MAX_CONCURRENT_STREAMS == 250)! along with a 504, which leads me to believe we are maxing out the connections and uwsgi/nginx then fail to serve more.
It's interesting to note that Connection #0 to host harveyapi.justinpaulhammond.com left intact is stated too. Maybe the connections aren't closing at all.
Found out that nginx is closing connections due to uwsgi not responding in time. I swear I've tried every permutation at this point and wonder if this isn't some shortcoming in how this runs on macOS.
Ran the bare Flask server (instead of uwsgi) behind nginx and the problem persisted, which should rule out uwsgi as the problem. I wonder if this is a shortcoming of the host.docker.internal DNS record used to connect from Docker to the host machine (macOS in my case).
I even tried limiting the connections that nginx would allow at one time, thinking there were too many connections open at once, but the timeout issues still occur for some requests that weren't rate limited.
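That connection limiting was roughly of this shape (the zone name and the limit of 10 are illustrative values, not the exact config I used):

```nginx
# Roughly the per-client connection limiting I tried; values are illustrative.
http {
    limit_conn_zone $binary_remote_addr zone=perip:10m;

    server {
        listen 80;
        limit_conn perip 10;  # cap simultaneous connections per client IP
    }
}
```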
I also spun up two instances of nginx and load balanced them, thinking that maybe nginx couldn't handle the load, but that didn't do a thing.
All of this keeps pushing me further into thinking it's a shortcoming with how networking works from containers to the host on macOS.
EDIT:
HA! Finally thought to try hitting the Python service locally on the server it's running from, so requests don't need to go through the nginx container running on Docker. I can consistently spam 10,000 back-to-back requests to the service (uwsgi and Flask) and they all come back with 200 status codes, whereas this would grind the prod service through nginx to an absolute halt. The problem lies with the nginx config and/or the shortcomings of Docker on macOS.
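The local check was essentially this (the URL/port is an assumption about where the service is bound on the host, not the real address):

```python
# Quick sanity check hammering the service directly on the host,
# bypassing the nginx container entirely.
import requests

URL = "http://127.0.0.1:5000/health"  # assumption: local bind address of uwsgi/Flask
failures = 0

for _ in range(10_000):
    try:
        response = requests.get(URL, timeout=5)
        if response.status_code != 200:
            failures += 1
    except requests.RequestException:
        failures += 1

print(f"non-200 responses: {failures}")  # consistently 0 when hitting the host directly
```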
For anyone following this story in the future, you won't believe what the fix was.
I updated Docker.
Mind you, I keep my software up to date. It took Docker roughly six months to patch this bug; only in the last couple of weeks did it go away. Between the new virtualization framework on macOS and a bug that would prematurely close connections via Docker, this problem was finally fixed. I didn't need to change anything on my end.
Harvey is still locking up randomly (much less than it did in the past), but after about a week of running as a daemon, it'll lock up and stop returning responses quickly or at all. The switch to uwsgi certainly helped, as did daemonizing it; however, those items alone weren't enough. I'm unsure if the problem is due to Harvey itself (uwsgi) or the nginx reverse proxy sitting in front of it. Either way, the config for both resides in this project and needs to be fixed.