Justintime50 / harvey

The lightweight Docker Compose deployment runner.
https://github.com/Justintime50/harvey-ui
MIT License

Harvey Locking Up After Running for Days #67

Closed: Justintime50 closed this issue 2 years ago

Justintime50 commented 2 years ago

Harvey is still locking up randomly (much less often than it did in the past): after about a week of running as a daemon, it locks up and stops returning responses quickly, or at all. The switch to uwsgi certainly helped, as did daemonizing it; however, those changes alone weren't enough. I'm unsure whether the problem is Harvey itself (uwsgi) or the nginx reverse proxy sitting in front of it. Either way, the configs for both reside in this project and need to be fixed.

Justintime50 commented 2 years ago

I believe this is potentially an issue on the nginx side, as I just sent through two deployments that completed correctly while the web UI didn't work. It seems the frontend can't serve requests properly, but the backend deployment system appeared untouched.

Justintime50 commented 2 years ago

I've added uwsgi checks/balances which should have Harvey cannibalize its workers before they eat up all the system memory - I did confirm that Harvey has a large memory leak and runs away with everything the OS has. Workers will now kill themselves every hour, once they take a gig of memory, or after they serve a few hundred requests.
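
Roughly, in uwsgi.ini terms, the recycling settings look like this (the thresholds below are illustrative, not necessarily the exact values in the repo):

    ; worker-recycling sketch - values are illustrative
    [uwsgi]
    max-requests = 500          ; recycle a worker after it serves this many requests
    max-worker-lifetime = 3600  ; recycle a worker after an hour regardless
    reload-on-rss = 1024        ; recycle a worker once its RSS passes ~1 GB (value is in MB)
    worker-reload-mercy = 60    ; give a recycling worker time to finish in-flight requests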

As for the root problem, I don't believe uwsgi is the bottleneck: with stats turned on, it was able to serve requests in 10ms or less every time. This is either due to nginx or to how the system is set up to bridge back to the macOS host.
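
The stats piece is just uwsgi's built-in stats server, roughly as below (the socket address is an assumption); something like uwsgitop can then watch per-worker request times:

    ; stats sketch - socket address is an assumption
    [uwsgi]
    stats = 127.0.0.1:9191
    memory-report = true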

Justintime50 commented 2 years ago

I have tried every configuration option under the sun and am at a complete loss. I tried adjusting timeouts, process/thread counts, and enabling/disabling buffers and keepalives, to no avail. I ran some benchmarks and found that nginx's upstream_response_time was sitting at ~10 seconds and timing out every few requests.
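
For context, the kinds of knobs I was turning, plus the log format that exposes upstream_response_time (the values are illustrative, the backend address is an assumption, and plain HTTP proxying is assumed here rather than the uwsgi protocol):

    http {
        # log the upstream's response time next to the total request time
        log_format upstream_time '$remote_addr "$request" $status '
                                 'rt=$request_time urt=$upstream_response_time';

        server {
            listen 80;
            access_log /var/log/nginx/access.log upstream_time;

            location / {
                proxy_pass            http://host.docker.internal:5000;  # assumed backend address
                proxy_connect_timeout 5s;
                proxy_read_timeout    30s;
                proxy_send_timeout    30s;
                proxy_buffering       off;
            }
        }
    }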

Justintime50 commented 2 years ago

I've narrowed the problem down to uWSGI. When Harvey locked up in prod, I restarted the Docker container that serves the nginx service and tried to re-deploy one of my sites. Harvey was still locked up. Because the uWSGI service wasn't restarted and the problem persisted, I believe the issue lies there and not necessarily with the nginx service.

Justintime50 commented 2 years ago

Troubleshooting I've tried:

  1. Not a DNS issue
  2. Not a networking issue
  3. Not an nginx problem alone (restarted nginx without restarting uwsgi and it was still locked up)
  4. I threw 2000 concurrent requests at the app locally via nginx and uwsgi and it always served them in less than a second; throwing even 100 requests at prod locks it up (see the load-test sketch after this list)
  5. I can throw 2000 requests at my personal website, hosted on the same server and Docker instance in the same network, and they all serve very quickly without locking up. That means this is an nginx + uwsgi problem for this one project, and only in production - potentially SSL related, since that appears to be one of the only differences between local and prod?
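
The load tests above were along these lines (not the exact commands; the local port is an assumption):

    # illustrative load test; ab is ApacheBench, hey or wrk would work the same way
    ab -n 2000 -c 100 http://localhost:8080/health                    # local: finishes in under a second
    ab -n 100 -c 10 https://harveyapi.justinpaulhammond.com/health    # prod: hangs and 504s
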
Justintime50 commented 2 years ago

It's 100% related to the following two lines of the nginx Docker config:

      - "traefik.http.routers.harveyapi.tls=true"
      - "traefik.http.routers.harveyapi.tls.certresolver=letsencrypt"

Once I commented out these lines, everything started working again; however, that removes SSL support, which defeats the whole purpose of reverse proxying via nginx to secure the API. I need to figure out what about TLS/SSL/Let's Encrypt and nginx/uwsgi is not playing nice.
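
Roughly where those lines sit in the compose file (the service name, image, and surrounding labels here are an approximation):

    services:
      nginx:
        image: nginx:latest
        labels:
          - "traefik.enable=true"
          - "traefik.http.routers.harveyapi.rule=Host(`harveyapi.justinpaulhammond.com`)"
          # commenting these two out is what got things responding again (at the cost of SSL)
          # - "traefik.http.routers.harveyapi.tls=true"
          # - "traefik.http.routers.harveyapi.tls.certresolver=letsencrypt"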

Justintime50 commented 2 years ago

Interesting find, potentially related: https://github.com/docker/for-win/issues/8861

Justintime50 commented 2 years ago

The suggestions from that thread ^ do not help my case. Downgrading Docker to 4.5 does not fix the problem, suggesting that Docker itself is not the issue either.

Justintime50 commented 2 years ago

Found some new info:

jhammond@jhammond $ curl -k https://harveyapi.justinpaulhammond.com/health -vvv
*   Trying xxx.xxx.xxx.xxx:443...
* Connected to harveyapi.justinpaulhammond.com (xxx.xxx.xxx.xxx) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=harveyapi.justinpaulhammond.com
*  start date: Oct 16 02:42:13 2022 GMT
*  expire date: Jan 14 02:42:12 2023 GMT
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /health]
* h2h3 [:scheme: https]
* h2h3 [:authority: harveyapi.justinpaulhammond.com]
* h2h3 [user-agent: curl/7.84.0]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7fef9f00bc00)
> GET /health HTTP/2
> Host: harveyapi.justinpaulhammond.com
> user-agent: curl/7.84.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 504
< content-type: text/html
< date: Tue, 15 Nov 2022 05:58:35 GMT
< server: nginx/1.23.1
< content-length: 167
<
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.23.1</center>
</body>
</html>
* Connection #0 to host harveyapi.justinpaulhammond.com left intact

It appears that the TLS handshake, which I initially thought was the bottleneck, is not the problem: it completes, and the request hangs after it. I'm getting Connection state changed (MAX_CONCURRENT_STREAMS == 250)! along with a 504, which leads me to believe we are maxing out connections and uwsgi/nginx then fail to serve more.

It's also interesting that Connection #0 to host harveyapi.justinpaulhammond.com left intact is reported. Maybe the connections aren't being closed at all.

Justintime50 commented 2 years ago

Found out that nginx is closing connections because uwsgi isn't responding in time. I swear I've tried every permutation at this point and wonder if this is some shortcoming in how this runs on macOS.
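
Something like this is what confirms it from the nginx side (the container name is an assumption); the classic symptom is nginx's "upstream timed out ... while reading response header from upstream" message:

    docker logs harvey-nginx 2>&1 | grep "upstream timed out"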

Justintime50 commented 2 years ago

I ran the bare Flask server (instead of uwsgi) behind nginx and the problem persisted, which should rule out uwsgi as the problem. I wonder if this is a shortcoming of the host.docker.internal DNS record used to connect from Docker to the host machine (macOS in my case).
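
One way to poke at that bridge directly, bypassing nginx entirely (the backend port is an assumption):

    # curl from a throwaway container straight at the host; on Docker Desktop for Mac,
    # host.docker.internal resolves to the macOS host
    docker run --rm curlimages/curl -sv http://host.docker.internal:5000/health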

I even tried limiting the number of connections nginx would allow at one time, thinking too many were open at once, but the timeout issues still occurred for some requests that weren't rate limited.
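
The connection limiting was along these lines (the zone size and per-IP cap are illustrative):

    http {
        limit_conn_zone $binary_remote_addr zone=per_ip:10m;

        server {
            location / {
                limit_conn per_ip 10;  # cap simultaneous connections per client IP
                proxy_pass http://host.docker.internal:5000;  # assumed backend address
            }
        }
    }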

I also spun up two instances of nginx and load balanced them, thinking that maybe a single nginx couldn't handle the load, but that didn't do a thing.

All of this keeps pushing me further into thinking it's a shortcoming with how networking works from containers to the host on macOS.

EDIT:

HA! I finally thought to try hitting the Python service locally on the server it's running from, so requests don't go through the nginx container running in Docker. I can consistently spam 10,000 back-to-back requests at the service (uwsgi and Flask) and they all come back with 200 status codes, whereas the same load grinds the prod service behind nginx to an absolute halt. The problem lies with the nginx config and/or the shortcomings of Docker on macOS.
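
The local spam was essentially this, run on the host itself (the port is an assumption):

    # hit the backend directly, skipping the nginx container entirely, and tally
    # the status codes for 10,000 back-to-back requests
    for i in $(seq 1 10000); do
        curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:5000/health
    done | sort | uniq -c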

Justintime50 commented 2 years ago

For anyone following this story in the future, you won't believe what the fix was.

I updated Docker.

Mind you, I keep my software up to date. It took Docker ~6 months to patch this bug; only in the last couple of weeks did it go away. Between the new virtualization framework on macOS and a bug that would prematurely close connections in Docker, this problem was finally fixed. I didn't actually need to change anything on my end.