Message in d4s2-app container logs:
[2019-06-12 13:43:41 -0400] [16] [CRITICAL] WORKER TIMEOUT (pid:57)
[2019-06-12 17:43:41 +0000] [57] [INFO] Worker exiting (pid: 57)
[2019-06-12 13:43:41 -0400] [59] [INFO] Booting worker with pid: 59
ERROR:django.security.DisallowedHost:Invalid HTTP_HOST header: 'datadelivery-fe.genome.duke.edu'. You may need to add 'datadelivery-fe.genome.duke.edu' to ALLOWED_HOSTS.
Could be a result of the load balancer?
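If it turns out to just be the host header coming from the load balancer, the fix suggested by the error message would look something like this in the Django settings (a sketch only; the first entry is an assumption about what's already there):

```python
# settings.py (sketch -- the existing entry shown here is an assumption)
ALLOWED_HOSTS = [
    "datadelivery.genome.duke.edu",     # assumed current public hostname
    "datadelivery-fe.genome.duke.edu",  # host header seen behind the load balancer
]
```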
Message in nginx container logs:
10.12.55.230 - - [12/Jun/2019:17:48:17 +0000] "GET /download/dds-projects/a835181b-472a-45f9-b4bc-0ba7b097e6ae/prov_art_project%202017-04-27%2020:27.zip HTTP/1.1" 200 1246610 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15"
2019/06/12 17:48:17 [info] 5#5: *14360 client 10.12.55.230 closed keepalive connection
One important difference between dev and prod is that prod uses an external load balancer, with the IP mentioned above. I wonder if it's worth troubleshooting that connection, or if we should instead be thinking about how to move this application to Kubernetes.
Created RT 6129, but after checking again to verify behavior, downloads are working quickly. Not sure what fixed it - maybe upgrading to Python 3 (#211)?
Reopening due to failures on both dev and prod with the nginx proxy. In both cases, attempting to download a 200GB dataset fails after about 1GB. During the presentation the download ran at about 2-3MB/sec, so it took several minutes to fail. Back at my desk, where the download runs at 20-30MB/sec, the failure occurred after 37s. Since both failed around the same size despite very different speeds, this suggests the issue is related to size rather than timing.
From project presentation feedback:
Research into "client closed keepalive connection" and "an upstream response is buffered to a temporary file". Perhaps nginx stream configuration. Look at exposing the Python zip server via a Docker port. Is there a keep-alive setting in nginx?
Reproducing on July 2, 2019 from Ethernet and VPN connections.
nginx logs at download start:
2019/07/02 13:31:02 [warn] 9#9: *358 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/9/01/0000000019 while reading upstream, client: 10.14.47.140, server: , request: "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1", upstream: "http://172.18.0.5:8000/download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip", host: "datadelivery-dev.gcb.duke.edu", referrer: "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5"
37s later, the download fails at 1.09GB and the following appears in the nginx logs:
2019/07/02 13:31:39 [info] 6#6: *360 client 10.14.47.140 closed keepalive connection
10.14.47.140 - - [02/Jul/2019:13:31:47 +0000] "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1" 200 1085440615 "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15"
2019/07/02 13:31:48 [info] 9#9: *358 client 10.14.47.140 closed keepalive connection
The only log entries in the Python container coincide with the failure time:
[2019-07-02 09:31:30 -0400] [16] [CRITICAL] WORKER TIMEOUT (pid:23)
[2019-07-02 13:31:30 +0000] [23] [INFO] Worker exiting (pid: 23)
[2019-07-02 09:31:30 -0400] [25] [INFO] Booting worker with pid: 25
Some research into the timeout points to the gunicorn timeout setting. gunicorn is the web server we're using inside the Python container to serve the d4s2 Django app:
https://stackoverflow.com/a/16127679/595085
By default, the gunicorn timeout is 30s. While that's not exactly the timing we're seeing, here's my theory: when the download starts, nginx fills its buffer (~1GB) with upstream data from gunicorn as it serves it to the client. While nginx is sending the buffered data, it waits to request more from upstream. I'm guessing that wait is over 30s, so by the time nginx is ready to ask for more data, the upstream connection has timed out.
I'll test this theory by increasing the gunicorn timeout, then see if there are other configs/settings that fix it.
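A minimal sketch of that change, assuming gunicorn is started with a config file in the d4s2-app container (the file name, bind address, and worker count are assumptions):

```python
# gunicorn.conf.py (sketch)
bind = "0.0.0.0:8000"  # matches the upstream port nginx proxies to
workers = 2            # assumed; keep whatever the container uses today
timeout = 120          # default is 30s, which lines up with the WORKER TIMEOUT above
```

The same thing could be done with a `--timeout 120` flag on the existing gunicorn command.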
Another issue pointed me to turning off the nginx proxy caching: https://stackoverflow.com/a/9784409/595085
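Roughly what that might look like for the download location in our nginx config (the location path is taken from the request URLs above; the upstream name and timeout value are assumptions):

```nginx
# sketch -- upstream name and timeout value are assumptions
location /download/ {
    proxy_pass http://d4s2-app:8000;

    # Stream the response to the client instead of spooling it to
    # /var/cache/nginx/proxy_temp ("an upstream response is buffered
    # to a temporary file").
    proxy_buffering off;

    # Timeout between successive reads from the upstream (default 60s).
    proxy_read_timeout 300s;
}
```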
After increasing the timeout to 120s, the download failed after 2.63GB. Note that the time between the "upstream response is buffered" message marking the start of the download and the GET 200 entry is 119s:
2019/07/02 14:11:07 [warn] 6#6: *1 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/1/00/0000000001 while reading upstream, client: 10.14.47.140, server: , request: "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1", upstream: "http://172.18.0.5:8000/download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip", host: "datadelivery-dev.gcb.duke.edu", referrer: "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5"
2019/07/02 14:11:35 [info] 6#6: *5 client 10.14.47.140 closed keepalive connection
10.14.47.140 - - [02/Jul/2019:14:13:06 +0000] "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1" 200 2633557984 "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15"
2019/07/02 14:13:06 [info] 6#6: *1 client 10.14.47.140 closed keepalive connection
The gunicorn docs describe the configuration needed to enable streaming downloads through an nginx proxy.
Initially I'll apply these configurations for all of D4S2, but we could rather easily limit it to just downloads by booting up a second container.
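For reference, a sketch of the gunicorn side of those changes, following the deploy docs' note that streaming responses through nginx with proxy buffering off should use an async worker class (the specific class and counts are assumptions, and gevent would need to be installed in the image):

```python
# gunicorn.conf.py (sketch -- worker class and counts are assumptions)
worker_class = "gevent"  # an async worker, per the gunicorn deploy docs
workers = 2
timeout = 120            # keep the longer timeout from the earlier test
```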
After implementing these changes, I was able to get 70GB into a 190GB download. It ran for nearly 2 hours before I canceled it to avoid filling up my local disk :)
Though dev and prod are running the same release, downloads are not usable on the production server. They are slow and time out early.
An example 500MB project (prov_art_project 2017-04-27) consists of 6 files and fails after a few hundred KB on prod. It doesn't consistently fail at the same point and the download speed is below 100KB/sec.
Relative path:
/download/dds-projects/a835181b-472a-45f9-b4bc-0ba7b097e6ae/prov_art_project%202017-04-27%2020:27.zip
From datadelivery-dev, it downloads at about 20MB/sec and completes in under 30s.