Duke-GCB / D4S2

Web service to facilitate notification and transfer of projects in DukeDS
MIT License

Zip downloads failing on prod #212

Closed: dleehr closed this issue 5 years ago

dleehr commented 5 years ago

Though dev and prod are running the same release, downloads are not usable on the production server: they are slow and time out early.

An example 500MB project (prov_art_project 2017-04-27) consists of 6 files; its download fails after a few hundred KB on prod, not consistently at the same point, and the download speed is below 100KB/sec.

Relative path: /download/dds-projects/a835181b-472a-45f9-b4bc-0ba7b097e6ae/prov_art_project%202017-04-27%2020:27.zip

From datadelivery-dev, the same project downloads at about 20MB/sec and completes in under 30s.

dleehr commented 5 years ago

Message in d4s2-app container logs:

[2019-06-12 13:43:41 -0400] [16] [CRITICAL] WORKER TIMEOUT (pid:57)
[2019-06-12 17:43:41 +0000] [57] [INFO] Worker exiting (pid: 57)
[2019-06-12 13:43:41 -0400] [59] [INFO] Booting worker with pid: 59
ERROR:django.security.DisallowedHost:Invalid HTTP_HOST header: 'datadelivery-fe.genome.duke.edu'. You may need to add 'datadelivery-fe.genome.duke.edu' to ALLOWED_HOSTS.

Could be a result of the load balancer?
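If the load balancer forwards requests with that hostname in the Host header, Django will reject them until the name is added to ALLOWED_HOSTS. A minimal sketch of the settings change, assuming a standard Django settings module (the actual D4S2 settings layout may differ):

# settings.py (sketch) - allow the hostname reported in the DisallowedHost error
ALLOWED_HOSTS = [
    'datadelivery-fe.genome.duke.edu',  # hostname seen on requests arriving via the load balancer
    # ... plus whatever public hostnames are already configured
]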

dleehr commented 5 years ago

Message in nginx container logs:

10.12.55.230 - - [12/Jun/2019:17:48:17 +0000] "GET /download/dds-projects/a835181b-472a-45f9-b4bc-0ba7b097e6ae/prov_art_project%202017-04-27%2020:27.zip HTTP/1.1" 200 1246610 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15"
2019/06/12 17:48:17 [info] 5#5: *14360 client 10.12.55.230 closed keepalive connection

One important difference between dev and prod is that prod uses an external load balancer, with the IP mentioned above. I wonder if it's worth troubleshooting that connection, or if we should instead be thinking about how to move this application to Kubernetes.

dleehr commented 5 years ago

Created RT 6129, but after checking again to verify the behavior, downloads are working quickly. Not sure what fixed it; maybe upgrading to Python 3 (#211)?

dleehr commented 5 years ago

Reopening due to failures on both dev and prod with the nginx proxy. In both cases, attempting to download a 200GB dataset fails after about 1GB. During the presentation this took several minutes because the download ran at about 2-3MB/sec; back at my desk, where the download runs at 20-30MB/sec, the failure occurred after 37s. Both failures land around the 1GB mark despite very different elapsed times, which suggests the issue is related to the amount transferred and not the timing.

From project presentation feedback:

Research the “client closed keepalive connection” and “an upstream response is buffered to a temporary file” messages. Perhaps the nginx stream configuration? Look at exposing the Python zip server directly via a Docker port. Is there a keepalive setting in nginx?

Reproducing on July 2, 2019 over an ethernet and VPN connection:

nginx logs at download start:

2019/07/02 13:31:02 [warn] 9#9: *358 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/9/01/0000000019 while reading upstream, client: 10.14.47.140, server: , request: "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1", upstream: "http://172.18.0.5:8000/download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip", host: "datadelivery-dev.gcb.duke.edu", referrer: "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5"

37s later, the download fails at 1.09GB and the following appears in the nginx logs:

2019/07/02 13:31:39 [info] 6#6: *360 client 10.14.47.140 closed keepalive connection
10.14.47.140 - - [02/Jul/2019:13:31:47 +0000] "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1" 200 1085440615 "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15"
2019/07/02 13:31:48 [info] 9#9: *358 client 10.14.47.140 closed keepalive connection

The only log entries in the python container coincide with the failure time:

[2019-07-02 09:31:30 -0400] [16] [CRITICAL] WORKER TIMEOUT (pid:23)
[2019-07-02 13:31:30 +0000] [23] [INFO] Worker exiting (pid: 23)
[2019-07-02 09:31:30 -0400] [25] [INFO] Booting worker with pid: 25

dleehr commented 5 years ago

Some research into the timeout points to the gunicorn timeout setting. gunicorn is the web server we're using inside the python container to serve the d4s2 django app:

https://stackoverflow.com/a/16127679/595085

By default, the gunicorn timeout is 30s. While that doesn't exactly match the timing we're seeing, here's my theory: when the download starts, nginx fills up its buffer (~1GB) with upstream data from gunicorn while serving it to the client. While nginx is draining that buffered data to the client, it stops requesting more from upstream. I'm guessing that pause runs over 30s, so the worker is killed (the WORKER TIMEOUT above), and by the time nginx is ready to ask for more data the upstream connection has timed out.

I'll test this theory by increasing the gunicorn timeout, then see if there are other configs/settings that fix it.
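For reference, a minimal sketch of that experiment, assuming gunicorn is configured via a gunicorn.conf.py file (the d4s2-app container may instead pass these as command-line flags, e.g. gunicorn --timeout 120):

# gunicorn.conf.py (sketch)
bind = '0.0.0.0:8000'  # matches the upstream port nginx proxies to
timeout = 120          # default is 30s; raised here to test the worker-timeout theory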

dleehr commented 5 years ago

Another issue pointed me to turning off the nginx proxy caching: https://stackoverflow.com/a/9784409/595085

dleehr commented 5 years ago

After increasing the timeout to 120s, the download failed after 2.63GB. Note that the time between the "an upstream response is buffered" message (marking the start of the download) and the GET 200 entry is 119s, just under the new 120s timeout:

2019/07/02 14:11:07 [warn] 6#6: *1 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/1/00/0000000001 while reading upstream, client: 10.14.47.140, server: , request: "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1", upstream: "http://172.18.0.5:8000/download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip", host: "datadelivery-dev.gcb.duke.edu", referrer: "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5"

2019/07/02 14:11:35 [info] 6#6: *5 client 10.14.47.140 closed keepalive connection
10.14.47.140 - - [02/Jul/2019:14:13:06 +0000] "GET /download/dds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5/Sharma-s3.zip HTTP/1.1" 200 2633557984 "https://datadelivery-dev.gcb.duke.edu/duke-ds-projects/cab58630-89d7-4e4e-81d1-6436766a63c5" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15"
2019/07/02 14:13:06 [info] 6#6: *1 client 10.14.47.140 closed keepalive connection

dleehr commented 5 years ago

The gunicorn docs describe the configuration required to enable streaming downloads through an nginx proxy:

  1. Turn off proxy buffering for the streaming location: https://github.com/Duke-GCB/gcb-ansible-roles/commit/dcf9e215d9509543ab0d05cadb52b3b570949bd2
  2. Use an async worker class for gunicorn (such as gevent); see the sketch after this list. The default sync workers must complete each request within the timeout, so raising the timeout alone isn't enough to fix the issue.
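A minimal sketch of the gunicorn side of that configuration, again assuming a gunicorn.conf.py file and the gevent worker class (worker count is illustrative; the nginx side, turning off proxy buffering for the download location, is the ansible change linked in item 1):

# gunicorn.conf.py (sketch) - streaming-friendly settings
worker_class = 'gevent'  # async workers; requires the gevent package in the d4s2-app image
workers = 2              # illustrative count, not taken from the actual deployment
timeout = 120            # async workers keep heartbeating, so long downloads no longer trip WORKER TIMEOUT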

Initially I'll apply these configurations to all of D4S2, but we could fairly easily limit them to just downloads by booting up a second container.
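For context on why the worker is held for so long, here is a purely hypothetical illustration (not the actual D4S2 view) of a streaming zip download in Django: the zip is assembled and streamed chunk by chunk, so the response, and the worker serving it, stays busy for the entire transfer.

# Hypothetical illustration only - not the actual D4S2 code.
from django.http import StreamingHttpResponse

def iter_zip_chunks(project_id):
    # Placeholder generator: the real app pulls file data from DukeDS and zips it on the fly.
    for _ in range(3):
        yield b'zip-data-chunk'

def download_project_zip(request, project_id):
    response = StreamingHttpResponse(iter_zip_chunks(project_id), content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename="project.zip"'
    return response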

dleehr commented 5 years ago

After implementing these changes, I was able to get 70GB into a 190GB download. It ran for nearly 2 hours before I canceled it to avoid filling up my local disk :)