cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License

API queries timing out at 1 min in the proxy #1394

Closed: dshemetov closed this issue 2 months ago

dshemetov commented 4 months ago

This query from @dsweber2 produces a 504 gateway timeout error after ~~30~~ 60 seconds (even though timeout_seconds = 1600):

training_time <- epidatr::epirange(from = "2020-08-01", to = "2023-06-01")
fetch_args <- epidatr::fetch_args_list(return_empty = TRUE, timeout_seconds = 1600)
epidatr::pub_covidcast(
  source = "chng",
  signals = "smoothed_adj_outpatient_covid",
  geo_type = "state",
  time_type = "day",
  geo_values = "*",
  time_values = training_time,
  issues = "*",
  fetch_args = fetch_args
)

My guess is that the query hits the API row limit, mostly because splitting the query into two calls succeeds. However, in that case the API should at least return result: 2, not a 504 error, so I'm definitely missing some details here.
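For reference, the two-call split that works looks roughly like this (a sketch, not the exact calls I ran; the split date is arbitrary and the calls otherwise reuse fetch_args from above):

# Sketch of the two-call workaround: split the time range and stack the results.
# The split point (2021-12-31 / 2022-01-01) is chosen only for illustration.
first_half  <- epidatr::epirange(from = "2020-08-01", to = "2021-12-31")
second_half <- epidatr::epirange(from = "2022-01-01", to = "2023-06-01")
results <- lapply(list(first_half, second_half), function(tt) {
  epidatr::pub_covidcast(
    source = "chng",
    signals = "smoothed_adj_outpatient_covid",
    geo_type = "state",
    time_type = "day",
    geo_values = "*",
    time_values = tt,
    issues = "*",
    fetch_args = fetch_args
  )
})
combined <- do.call(rbind, results)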

melange396 commented 4 months ago

what is the full URI that it actually requests?

dshemetov commented 4 months ago

It is

https://api.delphi.cmu.edu/epidata/covidcast/?data_source=chng&signals=smoothed_adj_outpatient_covid&geo_type=state&time_type=day&geo_values=%2A&time_values=20200801-20230601&issues=%2A
melange396 commented 4 months ago

Are you sure this is happening at 30 seconds and not 60? I get a "504 Gateway Timeout" after [almost exactly but just slightly more than] a minute, and it comes from one of our proxies:

<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>openresty/1.21.4.1</center>
</body>
</html>

This is not a row limit problem -- when i tested it, the application server was still happily waiting for results from MySQL when the proxy ended its session with the requestor.

In the past, we made sure all of the pieces in the request->response chain (AWS LB, HAProxy, web server, application, MySQL) were set to a timeout of (at least?) 15 minutes (900 seconds) to accommodate long queries... It appears that has changed somewhere at some point! I did a pretty exhaustive search through our configs for a timeout set to 30 or 60 seconds and didn't come up with anything, so i think this is inheriting a default value.

@korlaxxalrok any ideas where this might be? i thought MAYBE we needed to duplicate those timeout config settings to here...?

@dshemetov FWIW, the timeout_seconds you specify has no effect on the server; it is a client-side thing that generates an event if it is still waiting on the server when that time limit is hit, and it shouldn't produce an HTTP response code.
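To illustrate: if that client-side limit is hit, you get an R error from the HTTP client on your end rather than any HTTP status from the server. A rough sketch, reusing training_time from your snippet above with a deliberately tiny timeout:

# Sketch: a client-side timeout surfaces as an R error (e.g. a curl
# "Timeout was reached" message), not as a 504 or any other HTTP response.
res <- tryCatch(
  epidatr::pub_covidcast(
    source = "chng",
    signals = "smoothed_adj_outpatient_covid",
    geo_type = "state",
    time_type = "day",
    geo_values = "*",
    time_values = training_time,
    issues = "*",
    fetch_args = epidatr::fetch_args_list(timeout_seconds = 5)  # deliberately short
  ),
  error = function(e) e
)
conditionMessage(res)  # client-side error message (assuming the call did not finish in 5s)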

dshemetov commented 4 months ago

Ah yea, you're right, David originally reported a ~1 minute timeout. And I figured that timeout_seconds was client-side. It's just surprising because I'm pretty sure I had example queries where changing timeout_seconds=120 to timeout_seconds=150 made a difference, but I don't understand why it doesn't wait for more than a minute here. I can dig those up if that would help.

Also curious how these gunicorn timeout settings fit in the chain. Is it just a dev container thing?

melange396 commented 4 months ago

Maybe those example queries were from a time when this 60 second limit didn't exist?

The gunicorn_conf.py setting definitely affects the dev environment, but i'm not 100% sure that it affects production... Either way, at 5 minutes, it's not a limit that is being reached in prod now, and we can follow up with that later when the proxy issue is resolved.

korlaxxalrok commented 4 months ago

I think we might be missing some proxy timeout settings in the Nginx/OpenResty proxies in AWS. I'll take a look.

We did try to set a minimum of 15 minutes across the board, but I think I missed this config when I set them up.
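For reference, the directives I have in mind look something like this (an illustrative sketch, not the actual config; upstream name and paths are hypothetical). Notably, when proxy_read_timeout is left unset, nginx/OpenResty falls back to a 60s default, which would line up with the ~1 minute cutoff reported above:

# Sketch of nginx/OpenResty proxy timeouts for long-running API requests.
# Location, upstream name, and values are illustrative only.
location /epidata/ {
    proxy_pass          http://epidata_backend;   # hypothetical upstream
    proxy_read_timeout  900s;   # idle time allowed while waiting for the upstream to send data
    proxy_send_timeout  900s;   # idle time allowed while sending the request upstream
    send_timeout        900s;   # idle time allowed while writing the response to the client
}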

dshemetov commented 4 months ago

Updates from meeting today:

melange396 commented 4 months ago

Thank you to @korlaxxalrok for quickly fixing this by hand, and then enshrining it in https://github.com/cmu-delphi/delphi-admin/pull/192!

@dshemetov are you certain that you saw >60s timeouts recently? The time limit in the proxy when this bug was filed should have been (as far as Brian and i can tell so far) the same as it had been for ~months. It should be noted that the 1 minute timeout only applies to "dead air" or a lack of network activity between the client and the webserver... For instance, if an SQL query took 55 seconds to execute and return to the webserver, but then took 125 seconds to transmit the gigabytes of results to the client before the socket was closed, it might look like a 3 minute transaction in total BUT it wouldn't have triggered a 60 second inactivity timeout.

dshemetov commented 4 months ago

Hm, I'm pretty sure I did. The Parquet size analysis I did was 28 days ago; I used this query there with a 5*60 second timeout, and the query took about 2 minutes. I doubt it falls into the situation you describe above - the uncompressed CSV of the query data is only ~44MB.

https://api.delphi.cmu.edu/epidata/covid_hosp_state_timeseries/?states=ca%2Cfl&dates=20210101-20220101&issues=20210101-20220101
# same query as above
r$> dates <- epirange("2021-01-01", "2022-01-01")
r$> out <- system.time({
      df <- pub_covid_hosp_state_timeseries(
        states = "ca,fl",
        dates = dates,
        issues = dates,
        fetch_args = fetch_args_list(timeout_seconds = 5 * 60)
      )
    })

r$> out
   user  system elapsed 
 33.332   0.776 114.457 
melange396 commented 2 months ago

[ NOTE: I just found some notes that i wrote up back when this issue was newer, though i've made some edits and added a bit... Apologies for the delay in replying and probable loss of context. ]

When i did some requests for the same covid_hosp_state_timeseries url above using curl on a command line w/ output to the terminal [ run times as of 2024-04-29 ], initially it took ~50s before it started returning data and 1m57s from start to finish, while a subsequent run of the same command took ~10s before starting to return data and 1m14s total. The second run is likely benefitting from disk caching on the backing db instance, as a third run had very similar results to the second. I also ran it yet again with curl redirected to /dev/null and then with curl outputting to a file, and still saw ~1.25 minute total times. They also showed that the payload was 566MB, which is non-trivial and explains the long-ish transmission times.

Importantly, even though all cases took longer than 60s in total, it was only between 10 and 50 seconds(ish) of waiting until the data started streaming -- meaning a significant portion of time was spent in actually transferring data. However, since the connection was only "idle" for between 10s-50s, it would not have triggered a 60 second server-side timeout while still allowing for a total time of longer than one minute.

Your timeout_seconds = 5 * 60 above presumably did not have any effect on the client or server, considering the elapsed time of ~114s is less than 5*60==300 seconds. Of note, it looks like in epidatr, timeout_seconds gets used with httr::timeout in do_request, and the docs for that say "Set maximum request time" and "number of seconds to wait for a response until giving up". I would guess that number corresponds to the time of the entire operation, since it is built using curl which uses the total time in its timeout option. I believe nearly all of the other timeouts in play (web app, web server, various proxies, and potentially even including the NAT timer on your local router/modem) are based on inactivity. (On the other hand, i also believe our db/mysql time limit settings include both inactivity timeouts and a query length time limit).

So... A 60 second timeout applied at a number of non-client-side places can cause an "idle connection" to be closed after 60 seconds, BUT can still result in a successful response that takes significantly longer than 60 seconds. Reproducing this is a little trickier now since we removed that 60s timeout and are now using the 15min timeout everywhere.
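If we ever need to check a breakdown like this again, libcurl's per-request timings (exposed through httr's response object) separate the pre-transfer wait from the total time. A rough sketch, assuming the $times field is populated as usual:

# Sketch: compare "dead air" before the first byte vs. total transfer time.
# $times comes from libcurl via httr; "starttransfer" is roughly time-to-first-byte.
library(httr)
url <- "https://api.delphi.cmu.edu/epidata/covid_hosp_state_timeseries/?states=ca%2Cfl&dates=20210101-20220101&issues=20210101-20220101"
resp <- GET(url, timeout(15 * 60))   # client-side cap on *total* request time
resp$times[c("starttransfer", "total")]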

dshemetov commented 2 months ago

Thanks for digging into that. Good to know that curl timeout is based on total operation, while the other timeouts are based on inactivity. Your explanation of what happened in my case sounds plausible. Hard to pursue further without being able to reproduce.

Since we fixed the main issue, I'll close this one and open another if we run into more timeout problems.