Closed davidski closed 3 years ago
Hi @davidski thanks for letting me know. Does this error happen with the DBI interface? Or is it just the dplyr interface?
I will have a little look at paws.database to see what the change could be and how to fix it :)
Thanks for the quick response! Yes, this problem occurs when making a DBI-style query as well. ☹️
@davidski how long did this query run before it timed out?
`noctua` uses the AWS services Athena, S3 and Glue, which correspond to `paws.analytics` and `paws.storage`. To find the most likely culprit for this error, I will break down the `dbGetQuery` call:

1. `paws.analytics::athena -> start_query_execution`. Start an AWS Athena query.
2. `paws.analytics::athena -> get_query_execution`. Get memory usage from Athena.
3. `paws.analytics::athena -> get_query_execution`. Get the Athena execution status.
4. `paws.analytics::athena -> get_query_results`. Get the Athena column classes so that they can be passed back to the file parsers.
5. `paws.storage::s3 -> get_object`. Get the Athena result.
6. `paws.analytics::athena -> get_query_execution`. Get the Athena execution status.
7. `paws.storage::s3 -> delete_object`. Remove the Athena result file from S3. Note: this is only called when the cache equals 0.

As the query statistics have been returned (`Info: (Data scanned: 14.09 MB)`), the most likely culprit is `dbFetch`. I believe `paws.storage::s3 -> get_object` is causing this issue, possibly because of a change in `paws.common`. I will update issue https://github.com/paws-r/paws/issues/371 accordingly.
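The call sequence above can be sketched roughly as follows. This is a simplified illustration, not noctua's actual code: `run_athena_query`, its arguments, and `keep_result` (standing in for the cache setting) are hypothetical names, and the repeated status calls are collapsed into one polling loop. The clients are passed in, so they can be real `paws` clients or stand-ins.

```r
# Hypothetical helper: `athena` and `s3` are client objects such as those
# returned by paws.analytics::athena() and paws.storage::s3().
run_athena_query <- function(athena, s3, query, s3_staging_dir, work_group,
                             poll_seconds = 1, keep_result = FALSE) {
  # Start the Athena query.
  started <- athena$start_query_execution(
    QueryString = query,
    ResultConfiguration = list(OutputLocation = s3_staging_dir),
    WorkGroup = work_group
  )
  id <- started$QueryExecutionId

  # Poll the execution status until the query reaches a terminal state.
  repeat {
    exec <- athena$get_query_execution(QueryExecutionId = id)$QueryExecution
    state <- exec$Status$State
    if (state %in% c("SUCCEEDED", "FAILED", "CANCELLED")) break
    Sys.sleep(poll_seconds)
  }
  if (state != "SUCCEEDED") stop("Athena query ended in state: ", state)
  message("Data scanned: ", exec$Statistics$DataScannedInBytes, " bytes")

  # Fetch column metadata for the file parser.
  meta <- athena$get_query_results(QueryExecutionId = id, MaxResults = 1L)
  columns <- meta$ResultSet$ResultSetMetadata$ColumnInfo

  # Download the result file from S3 (the step suspected of timing out).
  location <- sub("^s3://", "", exec$ResultConfiguration$OutputLocation)
  bucket <- sub("/.*$", "", location)
  key <- sub("^[^/]+/", "", location)
  obj <- s3$get_object(Bucket = bucket, Key = key)

  # Remove the result file unless the caller wants to keep it.
  if (!keep_result) s3$delete_object(Bucket = bucket, Key = key)

  data <- utils::read.csv(text = rawToChar(obj$Body), stringsAsFactors = FALSE)
  list(columns = columns, data = data)
}
```

With real clients this would be called as `run_athena_query(paws.analytics::athena(), paws.storage::s3(), ...)`. The whole result file arrives in a single `get_object` request, so a per-request timeout applies to the full download.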
Thanks for diagnosing the bug, which we inappropriately introduced as a default timeout in the last release of `paws.common`. Sorry about that. The latest version (0.3.8) with a fix (no timeout) is now on CRAN.
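For anyone checking a local or `renv`-managed library, here is a minimal sketch of a version check. `has_timeout_fix` is a hypothetical helper name; the only fact it encodes is that 0.3.8 is the release carrying the fix.

```r
# Hypothetical helper: TRUE when the given paws.common version is 0.3.8 or later.
has_timeout_fix <- function(version) {
  package_version(version) >= package_version("0.3.8")
}

# Only inspect the installed version if paws.common is actually available.
if (requireNamespace("paws.common", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("paws.common"))
  if (!has_timeout_fix(installed)) {
    message("paws.common ", installed, " predates the fix; consider updating from CRAN.")
  }
}
```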
@davidkretch thanks again. I will close this ticket.
Issue Description
Under `noctua` 1.10.0, going to `paws.database` 0.1.10 seems to cause curl timeouts when running queries through the dplyr interface.

(Semi-)Reproducible Example
Generating a clean reprex is tricky, but I have a local query managed under `renv` that reliably replicates the problem. Here is a redacted query (hitting an Apache log store in parquet format) that demonstrates the problem:

Error under paws.database 0.1.10
```r
> con <- dbConnect(noctua::athena(), profile_name = "REDACTED", region = "us-east-2",
+                  s3_staging_dir = 's3://REDACTED', work_group = "REDACTED")
> query <- str_glue("
+   SELECT date_parse(timestamp, '%d/%b/%Y:%H:%i:%s +0000') AS timestamp,
+          verb, request, response,
+          CAST(bytes as integer) AS bytes,
+          referrer, agent
+   FROM REDACTED")
> dat <- tbl(con, sql(query)) %>% collect()
Info: (Data scanned: 14.09 MB)
Error in curl::curl_fetch_memory(url, handle = handle):
  Timeout was reached: [REDACTED.s3.us-east-2.amazonaws.com] Operation timed out after 10000 milliseconds with 119006796 out of 316782524 bytes received
Request failed. Retrying in 0.7 seconds...
Error in curl::curl_fetch_memory(url, handle = handle):
  Timeout was reached: [REDACTED.s3.us-east-2.amazonaws.com] Operation timed out after 10000 milliseconds with 112822860 out of 316782524 bytes received
Request failed. Retrying in 2.2 seconds...
```

If left to run, the query goes through exponential back-off and eventually fails.
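A quick back-of-the-envelope check, using only the numbers in the log above, suggests why the retries cannot succeed: at the observed transfer rate the download needs well over the 10-second limit. The variable names are illustrative, and the calculation assumes the rate would stay roughly constant.

```r
# Figures taken from the curl error messages above.
timeout_s      <- 10000 / 1000               # timeout in seconds
total_bytes    <- 316782524                  # size of the result file
received_bytes <- c(119006796, 112822860)    # bytes received in each attempt

# Observed throughput per attempt, and the throughput needed to finish in time.
observed_rate <- received_bytes / timeout_s  # bytes per second
required_rate <- total_bytes / timeout_s     # bytes per second

# Estimated time to download the whole file at the observed rates.
est_time_s <- total_bytes / observed_rate

print(round(observed_rate / 1e6, 1))         # MB/s actually achieved
print(round(required_rate / 1e6, 1))         # MB/s needed within the timeout
print(round(est_time_s, 1))                  # seconds needed at observed rates
```

At these rates each attempt would need roughly 27 to 28 seconds, so every retry hits the same 10-second limit regardless of the back-off delay.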
Running the same query under `paws.database` 0.1.9 works without issue. `noctua` 1.9.1 also hits this problem, so this seems to be something in the interface with `paws` (or maybe even a problem with `paws` itself).

Really appreciate the package. If there's a better way to help debug this, please let me know!