DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/
Other
45 stars 5 forks source link

Weird query behavior: Successful query with HTTP 403 AccessDenied error #96

Closed mraess closed 4 years ago

mraess commented 4 years ago

Issue Description

I don't know if you remember, but not to long ago we discovered an issue in paws.common together and in an issue that had to do with pulling in profiles into the profile parameter into dbConnect(). Everything works fine even with AWS_PROFILE as env variables. I've been working with the updated packages and a new error prevents me from running a script on the server.

When I run a query, the data is pulled and written into an r object (see test_results below). However, I also get this error message:

Info: (Data scanned: 625.16 GB)
additional arguments ignored in warning()
Warning: AccessDenied (HTTP 403). Access Denied

Which is weird since the data are actually pulled and the connection to the database is successfully established. Any idea why this message is popping up?

As always, any insights would be great!

Reproducible Example

Example:

library(RJDBC)
library(glue)
library(tidyverse)
library(magrittr)

library(rlang)
library(noctua)
library(DBI)
library(paws.common)
library(paws)

con_3 <- dbConnect(noctua::athena(),
                   profile_name = "profile",
                   s3_staging_dir = "s3://aws-athena-query-results-XXXXXXXXXX-us-east-1/", 
                   region = "us-east-1")

test_query <- glue_sql("select event_date, count(x) as x_count from logs.x group by 1 limit 10;", .con = con_3)

# make sure all results are returned from athena

setMethod("dbGetQuery", signature(conn="JDBCConnection", statement="character"),  def=function(conn, statement, ...) {
  r <- dbSendQuery(con, test_query)
  on.exit(.jcall(r@stat, "V", "close"))
  if (conn@jc %instanceof% "com.amazonaws.athena.jdbc.AthenaConnection") fetch(r, -1, 999) # Athena can only pull 999 rows at a time
  else fetch(r, -1)
})

test_results <- dbGetQuery(conn = con_3, statement = test_query)
Session Info ```r > devtools::session_info() ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.5.3 (2019-03-11) os macOS 10.15.3 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2020-06-12 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.5.2) backports 1.1.7 2020-05-13 [1] CRAN (R 3.5.3) bit 1.1-15.1 2020-01-14 [2] CRAN (R 3.5.2) bit64 0.9-7 2017-05-08 [2] CRAN (R 3.5.0) blob 1.2.1 2020-01-20 [2] CRAN (R 3.5.2) broom 0.5.6 2020-04-20 [1] CRAN (R 3.5.3) callr 3.4.3 2020-03-28 [1] CRAN (R 3.5.3) cellranger 1.1.0 2016-07-27 [2] CRAN (R 3.5.0) cli 2.0.2 2020-02-28 [1] CRAN (R 3.5.2) colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2) crayon 1.3.4 2017-09-16 [2] CRAN (R 3.5.0) curl 4.3 2019-12-02 [1] CRAN (R 3.5.2) data.table 1.12.8 2019-12-09 [1] CRAN (R 3.5.2) DBI * 1.1.0 2019-12-15 [2] CRAN (R 3.5.2) dbplyr 1.4.4 2020-05-27 [1] CRAN (R 3.5.3) desc 1.2.0 2018-05-01 [2] CRAN (R 3.5.0) devtools 2.3.0 2020-04-10 [1] CRAN (R 3.5.3) digest 0.6.25 2020-02-23 [1] CRAN (R 3.5.2) dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.5.2) ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.5.3) fansi 0.4.1 2020-01-08 [2] CRAN (R 3.5.2) forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.5.2) fs 1.4.1 2020-04-04 [1] CRAN (R 3.5.3) generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.0) ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.5.2) glue * 1.4.1 2020-05-13 [1] CRAN (R 3.5.3) gt * 0.2.1 2020-05-23 [1] CRAN (R 3.5.3) gtable 0.3.0 2019-03-25 [2] CRAN (R 3.5.3) haven 2.3.0 2020-05-24 [1] CRAN (R 3.5.3) hms 0.5.3 2020-01-08 [1] CRAN (R 3.5.2) httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.2) ini 0.3.1 2018-05-20 [2] CRAN (R 3.5.0) jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.5.2) knitr 1.28 2020-02-06 [1] CRAN (R 3.5.2) lattice 0.20-38 2018-11-04 [2] CRAN (R 3.5.3) lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.5.2) lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.5.3) magrittr * 1.5 2014-11-22 [2] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [2] CRAN (R 3.5.0) modelr 0.1.8 2020-05-19 [1] CRAN (R 3.5.3) munsell 0.5.0 2018-06-12 [2] CRAN (R 3.5.0) nlme 3.1-143 2019-12-10 [2] CRAN (R 3.5.2) noctua * 1.7.0 2020-05-14 [1] CRAN (R 3.5.3) packrat 0.5.0 2018-11-14 [2] CRAN (R 3.5.0) paws * 0.1.8 2020-04-28 [1] CRAN (R 3.5.3) paws.analytics 0.1.8 2020-04-26 [1] CRAN (R 3.5.3) paws.common * 0.3.1 2020-04-26 [1] CRAN (R 3.5.3) paws.storage 0.1.8 2020-04-27 [1] CRAN (R 3.5.3) pillar 1.4.3 2019-12-20 [1] CRAN (R 3.5.2) pkgbuild 1.0.6 2019-10-09 [2] CRAN (R 3.5.2) pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 3.5.2) pkgload 1.0.2 2018-10-29 [2] CRAN (R 3.5.0) prettyunits 1.1.1 2020-01-24 [2] CRAN (R 3.5.2) processx 3.4.2 2020-02-09 [1] CRAN (R 3.5.2) ps 1.3.2 2020-02-13 [1] CRAN (R 3.5.2) purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.5.3) R6 2.4.1 2019-11-12 [1] CRAN (R 3.5.2) Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.5.3) readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.0) readxl 1.3.1 2019-03-13 [1] CRAN (R 3.5.2) remotes 2.1.1 2020-02-15 [1] CRAN (R 3.5.2) reprex 0.3.0 2019-05-16 [2] CRAN (R 3.5.2) rJava * 0.9-11 2019-03-29 [2] CRAN (R 3.5.2) RJDBC * 0.2-7.1 2018-04-16 [2] CRAN (R 3.5.0) rlang * 0.4.6 2020-05-02 [1] CRAN (R 3.5.3) RPostgreSQL 0.6-2 2017-06-24 [2] CRAN (R 3.5.0) rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.5.0) rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.5.2) rvest 0.3.5 2019-11-08 [1] CRAN (R 3.5.2) scales 1.1.0 2019-11-18 [2] CRAN (R 3.5.2) sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 3.5.0) stringi 1.4.6 2020-02-17 [1] CRAN (R 3.5.2) stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.2) testthat 2.3.2 2020-03-02 [1] CRAN (R 3.5.2) tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.5.3) tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.5.2) tidyselect 1.0.0 2020-01-27 [2] CRAN (R 3.5.3) tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.5.2) usethis 1.6.1 2020-04-29 [2] CRAN (R 3.5.3) vctrs 0.3.0 2020-05-11 [1] CRAN (R 3.5.3) withr 2.2.0 2020-04-20 [1] CRAN (R 3.5.3) xfun 0.12 2020-01-13 [1] CRAN (R 3.5.2) xml2 1.3.2 2020-04-23 [1] CRAN (R 3.5.3) ```
DyfanJones commented 4 years ago

Hi @mraess yes I know what is happening :) It is nothing to be worries about.

noctua tries to keep the s3_staging_dir "tidy". The way it does this, is it tries to delete the AWS Athena Query output. However if the user doesn't have permission to do this, it returns the error as a warning so that it doesn't prevent the user using AWS Athena.

If you don't like the warning message all the time. Try query caching :D

This won't "tidy" the s3_staging_dir. Plus you have the extra benefit of repeat querying, which is quicker and cheaper as it doesn't need to call AWS Athena again for the same query :D https://dyfanjones.github.io/noctua/articles/aws_athena_query_caching.html

library(DBI)
library(noctua)

con = dbConnect(athena())

noctua_options(cache_size = 10)

dbGetQuery(con, "select * from iris")

Sorry for any confusion it has caused. Side note noctua now supports returning query in chunks (just released I need to update the documentations :S )

library(DBI)
library(noctua)

con = dbConnect(athena())

noctua_options(cache_size = 10)

res <- dbExecuteQuery(con, "select * from iris")

athena_iris <- dbFetch(res, 3000)

Just a quick heads up, the chunk method isn't the fastest if you want to get the entire data frame. I recommend the standard dbGetQuery as that will pull the data straight from AWS S3 :D

DyfanJones commented 4 years ago

Please let me know if this resolves you issue.

P.s. thanks of such detailed ticket :D it was a great help understanding what was going on :D

DyfanJones commented 4 years ago

Just thinking out loud, this behaviour should really be documented

mraess commented 4 years ago

Hey @DyfanJones , Thanks for the swift response as always! I'll definitely give this a shot today to see if that resolves the issue and then I'll follow up with you.

mraess commented 4 years ago

@DyfanJones that did the trick! It's working now locally and on the server. We can close this out. Thanks for putting it into the documentation!

DyfanJones commented 4 years ago

@mraess That is perfect. I will leave this open for the time being so I can get some time to add the documentation

DyfanJones commented 4 years ago

closing ticket as PR #97 has been merged

ramnathv commented 2 years ago

Thanks for this package @DyfanJones. I ran into this error and turned caching on as per documentation. But now I run into a different error when querying Athena. Any thoughts? I can open a different issue if it is helpful.

Error in .set_ops_arg_check(x, y, all, .seqn = TRUE) : Item 2 of x is 'sql' but the corresponding item of y is 'character'.
DyfanJones commented 2 years ago

hi @ramnathv are you able to provide the code that produced the error so i can try to replicate it :D

DyfanJones commented 2 years ago

hi @ramnathv I have managed to replicate the error you are coming across I will raise another issue so I can keep track of it, thanks for point this issue out 😄 .