DyfanJones / RAthena

Connect R to Athena using Boto3 SDK (DBI Interface)
https://dyfanjones.github.io/RAthena/

"Error: Unable to locate credentials" when running `dbSendQuery` #98

Closed DavidArenburg closed 4 years ago

DavidArenburg commented 4 years ago

Hi, when running the following in Python, I get the credentials back

import boto3
session = boto3.Session(profile_name = 'default')
credentials = session.get_credentials().get_frozen_credentials()
credentials

But when trying to run the following query in R

library(DBI)
library(RAthena)

con <- dbConnect(RAthena::athena(), 
                 region_name = 'us-east-1',
                 profile_name = 'default',
                 s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")

I get

Error: The config profile (default) could not be found Traceback:

  1. dbConnect(RAthena::athena(), region_name = "us-east-1", profile_name = "default", . s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
  2. dbConnect(RAthena::athena(), region_name = "us-east-1", profile_name = "default", . s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
  3. .local(drv, ...)
  4. AthenaConnection(aws_access_key_id = aws_access_key_id, aws_secret_access_key = aws_secret_access_key, . aws_session_token = aws_session_token, schema_name = schema_name, . work_group = work_group, poll_interval = poll_interval, encryption_option = encryption_option, . kms_key = kms_key, s3_staging_dir = s3_staging_dir, region_name = region_name, . botocore_session = botocore_session, profile_name = profile_name, . aws_expiration = aws_expiration, ...)
  5. tryCatch(ptr <- boto$Session(aws_access_key_id = aws_access_key_id, . aws_secret_access_key = aws_secret_access_key, aws_session_token = aws_session_token, . region_name = region_name, botocore_session = botocore_session, . profile_name = profile_name, ...), error = function(e) py_error(e))
  6. tryCatchList(expr, classes, parentenv, handlers)
  7. tryCatchOne(expr, names, parentenv, handlers[[1L]])
  8. value[3L]
  9. py_error(e)
  10. stop(py_err$value, call. = F)

If I don't specify the profile, I get

Error: Unable to locate credentials Traceback:

  1. dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
  2. dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
  3. AthenaResult(conn = conn, statement = statement, s3_staging_dir = s3_staging_dir)
  4. tryCatch(response <- list(QueryExecutionId = do.call(Athena$start_query_execution, . Request, quote = T)$QueryExecutionId), error = function(e) py_error(e))
  5. tryCatchList(expr, classes, parentenv, handlers)
  6. tryCatchOne(expr, names, parentenv, handlers[[1L]])
  7. value[3L]
  8. py_error(e)
  9. stop(py_err$value, call. = F)

Just a note: I'm able to run this with `AWR.Athena::Athena()` successfully (it finds the credentials), e.g.

con <- dbConnect(AWR.Athena::Athena(), 
                   region = 'us-east-1', 
                   S3OutputLocation = paste(s3_bucket, s3_output, sep = "/"), 
                   Schema = 'default')
dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")

returns

<JDBCResult>

Thanks


aws configure list output

[screenshot: `aws configure list` output]


Session Info:

Session info ---------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz Asia/Jerusalem
date 2020-03-26

Packages -------------------------------------------------------------------
! package version date lib source
arrow 0.16.0.2 2020-02-14 [1] CRAN (R 3.5.3)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
AWR.Athena 2.0.7-0 2019-07-08 [1] CRAN (R 3.5.3)
aws.s3 0.3.20 2020-03-11 [1] CRAN (R 3.5.3)
aws.signature 0.5.2 2019-08-08 [1] CRAN (R 3.5.3)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.5.3)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.2)
bit 1.1-15.2 2020-02-10 [1] CRAN (R 3.5.3)
bit64 0.9-7 2017-05-08 [1] CRAN (R 3.5.3)
callr 3.4.1 2020-01-24 [1] CRAN (R 3.5.3)
cli 2.0.0 2019-12-09 [1] CRAN (R 3.5.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.3)
curl 4.3 2019-12-02 [1] CRAN (R 3.5.3)
data.table 1.12.8 2019-12-09 [1] CRAN (R 3.5.3)
DBI 1.1.0.9000 2020-03-04 [1] Github (r-dbi/DBI@500edd3)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.3)
devtools 2.2.2 2020-02-17 [1] CRAN (R 3.5.3)
digest 0.6.23 2019-11-23 [1] CRAN (R 3.5.3)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.5.3)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.5.3)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.3)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.5.3)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.3)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.5.3)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.3)
IRdisplay 0.7.0 2018-11-29 [1] CRAN (R 3.5.3)
IRkernel 1.1 2019-12-06 [1] CRAN (R 3.5.3)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.5.3)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.3)
pbdZMQ 0.3-3 2018-05-05 [1] CRAN (R 3.5.2)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.5.3)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.5.3)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.5.3)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.5.3)
ps 1.3.1 2020-02-12 [1] CRAN (R 3.5.1)
purrr 0.3.3 2019-10-18 [1] CRAN (R 3.5.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.5.3)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.5.3)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.5.3)
repr 1.0.2 2019-12-16 [1] CRAN (R 3.5.3)
D rJava 0.9-11 2019-03-29 [1] CRAN (R 3.5.3)
RJDBC 0.2-8 2020-03-10 [1] CRAN (R 3.5.3)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.5.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.3)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.3)
stringi 1.4.6 2020-02-17 [1] CRAN (R 3.5.3)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.5.3)
tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.5.3)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.5.3)
uuid 0.1-2 2015-07-28 [1] CRAN (R 3.5.2)
vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.5.3)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.3)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.5.3)

DyfanJones commented 4 years ago

Hi @DavidArenburg, sorry about that. I will have a look to see why it is failing.

DyfanJones commented 4 years ago

@DavidArenburg does this not give you your credentials?

library(DBI)
library(RAthena)

con <- dbConnect(RAthena::athena(), 
                 region_name = 'us-east-1',
                 profile_name = 'default',
                 s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))

# method to get to boto3 to check if credentials have been passed correctly
con@ptr$get_credentials()$get_frozen_credentials()

Or is it returning nothing?

DavidArenburg commented 4 years ago

It fails before on the dbConnect with the

Error: The config profile (default) could not be found

If I comment out `profile_name = 'default'`, I get

Error in eval(expr, envir, enclos): attempt to apply non-function

without even a traceback.

I suspect it's due to the fact that `cat ~/.aws/credentials` doesn't work on Windows.

DyfanJones commented 4 years ago

@DavidArenburg This is a strange error, as I pass the `profile_name` directly to `boto3.Session()`. What is even more unusual is that your Python example is working :S

Does the Python work without `profile_name` being set? For example:

import boto3
session = boto3.Session()
credentials = session.get_credentials().get_frozen_credentials()
credentials
DavidArenburg commented 4 years ago

Yes

[screenshot]

DyfanJones commented 4 years ago

@DavidArenburg are any aws environmental variables set?
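For reference, a quick way to check from Python which of the standard AWS environment variables are present (the variable names below follow AWS SDK/CLI conventions; the helper itself is just an illustrative sketch):

```python
import os

# Environment variables that boto3's credential chain consults
# (names per AWS SDK/CLI conventions).
AWS_ENV_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_PROFILE",
    "AWS_SHARED_CREDENTIALS_FILE",
    "AWS_CONFIG_FILE",
]

def aws_env_vars_set(environ=None):
    """Return the names of any AWS credential-related variables present."""
    if environ is None:
        environ = os.environ
    return [name for name in AWS_ENV_VARS if name in environ]

print(aws_env_vars_set())
```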

DyfanJones commented 4 years ago

@DavidArenburg

Do you mind checking if the reticulate can get boto3 credentials:

library(reticulate)
boto3 <- import("boto3")
session <- boto3$Session()
credentials <- session$get_credentials()$get_frozen_credentials()
credentials
DavidArenburg commented 4 years ago

No, it gives the

Error: attempt to apply non-function

I remember having similar issues with reticulate before, where it couldn't find Python at all (which doesn't make any sense to me). I solved it by specifying the `RETICULATE_PYTHON` environment variable.

I guess it's a bug related to reticulate then, rather than to your package. Though it's strange that it can't find the credentials even though I've set this variable.
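For reference, one way to pin reticulate to a specific interpreter is to set the variable in `~/.Renviron` (the path below is a placeholder; point it at whichever Python has boto3 installed):

```
RETICULATE_PYTHON=C:/Python37/python.exe
```

Restart the R session afterwards so reticulate picks it up.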

DyfanJones commented 4 years ago

Hmm, that is annoying if it is a reticulate issue. Can you try the environment variable method to see if it will connect to Athena for you?

DyfanJones commented 4 years ago

@DavidArenburg the alternative is to use noctua. It does exactly the same as RAthena, but it uses the R SDK for Amazon, paws, instead of boto3.

DavidArenburg commented 4 years ago

The reason I wanted to use your package is that I need the QueryExecutionId, and I think it's only available through boto3. The reason for this is that when you run `dbSendQuery` it creates a file in S3, and I don't know the name of that file; hence, I can't delete it afterwards. That creates a problem, because the S3 bucket accumulates an enormous amount of data that no one needs and that can't easily be removed.

I Googled this issue extensively and couldn't find anyone even addressing it for some reason (maybe I used the wrong keywords).

Eventually I saw that the file name of each output file is identical to the QueryExecutionId, so you can use it to find and delete the file.
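That idea can be sketched in plain boto3 terms. This is an illustrative helper, not RAthena's code; the `<QueryExecutionId>.csv` / `.csv.metadata` naming is an assumption based on Athena's default CSV output layout under the staging directory:

```python
from urllib.parse import urlparse

def athena_output_objects(s3_staging_dir, query_execution_id):
    """Given the staging dir (e.g. "s3://my-bucket/athena-out") and a
    QueryExecutionId, return the bucket and the two object keys Athena
    writes for a SELECT query: the CSV result and its .metadata companion.
    Assumes default CSV output named <QueryExecutionId>.csv."""
    parsed = urlparse(s3_staging_dir)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    csv_key = f"{prefix}/{query_execution_id}.csv" if prefix else f"{query_execution_id}.csv"
    return bucket, [csv_key, csv_key + ".metadata"]

def delete_athena_output(s3_staging_dir, query_execution_id):
    """Delete the query's output objects (requires AWS credentials)."""
    import boto3
    bucket, keys = athena_output_objects(s3_staging_dir, query_execution_id)
    s3 = boto3.client("s3")
    for key in keys:
        s3.delete_object(Bucket=bucket, Key=key)
```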

If you have some easier alternative method for me, I would appreciate it.

In the meantime, I guess I'll need to sort out the Python issue with the reticulate package devs once and for all.

Thanks

DyfanJones commented 4 years ago

@DavidArenburg Both RAthena and noctua return the QueryExecutionId. For example, noctua:

library(DBI)
library(noctua)

con <- dbConnect(athena())

res <- dbExecute(con, "select * from iris")

res@info$QueryExecutionId

RAthena:

library(DBI)
library(RAthena)

con <- dbConnect(athena())

res <- dbExecute(con, "select * from iris")

res@info$QueryExecutionId

They both offer a method to clear down the AWS S3 bucket for ad hoc queries:

con <- dbConnect(athena())

res <- dbExecute(con, "select * from iris")
dbFetch(res)
dbClearResult(res)

`dbClearResult` actually clears down the AWS S3 resource by default. If you want to keep the files, you can enable query caching with the function `*_options` (where `*` is the package name).

For RAthena you can set the environment variables instead of passing credentials to `dbConnect`.

You can set them by using `file.edit("~/.Renviron")`:

AWS_ACCESS_KEY_ID=<your key>
AWS_SECRET_ACCESS_KEY=<your secret>

Make sure you restart your R session so R can pick them up.

DavidArenburg commented 4 years ago

Thanks. You haven't specified the method they offer to clear down the AWS S3 bucket though :)

Also, there are other concerns, such as speed

DyfanJones commented 4 years ago

dbClearResult uses:

# remove metadata
s3$Object(s3_info$bucket, paste0(result_info$key, ".metadata"))$delete()

# remove athena output file
s3$Object(s3_info$bucket, result_info$key)$delete()
DavidArenburg commented 4 years ago

Interesting, I've used `dbClearResult` previously and it had no effect on the S3 bucket.

DyfanJones commented 4 years ago

@DavidArenburg was that from RAthena or AWR.Athena?

DavidArenburg commented 4 years ago

AWR.Athena, as I wasn't able to get RAthena to work

DyfanJones commented 4 years ago

@DavidArenburg, for speed I have moved a lot of the methods from calling AWS Athena directly to using AWS Glue. For example, listing tables is returned from AWS Glue instead of a query to AWS Athena; this cuts a 3 second Athena call down to a 0.3 second Glue call.

The packages utilise data.table to parse data into R; alternatively you can set the backend to vroom, which gives significant speed increases.

DyfanJones commented 4 years ago

@DavidArenburg AWR.Athena uses the JDBC driver for Athena. AWS Athena alone can't delete S3 files; these packages can due to the nature of how they connect. The SDKs allow extra flexibility and give access to the full range of AWS services, for example S3.

DavidArenburg commented 4 years ago

That's good to know. I guess I need to get your package to work then :)

DyfanJones commented 4 years ago

@DavidArenburg another note for speed: I have recently merged the latest PR #99. This latest addition uses ALTER TABLE when appending partitions to existing tables; before, it used MSCK REPAIR.

As ALTER TABLE only touches the partition you are adding, it will run a lot faster, especially when you are adding a partition to a highly partitioned table.
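To illustrate the difference: MSCK REPAIR TABLE rescans the whole table location, while a single partition can be registered with one targeted DDL statement. The helper below is a hypothetical sketch of building such a statement, not RAthena's actual code:

```python
def add_partition_sql(table, partition, location):
    """Build an ALTER TABLE statement registering one new partition.
    Illustrative helper only; RAthena generates equivalent DDL internally.

    partition: dict of partition column -> value, e.g. {"year": "2020"}.
    """
    spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )

print(add_partition_sql(
    "sales",
    {"year": "2020", "month": "03"},
    "s3://my-bucket/sales/year=2020/month=03/",
))
```

Because this DDL only registers the named partition, its cost is independent of how many partitions the table already has.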

DyfanJones commented 4 years ago

@DavidArenburg try the environmental variable method highlighted above. If you come into errors, please reply to this thread and I will help you get up and running.

Plus, if you notice it is running slower than the JDBC method, please raise another ticket and I will see what I can do to help with performance.

DavidArenburg commented 4 years ago

I've just checked the versions and they seem to match

[screenshots: version comparison]

DyfanJones commented 4 years ago

@DavidArenburg so reticulate is causing an issue with the same Python. I might raise a ticket with reticulate to see if they know anything that could help.

DavidArenburg commented 4 years ago

OK, please send me a link and I might add some info. I have a feeling that the package developers assume that everyone installs Python using Anaconda.

DyfanJones commented 4 years ago

@DavidArenburg I raised an issue on reticulate; please fill in any information I have missed: https://github.com/rstudio/reticulate/issues/744

In the meantime, let me know how setting the credentials in the environment variables goes.

DavidArenburg commented 4 years ago

I was able to use noctua successfully and it ran fast. Also, `dbClearResult` actually works, so this basically solves my problem. I'm closing this issue as it is clearly not related to your package. Let's hope the reticulate devs will add support for people who don't use Anaconda.

Thank you for all the help and guidance.